Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND...

Ghislain Fourny

Information Retrieval, Spring 2019

3. Term vocabulary

2

What we have seen so far

3

Warm up

a: 1 2 3 5 6 8 (6 entries)

b: 3 4 7 8 9 (5 entries)

c: 1 2 4 5 7 (5 entries)

d: 1 3 5 8 9 (5 entries)

e: 2 3 4 7 (4 entries)

f: 1 2 4 5 8 9 (6 entries)

g: 3 5 7 8 (4 entries)

4

Boolean retrieval

Input: set of documents

Query: lawyer AND Penang AND NOT silver

Output: subset of documents

7

Simple boolean language (EBNF)

PrimaryExpr = Term | "(" Expr ")"

NotExpr = [ "NOT" ] PrimaryExpr

AndExpr = NotExpr { "AND" NotExpr }

Expr = AndExpr { "OR" AndExpr }
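A minimal Python sketch of how a query in this small Boolean language could be evaluated over postings stored as sets. It assumes that NOT is optional and that AND and OR may repeat; the function name `evaluate`, the toy index and the document IDs are made up for illustration and are not part of the lecture.

```python
import re

def evaluate(query, index, all_docs):
    """Evaluate a Boolean query against a term -> set-of-doc-IDs index."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def primary():              # PrimaryExpr = Term | "(" Expr ")"
        tok = take()
        if tok == "(":
            result = expr()
            take()              # consume the closing ")"
            return result
        return set(index.get(tok.lower(), set()))

    def not_expr():             # NotExpr = [ "NOT" ] PrimaryExpr
        if peek() == "NOT":
            take()
            return all_docs - primary()
        return primary()

    def and_expr():             # AndExpr = NotExpr { "AND" NotExpr }
        result = not_expr()
        while peek() == "AND":
            take()
            result &= not_expr()
        return result

    def expr():                 # Expr = AndExpr { "OR" AndExpr }
        result = and_expr()
        while peek() == "OR":
            take()
            result |= and_expr()
        return result

    return expr()

# Hypothetical toy collection of five documents.
toy_index = {"lawyer": {1, 2, 4}, "penang": {2, 4, 5}, "silver": {4}}
all_docs = {1, 2, 3, 4, 5}
print(evaluate("lawyer AND Penang AND NOT silver", toy_index, all_docs))
```

One function per grammar rule keeps the operator precedence (NOT binds tighter than AND, AND tighter than OR) visible in the code.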

8

Model and abstraction

Document as a list of words (with duplicates)

Simplification: document as a set of words

Linearization: document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0)

11

Term-document model

Documents as lists of words (with duplicates)

Simplification: term-document bipartite graph

Representations: term-document bipartite graph, adjacency matrix, postings

Document, term and posting: each shown in the bipartite graph, the adjacency matrix and the postings

18

Index construction

19

Last time

Plenty of simplifying assumptions + white magic

20

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents: first challenge

Possess it merely That it should come to this

28

Collecting documents: sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u l d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u l d c o m e t o t h i s

30

Collecting documents: encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

01010000 01101111 01110011 01110011 (the first four characters: P o s s)

Here: ASCII

31

ASCII Table

     0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0    Control characters
1    Control characters
2    SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3    0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4    @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5    P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6    `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7    p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL

P = 50 (hex) = 0 1 0 1 0 0 0 0

34

UTF-8

Character   Codepoint   Codepoint in binary    UTF-8 (variable length)       Size
P           U+0050      101 0000               01010000                      up to 7 bits: 1 byte
π           U+03C0      11 1100 0000           11001111 10000000             up to 11 bits: 2 bytes
€           U+20AC      10 0000 1010 1100      11100010 10000010 10101100    up to 16 bits: 3 bytes
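The bit layout in the UTF-8 examples can be reproduced with a short Python sketch. The function name `utf8_encode` is an assumption for illustration; the sketch ignores invalid inputs such as surrogate codepoints, and in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(codepoint):
    """Encode one Unicode codepoint as UTF-8 bytes (simplified sketch)."""
    if codepoint < 0x80:            # up to 7 bits: 0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:           # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (codepoint >> 6),
                      0b10000000 | (codepoint & 0b111111)])
    if codepoint < 0x10000:         # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (codepoint >> 12),
                      0b10000000 | ((codepoint >> 6) & 0b111111),
                      0b10000000 | (codepoint & 0b111111)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (codepoint >> 18),
                  0b10000000 | ((codepoint >> 12) & 0b111111),
                  0b10000000 | ((codepoint >> 6) & 0b111111),
                  0b10000000 | (codepoint & 0b111111)])

def as_bits(data):
    """Render bytes as space-separated groups of 8 bits."""
    return " ".join(format(byte, "08b") for byte in data)

for char in "Pπ€":
    print(char, f"U+{ord(char):04X}", as_bits(utf8_encode(ord(char))))
```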

44

Collecting documents: first challenge

Character encodings: ASCII, UTF-8, ISO-Latin-1

Finding the encoding: user-defined, annotated in the document, or machine learning

Software-specific encoding

Data-specific encoding (for example &amp; and &lt;)

Binary encoding

Classification (e.g. ML): language, encoding (for example UTF-8), type

50

Collecting documents: second challenge

Fine-grained documents (example: e-mail archive)

Grouping into a single document (example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

58

Building an inverted index

Throw away punctuation: this is not trivial

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU
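To see why throwing away punctuation is not trivial, here is a small Python sketch of two naive strategies. Both function names are hypothetical, and neither strategy is presented as the recommended tokenizer; the point is only that each one mangles some of the corner cases.

```python
import re

def strip_punctuation(text):
    """Delete every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text).split()

def split_on_punctuation(text):
    """Treat every run of non-word characters as a token boundary."""
    return [token for token in re.split(r"\W+", text) if token]

# Deleting punctuation glues pieces together ...
print(strip_punctuation("Jake O'Neill isn't at name@example.com"))
# ... while splitting on it tears compounds and contractions apart.
print(split_on_punctuation("Hewlett-Packard isn't state-of-the-art"))
```

Neither output is obviously right: the e-mail address becomes one meaningless string in the first case, and "isn't" becomes "isn" and "t" in the second.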

60

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Chinese example sentence; the characters were lost in extraction]

68

Hindi

मुझे इन्फ़ॉर्मेशन रिट्रीवल बहुत पसंद है

मैं इस टर्म में अच्छा पढ़ रहा हूँ (male speaker: raha)

मैं इस टर्म में अच्छा पढ़ रही हूँ (female speaker: rahee)

Annotations: मुझे = "to me" (oblique case); मैं = "I"; है = 3rd person; हूँ = 1st person

70

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Glosses, in order: eat, go to, want, past <frustration>, inferential/evidential (turns out), 3rd/3rd, also

"Also, it turns out she/he wanted to go eat it, but"

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You | come | most | carefully | upon | your | hour

Take | thy | fair | hour | Laertes | time | be | thine

My | hour | is | almost | come

Possess | it | merely | That | it | should | come | to | this

Token = grouped sequence of characters

Type = equivalence class of tokens (same sequence of characters)
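The difference between tokens and types can be checked on the four example documents with a few lines of Python. This is a sketch that assumes plain whitespace tokenization and case-sensitive comparison; the variable names are illustrative.

```python
docs = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}

# Every occurrence is a token; identical character sequences form one type.
tokens = [token for text in docs.values() for token in text.split()]
types = set(tokens)

print(len(tokens), "tokens,", len(types), "types")
```

The repeated words (hour, come, it) account for the gap between the two counts.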

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters' list

a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
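A short Python sketch of stop-word removal with this 25-word list. It assumes tokens are compared in lowercase; the function name `remove_stop_words` is made up for illustration.

```python
REUTERS_STOP_WORDS = set(
    "a an and are as at be by for from has he in is it its of on "
    "that the to was were will with".split()
)

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop list."""
    return [token for token in tokens if token.lower() not in REUTERS_STOP_WORDS]

print(remove_stop_words("Possess it merely That it should come to this".split()))
```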

86

Trend: from a large list (200-300) to no list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing: a new document containing Lift is added to the postings lists of both Lift and Elevator (expansion)

Upon querying: the index is left unexpanded, and the query Lift is expanded to Lift OR Elevator

[Figure: postings lists for Lift and Elevator before and after document 6 is added]
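The two options can be compared in a small Python sketch. The synonym table, the document IDs and all function names are hypothetical; the sketch only illustrates that expanding at indexing time and expanding at query time return the same documents for this query, while storing different postings lists.

```python
# Hypothetical synonym table: each term maps to its expansion set.
SYNONYMS = {"lift": {"lift", "elevator"}, "elevator": {"lift", "elevator"}}

def index_plain(docs):
    """Index each token under itself only."""
    index = {}
    for doc_id, tokens in docs.items():
        for token in tokens:
            index.setdefault(token, set()).add(doc_id)
    return index

def index_with_expansion(docs):
    """Expansion upon indexing: file each document under every synonym."""
    index = {}
    for doc_id, tokens in docs.items():
        for token in tokens:
            for term in SYNONYMS.get(token, {token}):
                index.setdefault(term, set()).add(doc_id)
    return index

def query_with_expansion(index, term):
    """Expansion upon querying: rewrite 'lift' as 'lift OR elevator'."""
    result = set()
    for variant in SYNONYMS.get(term, {term}):
        result |= index.get(variant, set())
    return result

docs = {1: ["lift"], 4: ["elevator"], 5: ["lift"], 6: ["lift"]}
print(index_with_expansion(docs)["lift"])
print(query_with_expansion(index_plain(docs), "lift"))
```

Expanding upon indexing makes the postings lists longer but keeps queries simple; expanding upon querying keeps the index small and lets the synonym table change without reindexing.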

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich
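A possible Python sketch of this normalization. Stripping combining marks alone would turn the u-umlaut into a plain "u", so the sketch assumes a small German transliteration table (a hypothetical choice, as is the function name `fold_diacritics`) that is applied before the generic accent removal.

```python
import unicodedata

# Assumed language-specific rule: German umlauts become two letters.
GERMAN_TRANSLITERATION = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss",
}

def fold_diacritics(token):
    """Transliterate German umlauts, then strip all remaining accents."""
    token = unicodedata.normalize("NFC", token)        # compose first
    token = "".join(GERMAN_TRANSLITERATION.get(ch, ch) for ch in token)
    decomposed = unicodedata.normalize("NFD", token)   # split base + accent
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_diacritics("cliché"), fold_diacritics("Zürich"))
```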

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE         valenci -> valence
(m>0) ANCI -> ANCE         hesitanci -> hesitance
(m>0) IZER -> IZE          digitizer -> digitize
(m>0) ABLI -> ABLE         conformabli -> conformable
(m>0) ALLI -> AL           radicalli -> radical
(m>0) ENTLI -> ENT         differentli -> different
(m>0) ELI -> E             vileli -> vile
(m>0) OUSLI -> OUS         analogousli -> analogous
(m>0) IZATION -> IZE       vietnamization -> vietnamize
(m>0) ATION -> ATE         predication -> predicate
(m>0) ATOR -> ATE          operator -> operate
(m>0) ALISM -> AL          feudalism -> feudal
(m>0) IVENESS -> IVE       decisiveness -> decisive
(m>0) FULNESS -> FUL       hopefulness -> hopeful
(m>0) OUSNESS -> OUS       callousness -> callous
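The rules listed here can be tried out with a Python sketch. It covers only these fifteen rules, not the full Porter algorithm, and it assumes lowercase input that has already gone through the earlier stemming steps (which is why the examples end in "i" rather than "y"). The function names `measure` and `apply_rules` are illustrative.

```python
import re

RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]
# Try longer suffixes first so that "ization" wins over "ation".
RULES_BY_LENGTH = sorted(RULES, key=lambda rule: len(rule[0]), reverse=True)

def measure(stem):
    """Porter's m: the number of vowel-consonant sequences in the stem."""
    pattern = ""
    for i, ch in enumerate(stem):
        # "y" counts as a vowel only when it follows a consonant.
        if ch in "aeiou" or (ch == "y" and i > 0 and pattern[-1] == "C"):
            pattern += "V"
        else:
            pattern += "C"
    return len(re.findall(r"V+C+", pattern))

def apply_rules(word):
    """Apply the longest matching rule if the remaining stem has m > 0."""
    for suffix, replacement in RULES_BY_LENGTH:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

for word in ["valenci", "vietnamization", "operator", "nation"]:
    print(word, "->", apply_rules(word))
```

The condition (m>0) is what protects short words: "nation" ends in ATION, but the remaining stem "n" has measure 0, so the word is left alone.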

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

[Figure: on each side, the levels Words, Syntactic Structure and Semantic Structure lead up to the Interlingua at the top; the paths between the two sides are Direct, Syntactic Transfer and Semantic Transfer]

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help, and can even degrade performance, for English documents

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term        Document frequency   Postings
you         1                    1
come        3                    1 3 4
most        1                    1
carefully   1                    1
upon        1                    1
your        1                    1
thine       1                    2
be          1                    2
time        1                    2
Laertes     1                    2
thy         1                    2
fair        1                    2
take        1                    2
my          1                    3
hour        3                    1 2 3
is          1                    3
almost      1                    3
possess     1                    4
merely      1                    4
that        1                    4
it          1                    4
should      1                    4
to          1                    4
this        1                    4
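The postings for the four example documents can be rebuilt with a minimal Python sketch. It assumes whitespace tokenization and lowercasing as the only preprocessing, and the function name `build_inverted_index` is made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                      # visit documents in ID order
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)               # so each list stays sorted
    return dict(index)

docs = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}
index = build_inverted_index(docs)
for term in ("come", "hour", "it"):
    print(term, len(index[term]), index[term])
```

Using a set per document implements the "document as a set of words" simplification: "it" occurs twice in document 4 but yields a single posting.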

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
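As a reminder of how the intersection works, here is a minimal Python sketch of the linear merge over two sorted postings lists. The function name `intersect` is illustrative.

```python
def intersect(a, b):
    """Intersect two sorted postings lists by walking both with one pointer each."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:          # same document in both lists: keep it
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:         # advance the pointer at the smaller ID
            i += 1
        else:
            j += 1
    return result

list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.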

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B, built step by step: 1, then 1 3, then 1 3 10, then 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: jump directly to 10 in List A

Intersection of A and B: 1 3 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12, with skip pointers to 4, 7 and 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skips, where p is the number of postings

Example: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
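A Python sketch of intersection with skip pointers. It assumes evenly spaced skips of length about √p on each list (the spacing heuristic from the textbook) and simulates the pointers with index arithmetic instead of storing them; the function name is illustrative.

```python
import math

def intersect_with_skips(a, b):
    """Intersect two sorted lists, following a skip when it does not overshoot."""
    skip_a = max(1, math.isqrt(len(a)))   # skip length for list a
    skip_b = max(1, math.isqrt(len(b)))   # skip length for list b
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # A skip pointer lives at every skip_a-th position of a.
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a               # everything skipped is below b[j]
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result

list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

A skip is taken only when its target is still less than or equal to the other list's current ID, so no match can be jumped over.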

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

151

Bi-word indices

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Query: ETH Zurich

Query: Help ETH Zurich to flexibly react

Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Inconvenient

Query: Help ETH Zurich to flexibly react

Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
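The false positive can be reproduced with a small Python sketch of a biword index. The function names and the document IDs 1 and 2 are hypothetical; tokenization is plain whitespace splitting with lowercasing.

```python
def biwords(text):
    """Return all pairs of consecutive tokens, each joined into one term."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def build_biword_index(docs):
    """Map each biword to the set of documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index.setdefault(biword, set()).add(doc_id)
    return index

def biword_phrase_query(index, phrase):
    """Answer a phrase query as the AND of all its biwords."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "Help ETH Zurich to flexibly react to new challenges",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
print(biword_phrase_query(index, "Help ETH Zurich to flexibly react"))
```

Document 2 is returned although it does not contain the phrase: every biword of the query occurs somewhere in it, just not contiguously.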

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

Rewritten as: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

Rewritten as: Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Rewritten as: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Inconvenient

Size of the vocabulary increases exponentially with the phrase length n: (#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting = document ID (as before), term frequency, positions

Index:

Help: C, 1, [1]
ETH: C, 1, [2]
Zurich: C, 1, [3]
to: C, 3, [4, 7, 11]
flexibly: C, 1, [5]
react: C, 1, [6]

Query: ETH Zurich

ETH: C, 1, [2]
Zurich: C, 1, [3]

The position of Zurich is the position of ETH +1
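A Python sketch of a positional index and of phrase search on top of it. Positions are counted from 1 as in the example; the function names, the second document "D" and the dictionary layout are assumptions made for illustration. The term frequency is not stored separately because it equals the length of the position list.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, with positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split(), start=1):
            index[token].setdefault(doc_id, []).append(position)
    return index

def phrase_search(index, phrase):
    """Return documents where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    postings = [index.get(term, {}) for term in terms]
    candidates = set(postings[0]).intersection(*postings[1:])
    hits = set()
    for doc_id in candidates:
        for start in postings[0][doc_id]:
            # The k-th term of the phrase must sit at position start + k.
            if all(start + k in postings[k][doc_id] for k in range(1, len(terms))):
                hits.add(doc_id)
                break
    return hits

docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)
print(index["to"]["C"])
print(phrase_search(index, "Help ETH Zurich to flexibly react"))
```

Unlike the biword approach, document D is rejected here, because "flexibly" does not directly follow the first "to".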

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

2

What we have seen so far

3

Warm up

a

b

c

d

e

f

g

1 2 3 5 6 8

3 4 7 8 9

1 2 4 5 7

1 3 5 8 9

2 3 4 7

1 2 4 5 8 9

3 5 7 8

6

5

5

5

4

6

4

4

Boolean retrieval

InputSet of documents

5

Boolean retrieval

lawyer ANDPenang AND NOT silver

InputSet of documents

query

6

Boolean retrieval

lawyer ANDPenang AND NOT silver

InputSet of documents

OutputSubset of documents

query

7

Simple boolean language (EBNF)

PrimaryExpr = Term | ( Expr )

NotExpr = NOT PrimaryExpr

AndExpr = NotExpr (AND NotExpr)

Expr = NotExpr (OR AndExpr)

8

Model and abstraction

Document as a list of words(with duplicates)

9

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

10

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0) Linearization

11

Term-document model

Term-documentbipartite graph

Documents as lists of words(with duplicates)

Simplification

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Example sentence in Chinese; the characters did not survive extraction]

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

Morpheme by morpheme:

eat

go to

want

Past <Frustration>

inferential/evidential (turns out)

3rd/3rd

also

Translation: Also, it turns out she/he wanted to go eat it, but

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

82

Type

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Type = equivalence class of tokens (same character sequence)
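
The difference between tokens and types can be checked on the four example documents. This is a minimal sketch that assumes whitespace tokenization and case-sensitive comparison; the variable names are illustrative only.

```python
# The four example documents used throughout the deck.
docs = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}

# Tokens: every occurrence counts (assumption: split on whitespace only).
tokens = [tok for text in docs.values() for tok in text.split()]

# Types: identical character sequences collapse into one class.
types = set(tokens)

num_tokens = len(tokens)
num_types = len(types)
```

Words such as "hour", "come" and "it" occur several times as tokens but contribute a single type each.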

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters's list

a an and are as at be by for from has he in

is it its of on that the to was were will with

86

Trend

Large list (200-300)

No list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing

Query: Lift

Expansion: the postings lists of Lift and Elevator are both expanded to the merged list 1 4 5 6

Upon querying

Query: Lift

Expansion: Lift OR Elevator

The postings lists of Lift and Elevator stay separate
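
The two places to expand can be contrasted in a short sketch. The postings lists and the synonym table below are hypothetical (chosen so that the merged list is 1 4 5 6), and the function names are illustrative, not taken from any library.

```python
# Hypothetical synonym table and postings lists (assumptions for illustration).
SYNONYMS = {"lift": {"elevator"}, "elevator": {"lift"}}
index = {"lift": [1, 5, 6], "elevator": [1, 4, 6]}


def expand_index(index, synonyms):
    """Expansion upon indexing: store the merged postings under every synonym."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for syn in synonyms.get(term, ()):
            merged |= set(index.get(syn, []))
        expanded[term] = sorted(merged)
    return expanded


def query_with_expansion(index, synonyms, term):
    """Expansion upon querying: rewrite the query 'term' to 'term OR synonym ...'."""
    result = set(index.get(term, []))
    for syn in synonyms.get(term, ()):
        result |= set(index.get(syn, []))
    return sorted(result)


expanded_index = expand_index(index, SYNONYMS)
query_result = query_with_expansion(index, SYNONYMS, "lift")
```

Both variants return the same documents for the query Lift. Index-time expansion pays with larger postings lists, query-time expansion pays with an extra union at every query.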

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich
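
A minimal sketch of both mappings, assuming Python's standard `unicodedata` module. Stripping combining marks yields "Zurich", whereas the German convention shown on the slide (ü written as ue) needs an explicit table; the `GERMAN` table and the function names are illustrative assumptions.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Assumed transliteration table for German umlauts and sharp s.
GERMAN = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
    "\u00df": "ss",
}


def transliterate_german(text: str) -> str:
    """Replace umlauts by their two-letter spelling (ü becomes ue)."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(GERMAN.get(ch, ch) for ch in composed)


cliche = strip_diacritics("clich\u00e9")          # "cliché"
zurich = strip_diacritics("Z\u00fcrich")          # "Zürich"
zuerich = transliterate_german("Z\u00fcrich")
```

Which of the two normal forms is right depends on the language of the collection, which is one reason language identification matters when collecting documents.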

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the end of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI    -> ENCE    valenci        -> valence
(m>0) ANCI    -> ANCE    hesitanci      -> hesitance
(m>0) IZER    -> IZE     digitizer      -> digitize
(m>0) ABLI    -> ABLE    conformabli    -> conformable
(m>0) ALLI    -> AL      radicalli      -> radical
(m>0) ENTLI   -> ENT     differentli    -> different
(m>0) ELI     -> E       vileli         -> vile
(m>0) OUSLI   -> OUS     analogousli    -> analogous
(m>0) IZATION -> IZE     vietnamization -> vietnamize
(m>0) ATION   -> ATE     predication    -> predicate
(m>0) ATOR    -> ATE     operator       -> operate
(m>0) ALISM   -> AL      feudalism      -> feudal
(m>0) IVENESS -> IVE     decisiveness   -> decisive
(m>0) FULNESS -> FUL     hopefulness    -> hopeful
(m>0) OUSNESS -> OUS     callousness    -> callous
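
How one such rule group is applied can be sketched as follows. This covers only the rules listed above, not the full Porter stemmer, and the helper names are illustrative. It assumes Porter's usual definitions: m counts the vowel-consonant sequences in the stem, and y counts as a vowel when it follows a consonant.

```python
# Suffix rewrite rules from the list above: (suffix, replacement).
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word: str, i: int) -> bool:
    """a, e, i, o, u are vowels; y is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Number of vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def apply_rules(word: str) -> str:
    """Pick the longest matching suffix; rewrite it only if the stem has m > 0."""
    matches = [(s, r) for s, r in RULES if word.endswith(s)]
    if not matches:
        return word
    suffix, replacement = max(matches, key=lambda sr: len(sr[0]))
    stem = word[: -len(suffix)]
    return stem + replacement if measure(stem) > 0 else word
```

For instance, `apply_rules("vietnamization")` selects IZATION rather than the shorter ATION, and a short word such as "ration" is left alone because its stem "r" has m = 0.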

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois Triangle

[Figure: the Vauquois triangle. Levels on each side, from bottom to top: Words, Syntactic Structure, Semantic Structure. Apex: Interlingua. Horizontal links: Direct (between words), Syntactic Transfer, Semantic Transfer.]

102

Lemmatization

Full morphological analysis

computer
compute
computes
computed
computation
computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: postings (standard inverted index)

Term        Document frequency   Postings
you         1                    1
come        3                    1 3 4
most        1                    1
carefully   1                    1
upon        1                    1
your        1                    1
thine       1                    2
be          1                    2
time        1                    2
Laertes     1                    2
thy         1                    2
fair        1                    2
take        1                    2
my          1                    3
hour        3                    1 2 3
is          1                    3
almost      1                    3
possess     1                    4
merely      1                    4
that        1                    4
it          1                    4
should      1                    4
to          1                    4
this        1                    4
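
The postings above can be rebuilt with a few lines of code. This is a minimal sketch that assumes whitespace tokenization and lowercasing as the only normalization; `build_index` is an illustrative name.

```python
from collections import defaultdict

# The four example documents of the deck.
docs = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set(): a document is listed once per term, however often the term occurs.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


index = build_index(docs)
document_frequency = {term: len(postings) for term, postings in index.items()}
```

Processing the documents in increasing ID order keeps every postings list sorted, which the intersection algorithm relies on.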

107

Reminder: intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Both pointers advance until one list reaches its end

Intersection of A and B: 1 4 8 12
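
As a reminder, the two-pointer merge can be sketched as follows, assuming both postings lists are sorted in increasing order; `intersect` is an illustrative name.

```python
def intersect(a, b):
    """Intersect two sorted postings lists in O(len(a) + len(b)) comparisons."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])  # document contains both terms
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1               # advance the pointer on the smaller ID
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
common = intersect(list_a, list_b)
```

The loop stops as soon as either list is exhausted, since no further match is possible.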

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B (built step by step): 1, then 1 3, then 1 3 10, then 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: jump in List A directly to 10

Intersection of A and B: 1 3 10

127

In practice

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

4 7 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skips, where p = number of postings
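
A sketch of intersection with skip pointers, under the assumption that a list of p postings gets roughly √p evenly spaced skips. The skip table is stored in a dictionary here for readability (a real index would store the pointers inside the postings list), and the function names are illustrative.

```python
import math


def build_skips(postings):
    """Map a position to the position it may skip to, with step about sqrt(p)."""
    p = len(postings)
    step = math.isqrt(p)
    if step < 2:
        return {}  # list too short for skips to pay off
    return {i: i + step for i in range(0, p - step, step)}


def intersect_with_skips(a, b):
    """Two-pointer intersection that follows a skip when it does not overshoot."""
    skips_a, skips_b = build_skips(a), build_skips(b)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Skip only if the target is still <= the other list's current ID.
            if i in skips_a and a[skips_a[i]] <= b[j]:
                i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                j = skips_b[j]
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
common = intersect_with_skips(list_a, list_b)
```

On these lists, after matching 3 the pointer in list A jumps from 4 to 7 and then to 10 instead of visiting every posting in between.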

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Query: ETH Zurich

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect
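
A biword index and the corresponding phrase query can be sketched as follows. The document IDs are hypothetical, document 2 is the sentence the deck uses in its false-positive discussion, and the function names are illustrative.

```python
from collections import defaultdict

# Document 1 is the running example; document 2 is used to show a false positive.
docs = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}


def biwords(text):
    """All pairs of consecutive words, e.g. 'Help ETH', 'ETH Zurich', ..."""
    words = text.split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def build_biword_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index


def phrase_query(index, phrase):
    """AND together the postings of all biwords of the phrase."""
    result = None
    for bw in biwords(phrase):
        postings = index.get(bw, set())
        result = postings if result is None else result & postings
    return sorted(result or [])


index = build_biword_index(docs)
short_hits = phrase_query(index, "ETH Zurich")
long_hits = phrase_query(index, "Help ETH Zurich to flexibly react")
```

Document 2 is returned for the long query although it does not contain the phrase: every biword of the query occurs in it, just not contiguously.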

157

Drawback

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

Every biword of the query occurs in the document, but the phrase does not: false positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: N N N X N N X X X N N

Pattern: N X* N

Resulting biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Drawback

Size of the vocabulary increases exponentially with the phrase length n

(#Terms)^n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Positional posting = document ID (as before), term frequency, positions

Term       Document   Term frequency   Positions
Help       doc        1                1
ETH        doc        1                2
Zurich     doc        1                3
to         doc        3                4 7 11
flexibly   doc        1                5
react      doc        1                6

C

190

Positional index

Query: ETH Zurich

ETH: doc, term frequency 1, position 2

Zurich: doc, term frequency 1, position 3

Same document, and the position of Zurich is the position of ETH +1: match
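
A positional index and the +1 check can be sketched as follows. The document IDs are hypothetical, positions are counted from 1 as in the example, and the function names are illustrative.

```python
from collections import defaultdict

docs = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}


def build_positional_index(docs):
    """term -> {doc_id: [positions]}; the term frequency is the list length."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Documents in which the k-th query term occurs at start position + k."""
    terms = phrase.split()
    if not terms:
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            if all(start + k in index[t][doc_id] for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


index = build_positional_index(docs)
to_positions = index["to"][1]
short_hits = phrase_query(index, "ETH Zurich")
long_hits = phrase_query(index, "Help ETH Zurich to flexibly react")
```

Unlike the biword approach, the long query no longer matches document 2, because "flexibly" does not directly follow the "to" at position 4 there.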

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

3

Warm up

a

b

c

d

e

f

g

1 2 3 5 6 8

3 4 7 8 9

1 2 4 5 7

1 3 5 8 9

2 3 4 7

1 2 4 5 8 9

3 5 7 8

6

5

5

5

4

6

4

4

Boolean retrieval

InputSet of documents

5

Boolean retrieval

lawyer ANDPenang AND NOT silver

InputSet of documents

query

6

Boolean retrieval

lawyer ANDPenang AND NOT silver

InputSet of documents

OutputSubset of documents

query

7

Simple boolean language (EBNF)

PrimaryExpr = Term | ( Expr )

NotExpr = NOT PrimaryExpr

AndExpr = NotExpr (AND NotExpr)

Expr = NotExpr (OR AndExpr)

8

Model and abstraction

Document as a list of words(with duplicates)

9

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

10

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0) Linearization

11

Term-document model

Term-documentbipartite graph

Documents as lists of words(with duplicates)

Simplification

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
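
The two-pointer walk over both sorted lists can be sketched as follows. The function name `intersect` is chosen here for illustration; the lists are the ones on this slide.

```python
def intersect(a, b):
    """Intersect two sorted postings lists in O(len(a) + len(b)) steps."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])  # same document in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer on the smaller document ID
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))  # [1, 4, 8, 12]
```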

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: List A jumps from 3 directly to 10

Intersection of A and B after the jump: 1 3 10

127

In practice

Postings list: 1 2 3 4 5 6 7 8 9 10 12

Skip pointers: 4 7 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, but not many real opportunities to skip

In practice: √p skips, where p is the number of postings

Postings list: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
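
A minimal sketch of intersection with skip pointers, under the assumption that a list of p postings gets skip pointers at every position that is a multiple of ⌊√p⌋, each pointing ⌊√p⌋ entries ahead. The pointers are not stored explicitly here but computed from the index, which is a simplification; `intersect_with_skips` is an illustrative name.

```python
import math


def intersect_with_skips(a, b):
    """Intersect two sorted lists, following a skip whenever it cannot overshoot."""
    skip_a = max(1, math.isqrt(len(a)))  # skip length for list a
    skip_b = max(1, math.isqrt(len(b)))  # skip length for list b
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # A skip pointer sits at i if i is a multiple of skip_a.
            # Follow it only if its target is still <= b[j].
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))  # [1, 3, 10, 12]
```

On these lists, after matching 3 the pointer in List A moves 4 -> 7 -> 10 in two skips instead of visiting 4 to 9 one by one.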

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

Query: "ETH Zürich" (quotes = phrase search)

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Query: ETH Zurich

Query: Help ETH Zurich to flexibly react

Rewritten as: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect
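
The biword approach can be sketched as follows: every pair of consecutive words becomes an index key, and a longer phrase query is rewritten as the AND of its biwords. The two documents are sentences taken from the slides; the document IDs and function names are assumptions made for this sketch.

```python
from collections import defaultdict

# Two sentences from the slides; the IDs 1 and 2 are assigned here.
DOCS = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}


def biwords(text):
    """All pairs of consecutive words, each joined into one index key."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for key in biwords(text):
            index[key].add(doc_id)
    return index


def phrase_query(index, phrase):
    """AND together the postings of all biwords of the phrase."""
    keys = biwords(phrase)
    if not keys:
        return set()  # a single word has no biword to look up
    result = set(index.get(keys[0], set()))
    for key in keys[1:]:
        result &= index.get(key, set())
    return result


index = build_biword_index(DOCS)
print(phrase_query(index, "ETH Zurich"))
print(phrase_query(index, "react to new"))
```

Sets are used for the postings to keep the sketch short; with sorted lists the AND would be the merge intersection from the reminder slide.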

157

Drawback

Query: Help ETH Zurich to flexibly react

Rewritten as: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive: the document contains every biword of the query, but not the phrase itself
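
The false positive can be checked directly. The sketch below compares the biword test with an exact check on the token sequence; `contains_phrase` is a helper written for this illustration, not part of the indexing scheme.

```python
def biwords(text):
    """All pairs of consecutive words in the text."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def contains_phrase(text, phrase):
    """True if the words of the phrase occur consecutively in the text."""
    tokens = text.lower().split()
    query = phrase.lower().split()
    return any(
        tokens[i:i + len(query)] == query
        for i in range(len(tokens) - len(query) + 1)
    )


QUERY = "Help ETH Zurich to flexibly react"
DOC = "Help ETH Zurich to introduce techniques to flexibly react"

all_biwords_present = set(biwords(QUERY)) <= set(biwords(DOC))
is_real_match = contains_phrase(DOC, QUERY)

print(all_biwords_present)  # True: the biword index returns the document
print(is_real_match)        # False: the phrase does not occur in it
```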

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
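
A sketch of how the extended biwords follow from the tags. The tags are written out by hand here, following the slide (N for content words, X for function words); a real system would obtain them from a part-of-speech tagger, which is outside this sketch. The pattern used is the one on the slide: an N, any number of X, then an N.

```python
# Hand-assigned tags, as on the slide: N = content word, X = function word.
TAGGED = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]


def extended_biwords(tagged):
    """Pair each N word with the next N word, skipping any X words in between."""
    content_words = [token for token, tag in tagged if tag == "N"]
    return [f"{left} {right}" for left, right in zip(content_words, content_words[1:])]


for biword in extended_biwords(TAGGED):
    print(biword)
```

With only the two tags N and X, dropping the X words and pairing neighbours is equivalent to matching the pattern.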

172

Bi-word indices

Query: Help ETH Zurich to flexibly react

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Rewritten as: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

Rewritten as: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

Rewritten as: "Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Rewritten as: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative
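
The effect of longer index keys can be sketched by generalising biwords to k consecutive words. The function names are illustrative, and the check is done on a single document rather than on a full index to keep the example short.

```python
def kgrams(text, k):
    """All runs of k consecutive words, each joined into one key."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def candidate_match(doc, query, k):
    """True if every k-word key of the query also occurs in the document.
    Note: a query shorter than k words has no keys and trivially passes."""
    return set(kgrams(query, k)) <= set(kgrams(doc, k))


QUERY = "Help ETH Zurich to flexibly react"
DOC = "Help ETH Zurich to introduce techniques to flexibly react"

print(candidate_match(DOC, QUERY, 2))  # True: false positive with biwords
print(candidate_match(DOC, QUERY, 3))  # False: "zurich to flexibly" is missing
```

Longer keys only reduce false positives; a query longer than k words can still match a document that does not contain it.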

176

Drawback

Size of the vocabulary increases exponentially with the phrase length n: up to (number of terms)^n
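
To get a feel for the bound, the sketch below counts the distinct k-word keys in the example sentence and prints them next to the worst-case bound (number of terms)^k. A single sentence cannot show the growth itself, only how quickly the bound rises; the names are chosen for this illustration.

```python
def distinct_kgrams(text, k):
    """Set of all distinct runs of k consecutive words in the text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


TEXT = "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future"

vocabulary_size = len(distinct_kgrams(TEXT, 1))

for k in range(1, 5):
    observed = len(distinct_kgrams(TEXT, k))
    upper_bound = vocabulary_size ** k  # every combination of k terms
    print(f"k={k}: {observed} keys in this sentence, at most {upper_bound} possible")
```

In a large collection the number of keys actually observed moves towards this bound as more word combinations appear, which is what makes long phrase indices costly.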

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (term: document ID, term frequency, positions)

Help: C, 1, [1]

ETH: C, 1, [2]

Zurich: C, 1, [3]

to: C, 3, [4, 7, 11]

flexibly: C, 1, [5]

react: C, 1, [6]

Positional posting: document ID (as before), term frequency, positions
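
A minimal sketch of building such an index, assuming whitespace tokenization, lowercasing, and positions counted from 1 as on the slide. The nested-dictionary layout and the name `build_positional_index` are choices made for this sketch; the term frequency is simply the length of the position list.

```python
from collections import defaultdict

DOCS = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
}


def build_positional_index(docs):
    """Map term -> {document ID -> list of positions (1-based)}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


index = build_positional_index(DOCS)
for term in ["help", "eth", "zurich", "to", "flexibly", "react"]:
    for doc_id, positions in index[term].items():
        # term: document ID, term frequency, positions
        print(f"{term}: {doc_id}, {len(positions)}, {positions}")
```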

190

Positional index

Query: ETH Zurich

ETH: C, 1, [2]

Zurich: C, 1, [3]

Match: the position of Zurich is the position of ETH + 1
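
The "+1" check generalises to longer phrases: the k-th word of the query must occur k positions after the first one in the same document. A sketch under the same assumptions as before (whitespace tokens, lowercasing, 1-based positions); the second document and both IDs are added here for illustration.

```python
from collections import defaultdict

DOCS = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}


def build_positional_index(docs):
    """Map term -> {document ID -> list of positions (1-based)}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Documents in which the words of the phrase occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return set()
    # Step 1: documents that contain all the terms.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    # Step 2: check positions; word number k must sit at start + k.
    hits = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            if all(start + k in index[term][doc_id] for k, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


index = build_positional_index(DOCS)
print(phrase_query(index, "ETH Zurich"))
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

Unlike the biword index, the positional index rejects document D for the long query, because "flexibly" does not follow "Zurich to" at the required position.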

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Resulting index entries: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index entries (word triples): Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to, ...

Rewritten query: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect the postings lists of these triples.

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index entries (word quadruples): Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to, ...

Rewritten query: "Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect the postings lists of these quadruples.

175

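Improves on the false positive issue

Query: Help ETH Zurich to flexibly react

Rewritten query: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative: the document does not contain the triple "Zurich to flexibly".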

176

Drawback

The size of the vocabulary increases exponentially with the phrase length n: up to (number of terms)^n entries.

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Term        Document ID   Term frequency   Positions
Help        C             1                1
ETH         C             1                2
Zurich      C             1                3
to          C             3                4, 7, 11
flexibly    C             1                5
react       C             1                6
...

A positional posting consists of the document ID (as before), the term frequency, and the positions of the term in the document.

190

Positional index

Query: ETH Zurich

Term        Document ID   Term frequency   Positions
ETH         C             1                2
Zurich      C             1                3

Both terms occur in document C, and the position of Zurich (3) is the position of ETH (2) +1, so the phrase matches.
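The "+1" check can be sketched in a few lines of Python. The helper names (`build_positional_index`, `phrase_query`) and the plain whitespace tokenization are assumptions for illustration; positions start at 1 as on the slides, and the term frequency is simply the length of each position list.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Return the IDs of documents that contain the words of phrase consecutively."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return set()
    # Candidate start positions per document, taken from the first term.
    candidates = {doc: set(pos) for doc, pos in index[terms[0]].items()}
    for offset, term in enumerate(terms[1:], start=1):
        postings = index[term]
        # Keep a start position only if this term occurs exactly offset words later.
        candidates = {
            doc: {p for p in starts if p + offset in postings.get(doc, ())}
            for doc, starts in candidates.items()
        }
    return {doc for doc, starts in candidates.items() if starts}


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_positional_index(docs)
matches = phrase_query(index, "ETH Zurich")
```

Unlike a biword or phrase index, this keeps one vocabulary entry per term and handles phrases of any length, at the cost of larger postings.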

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents: first challenge

Possess it merely That it should come to this

Sequence of characters (taken from a character set):

P o s s e s s   i t   m e r e l y   T h a t   i t   s h o u l d   c o m e   t o   t h i s

Encoding (here ASCII):

P        o        s        s        ...
01010000 01101111 01110011 01110011 ...

31

ASCII Table

     0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
0    (control characters)
1    (control characters)
2    SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3    0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4    @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5    P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6    `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7    p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL

Example: P is in row 5, column 0, so its code is 50 (hex) = 0101 0000.

34

UTF-8

Character   Codepoint   Codepoint in binary   UTF-8 (variable length)
P           U+0050      101 0000              01010000
π           U+03C0      11 1100 0000          11001111 10000000
€           U+20AC      10 0000 1010 1100     11100010 10000010 10101100

At most 7 bits: 1 byte

At most 11 bits: 2 bytes

At most 16 bits: 3 bytes
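The three rows of the table can be checked with Python's built-in `str.encode`. The helper names `utf8_bits` and `codepoint` are introduced here for illustration only.

```python
def utf8_bits(char):
    """Return the UTF-8 encoding of one character as space-separated binary bytes."""
    return " ".join(f"{byte:08b}" for byte in char.encode("utf-8"))


def codepoint(char):
    """Return the Unicode codepoint of a character in U+XXXX notation."""
    return f"U+{ord(char):04X}"


for char in "Pπ€":
    print(char, codepoint(char), utf8_bits(char))
```

For characters in the ASCII range, such as P, the UTF-8 byte is identical to the ASCII byte, which is why ASCII text is also valid UTF-8.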

44

Collecting documents: first challenge

Which encoding? ASCII, UTF-8, ISO-Latin-1, ...

How the encoding can be determined: user-defined, annotated in the document, machine learning

Software-specific encoding

Data-specific encoding (for example &amp; and &lt;)

Binary encoding

Classification (e.g. ML) of language, encoding (for example UTF-8) and type

50

Collecting documents: second challenge

Fine-grained documents (example: e-mail archive)

Grouping into a single document (example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

This is not trivial:

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese


68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Glosses: eat, go to, want, past, <frustration>, inferential/evidential (turns out), 3rd/3rd, also

Translation: Also, it turns out she/he wanted to go eat it, but...

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class of tokens (same sequence of characters)
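The difference between tokens and types can be counted on the four example documents. This is a small sketch that assumes tokens are obtained by splitting on whitespace and that types are case-sensitive; both are simplifications chosen for illustration.

```python
docs = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# Tokens: every occurrence counts (whitespace splitting is an assumption).
tokens = [token for doc in docs for token in doc.split()]

# Types: distinct character sequences among the tokens.
types = set(tokens)

print(len(tokens), "tokens,", len(types), "types")
```

Repeated words such as "hour", "come" and "it" contribute several tokens each but only one type.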

84

Stop words

Reuters list:

a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with

Trend: from large lists (200-300 terms) to no list at all

87

Equivalence classes of terms (types)

to-day, today
USA, U.S.A.
window, windows, Windows
Zurich, Zürich, Zuerich, Züri

Normalization

U.S.A. -> USA
to-day -> today

Rules that remove characters are the easy part

Expansion (rather than equivalence classes)

Windows -> Windows
windows -> windows, Windows, window
window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings lists of Lift and Elevator are expanded so that each term also points to the documents of the other. The query Lift is looked up unchanged.

Upon querying: the index is left unchanged. The query Lift is expanded to Lift OR Elevator.

94

Normalization: accents and diacritics

cliché -> cliche
Zürich -> Zuerich

Normalization: lowercasing

CAT -> cat
Schweiz -> schweiz
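Both normalization steps can be sketched with the standard `unicodedata` module. The umlaut table `GERMAN_UMLAUTS` and the function names are assumptions for illustration: stripping diacritics alone turns "Zürich" into "Zurich", so mapping it to "Zuerich" needs an explicit, language-specific transliteration.

```python
import unicodedata

# Hypothetical transliteration table for German umlauts (an assumption of this
# sketch); it is what maps "Zürich" to "Zuerich" rather than to "Zurich".
GERMAN_UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue",
                                "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"})


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(token, transliterate_german=False):
    """Optionally transliterate umlauts, then strip diacritics and lowercase."""
    # Compose first so that the single-character umlaut keys can match.
    token = unicodedata.normalize("NFC", token)
    if transliterate_german:
        token = token.translate(GERMAN_UMLAUTS)
    return strip_diacritics(token).lower()


examples = [normalize("cliché"), normalize("Zürich", transliterate_german=True),
            normalize("CAT"), normalize("Schweiz")]
```

The same function has to be applied to documents at indexing time and to queries, otherwise normalized and unnormalized forms will not match.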

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the end of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer

(m>0) ENCI -> ENCE         valenci -> valence
(m>0) ANCI -> ANCE         hesitanci -> hesitance
(m>0) IZER -> IZE          digitizer -> digitize
(m>0) ABLI -> ABLE         conformabli -> conformable
(m>0) ALLI -> AL           radicalli -> radical
(m>0) ENTLI -> ENT         differentli -> different
(m>0) ELI -> E             vileli -> vile
(m>0) OUSLI -> OUS         analogousli -> analogous
(m>0) IZATION -> IZE       vietnamization -> vietnamize
(m>0) ATION -> ATE         predication -> predicate
(m>0) ATOR -> ATE          operator -> operate
(m>0) ALISM -> AL          feudalism -> feudal
(m>0) IVENESS -> IVE       decisiveness -> decisive
(m>0) FULNESS -> FUL       hopefulness -> hopeful
(m>0) OUSNESS -> OUS       callousness -> callous
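The condition (m>0) refers to Porter's measure m, the number of vowel-consonant sequences in the stem that remains once the suffix is removed. The sketch below applies only the rules listed above, choosing the longest matching suffix; it is not the complete Porter algorithm, and the names `RULES`, `measure` and `apply_rules` are chosen for illustration.

```python
RULES = [  # (suffix, replacement), as listed above
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number m of vowel-consonant sequences in a stem of the form [C](VC)^m[V]."""
    m = 0
    previous_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m


def apply_rules(word):
    """Apply the rule with the longest matching suffix, provided the stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word
```

For example, `apply_rules("vietnamization")` matches IZATION rather than ATION because the longest suffix is tried first, while `apply_rules("ration")` leaves the word unchanged because the remaining stem "r" has m = 0.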

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

The Vauquois Triangle

On both the source and the target side, from bottom to top: words, syntactic structure, semantic structure. The interlingua sits at the apex. Paths between the two sides: direct (words to words), syntactic transfer, semantic transfer.

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

Lemmatization or Stemming

Lemmatization does not help, and can even degrade performance, for English documents

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term         Document frequency   Postings
you          1                    1
come         3                    1, 3, 4
most         1                    1
carefully    1                    1
upon         1                    1
your         1                    1
hour         3                    1, 2, 3
take         1                    2
thy          1                    2
fair         1                    2
Laertes      1                    2
time         1                    2
be           1                    2
thine        1                    2
my           1                    3
is           1                    3
almost       1                    3
possess      1                    4
it           1                    4
merely       1                    4
that         1                    4
should       1                    4
to           1                    4
this         1                    4
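As a reminder of how such an index comes about, here is a minimal sketch that builds it from the four example documents. Whitespace tokenization and lowercasing of every token (including "Laertes") are simplifying assumptions, and `build_inverted_index` is a name chosen for illustration.

```python
from collections import defaultdict

docs = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # A set keeps one posting per document even if the term repeats.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return index


index = build_inverted_index(docs)
document_frequency = {term: len(postings) for term, postings in index.items()}
```

Processing the documents in increasing ID order keeps every postings list sorted, which the intersection algorithm relies on.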

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
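The intersection walks both sorted lists with one pointer each and always advances the pointer at the smaller document ID. A minimal sketch (the function name `intersect` is chosen for illustration):

```python
def intersect(a, b):
    """Intersect two sorted postings lists by walking both in step."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
common = intersect(list_a, list_b)
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.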

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

After 1 and 3 have been matched, the next posting in List B is 10. A magical pointer in List A that jumps directly to 10 avoids stepping through 4 to 9 one posting at a time.

Intersection of A and B so far: 1 3 10

127

In practice

List: 1 2 3 4 5 6 7 8 9 10 12, with skip pointers to 4, 7 and 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, but not many real opportunities to skip

In practice: √p skip pointers, where p is the number of postings
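The √p heuristic can be sketched by placing a skip pointer at every ⌊√p⌋-th position of each list. Computing the skip targets from array indices, instead of storing explicit pointers in the postings list, is a simplification assumed here, and `intersect_with_skips` is a name chosen for illustration.

```python
import math


def intersect_with_skips(a, b):
    """Intersect sorted lists a and b, with skips of about sqrt(length) postings."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Follow the skip pointer only if one starts here and does not overshoot.
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
common = intersect_with_skips(list_a, list_b)
```

On these lists the pointer in List A jumps from 4 to 7 and from 7 to 10 instead of visiting every posting in between; a skip is taken only when its target does not exceed the current posting of the other list, so no match can be missed.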

133

Phrase search

134

Phrase queries

With a posting list

"ETH Zürich": quotes = phrase search

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

Phrase search approaches: biword indices, positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index entries (biwords): Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to, ...

151

Bi-word indices

Query: ETH Zurich

The biword "ETH Zurich" is looked up directly among the index entries (Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to, ...).

Query: Help ETH Zurich to flexibly react

Rewritten query: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"


6

Boolean retrieval

lawyer ANDPenang AND NOT silver

InputSet of documents

OutputSubset of documents

query

7

Simple boolean language (EBNF)

PrimaryExpr = Term | ( Expr )

NotExpr = NOT PrimaryExpr

AndExpr = NotExpr (AND NotExpr)

Expr = NotExpr (OR AndExpr)

8

Model and abstraction

Document as a list of words(with duplicates)

9

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

10

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0) Linearization

11

Term-document model

Term-documentbipartite graph

Documents as lists of words(with duplicates)

Simplification

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character  Codepoint  Codepoint in binary  UTF-8 (variable length)
P          U+0050     101 0000             01010000
π          U+03C0     11 1100 0000         11001111 10000000
€          U+20AC     10 0000 1010 1100    11100010 10000010 10101100

Up to 7 bits: 1 byte

Up to 11 bits: 2 bytes

Up to 16 bits: 3 bytes
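The three rows of the UTF-8 slide can be checked with Python's built-in encoder. This is only a sketch for verification; the helper name `utf8_bits` is an assumption, not part of the lecture:

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 encoding of one character as space-separated bytes."""
    return " ".join(format(byte, "08b") for byte in ch.encode("utf-8"))


for ch in ["P", "\u03c0", "\u20ac"]:  # P, Greek small letter pi, euro sign
    # Codepoint in the usual U+XXXX notation, then the UTF-8 bytes in binary.
    print(f"U+{ord(ch):04X}", utf8_bits(ch))
```

The number of bytes grows with the number of bits the codepoint needs: one byte for P, two for π, three for the euro sign.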

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification (e.g., ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel-Stadt

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

(Chinese example sentence; the characters were lost in extraction)

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

78

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

eat, go to, want

Past, <Frustration>

inferential/evidential (turns out)

3rd/3rd

also

79

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Also, it turns out she/he wanted to go eat it, but ...

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

82

Type

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Type = equivalence class (same sequences)
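To make the token/type distinction concrete, here is a small sketch over the four example documents. It assumes a naive whitespace tokenizer and no normalization (so "That" and "that" would be different types); both choices are simplifications for illustration:

```python
documents = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def tokenize(text: str) -> list[str]:
    """Naive tokenizer: split on whitespace (punctuation already removed)."""
    return text.split()


# Tokens: every occurrence counts.
tokens = [tok for text in documents.values() for tok in tokenize(text)]
# Types: identical character sequences collapse into one class.
types = set(tokens)

print(len(tokens), "tokens,", len(types), "types")
```

"hour" contributes three tokens (documents 1, 2 and 3) but only one type.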

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters's list

a an and are as at be by for from has he in

is it its of on that the to was were will with
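Filtering with such a list is a one-liner. The sketch below uses the 25 words from the slide; lowercasing before the lookup is an assumption made here so that "That" is caught as well:

```python
STOP_WORDS = set(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop every token whose lowercased form is in the stop list."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]


print(remove_stop_words("Possess it merely That it should come to this".split()))
```

Note that "this" survives, because it is not on the list.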

86

Trend: from a large list (200-300) to no list at all

87

Equivalence classes of terms (types)

U.S.A., USA

to-day, today

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry
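The asymmetry is easy to see once the expansion is written as a lookup table. A minimal sketch, with the table copied from the slide and the function name `expand` chosen for illustration:

```python
EXPANSIONS = {
    "Windows": ["Windows"],
    "windows": ["windows", "Windows", "window"],
    "window": ["window", "windows"],
}


def expand(term: str) -> list[str]:
    """Return the terms to search for; unknown terms expand to themselves."""
    return EXPANSIONS.get(term, [term])


print(expand("windows"))  # also searches for the operating system
print(expand("Windows"))  # searches only for the capitalized form
```

With equivalence classes the relation would be symmetric; here "windows" reaches "Windows" but not the other way round.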

90

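When to expand

Upon indexing: the postings of Lift and Elevator are expanded into each other, so both terms point to the same documents (1 4 5 6) and the query Lift is left as is

Upon querying: the postings of Lift and Elevator stay separate and the query Lift is expanded to Lift OR Elevator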

94

Normalization accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization lowercasing

CAT -> cat

Schweiz -> schweiz
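Both normalization steps can be sketched with the standard library. The umlaut table is an assumption added here: plain Unicode decomposition would turn ü into u, whereas the slide maps Zürich to Zuerich, so German umlauts are transliterated first.

```python
import unicodedata

# Assumed German transliteration table (not from the slides).
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}


def strip_diacritics(term: str) -> str:
    """Transliterate umlauts, then drop all remaining combining marks."""
    term = unicodedata.normalize("NFC", term)  # make sure umlauts are single characters
    term = "".join(UMLAUTS.get(ch, ch) for ch in term)
    decomposed = unicodedata.normalize("NFD", term)  # e.g. é becomes e + combining accent
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(term: str) -> str:
    """Remove diacritics and lowercase."""
    return strip_diacritics(term).lower()


print(strip_diacritics("cliché"), strip_diacritics("Zürich"), normalize("CAT"))
```

Lowercasing everything is the blunt option; truecasing keeps the capital where it carries meaning.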

96

Normalization truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis feature are easily visible variations individual

analysi featur ar easili visibl variat individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI    -> ENCE    valenci        -> valence
(m>0) ANCI    -> ANCE    hesitanci      -> hesitance
(m>0) IZER    -> IZE     digitizer      -> digitize
(m>0) ABLI    -> ABLE    conformabli    -> conformable
(m>0) ALLI    -> AL      radicalli      -> radical
(m>0) ENTLI   -> ENT     differentli    -> different
(m>0) ELI     -> E       vileli         -> vile
(m>0) OUSLI   -> OUS     analogousli    -> analogous
(m>0) IZATION -> IZE     vietnamization -> vietnamize
(m>0) ATION   -> ATE     predication    -> predicate
(m>0) ATOR    -> ATE     operator       -> operate
(m>0) ALISM   -> AL      feudalism      -> feudal
(m>0) IVENESS -> IVE     decisiveness   -> decisive
(m>0) FULNESS -> FUL     hopefulness    -> hopeful
(m>0) OUSNESS -> OUS     callousness    -> callous
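The rules on this slide are one step of the Porter stemmer: a suffix is rewritten only if the remaining stem has measure m > 0, where m counts vowel-consonant sequences in the stem. The sketch below implements just these fifteen rules, not the full stemmer; the function names are chosen for illustration.

```python
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: y is a consonant unless it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Number m of vowel-consonant sequences in the form [C](VC)^m[V]."""
    pattern = []
    for i in range(len(stem)):
        kind = "c" if is_consonant(stem, i) else "v"
        if not pattern or pattern[-1] != kind:  # collapse runs of same kind
            pattern.append(kind)
    return "".join(pattern).count("vc")


def apply_rules(word: str) -> str:
    """Apply the longest matching suffix rule if the stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for example in ["valenci", "digitizer", "vietnamization", "hopefulness", "ration"]:
    print(example, "->", apply_rules(example))
```

The condition m > 0 is what protects short words: "ration" ends in ATION, but its stem "r" has m = 0, so it is left alone.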

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois Triangle

(Figure: a triangle with Words at the base, Syntactic Structure and Semantic Structure above them on each side, and Interlingua at the apex; the two sides are connected by Direct, Syntactic Transfer and Semantic Transfer.)

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

Term       Doc. frequency  Postings
you        1               1
come       3               1 3 4
most       1               1
carefully  1               1
upon       1               1
your       1               1
thine      1               2
be         1               2
time       1               2
Laertes    1               2
thy        1               2
fair       1               2
take       1               2
my         1               3
hour       3               1 2 3
is         1               3
almost     1               3
possess    1               4
merely     1               4
that       1               4
it         1               4
should     1               4
to         1               4
this       1               4
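The postings above can be rebuilt from the four example documents in a few lines. This is a sketch under simplifying assumptions: whitespace tokenization and lowercasing of every term (the slide keeps the capital in "Laertes").

```python
from collections import defaultdict

documents = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # ascending IDs keep every postings list sorted
        for term in set(docs[doc_id].lower().split()):  # a set: one posting per document
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(documents)
for term in ["come", "hour", "it"]:
    print(term, len(index[term]), index[term])
```

The document frequency is simply the length of the postings list, which is why it is cheap to store next to the term.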

107

Reminder Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
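As a reminder, the intersection walks both sorted lists with one pointer each and always advances the pointer on the smaller document ID. A minimal sketch (the function name is chosen here):

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted postings lists in O(len(a) + len(b))."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:      # same document in both lists: keep it
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:     # a is behind: advance a
            i += 1
        else:                 # b is behind: advance b
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

The loop stops as soon as one list is exhausted, since no further match is possible.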

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10

Magical pointer: jump directly to 10 in List A

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skips, where p = number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
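A sketch of the intersection with skip pointers, under assumptions made here for simplicity: each list is a plain Python list, and a skip pointer exists at every index that is a multiple of the skip length, which is the integer square root of the list length.

```python
import math


def intersect_with_skips(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted postings lists, using sqrt-spaced skip pointers."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Follow the skip only if its target does not overshoot b[j].
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

A skip is safe because everything it jumps over is smaller than its target, and the target is at most the ID being looked for. For a list of 16 postings this gives a skip length of 4.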

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react|

Intersect

157

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND Zurich to AND ETH Zurich AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive
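The false positive can be reproduced directly. In this sketch the biword index of a document is simply the set of its adjacent word pairs, and a phrase query is answered by checking that every biword of the query occurs in the document (function names are chosen for illustration):

```python
def biwords(text: str) -> set[str]:
    """Return the set of adjacent word pairs of a text."""
    words = text.split()
    return {f"{first} {second}" for first, second in zip(words, words[1:])}


def biword_match(query: str, document: str) -> bool:
    """AND of all query biwords: each one must occur somewhere in the document."""
    return biwords(query) <= biwords(document)


query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(sorted(biwords(query)))
print(biword_match(query, document))  # biword index says: match
print(query in document)              # exact phrase check says otherwise
```

All five biwords of the query occur in the document, just not contiguously, so the biword index reports a match that the exact phrase check rejects.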

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

N    N   N      X  N         N          X  X  X  N        N

N X* N

Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
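A sketch of how the extended biwords follow from the tags: words tagged X are dropped, and each remaining pair of neighbouring N words forms one biword (the NXN pattern on the slide, with any number of X in between). The tags are hard-coded from the slide; a real system would get them from a part-of-speech tagger.

```python
def extended_biwords(tagged: list[tuple[str, str]]) -> list[str]:
    """Pair each N word with the next N word, skipping X words in between."""
    content_words = [word for word, tag in tagged if tag == "N"]
    return [f"{a} {b}" for a, b in zip(content_words, content_words[1:])]


sentence = "Help ETH Zurich to introduce techniques so as to flexibly react".split()
tags = "N N N X N N X X X N N".split()  # tags as shown on the slide

print(extended_biwords(list(zip(sentence, tags))))
```

The function words no longer separate "Zurich" from "introduce" or "techniques" from "flexibly".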

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

Size of the vocabulary increases exponentially: (number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Positional posting = document ID (as before), term frequency, positions

Help      C, 1: 1
ETH       C, 1: 2
Zurich    C, 1: 3
to        C, 3: 4 7 11
flexibly  C, 1: 5
react     C, 1: 6

190

Positional index

Query: ETH Zurich|

ETH     C, 1: 2
Zurich  C, 1: 3

+1: Zurich must occur exactly one position after ETH (2 + 1 = 3)
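The "+1" check can be sketched as follows. The index maps each term to a dictionary from document ID to the list of positions, counted from 1 as on the slides; the term frequency is the length of that list. Function names and the two-word restriction are simplifications made here.

```python
from collections import defaultdict


def build_positional_index(docs: dict[str, str]) -> dict[str, dict[str, list[int]]]:
    """Map term -> document ID -> sorted list of positions (starting at 1)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, first: str, second: str) -> list[str]:
    """Documents in which `second` occurs exactly one position after `first`."""
    hits = []
    common_docs = index.get(first, {}).keys() & index.get(second, {}).keys()
    for doc_id in common_docs:
        second_positions = set(index[second][doc_id])
        if any(pos + 1 in second_positions for pos in index[first][doc_id]):
            hits.append(doc_id)
    return sorted(hits)


docs = {"C": "Help ETH Zurich to flexibly react to new challenges "
             "and to set new accents in the future"}
index = build_positional_index(docs)

print(index["to"]["C"])                      # positions of "to" in document C
print(phrase_query(index, "ETH", "Zurich"))  # ETH at 2, Zurich at 3
```

Unlike biword or longer phrase indices, the vocabulary stays the same size; the price is paid in longer postings.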

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

7

Simple boolean language (EBNF)

PrimaryExpr = Term | ( Expr )

NotExpr = NOT PrimaryExpr

AndExpr = NotExpr (AND NotExpr)

Expr = NotExpr (OR AndExpr)

8

Model and abstraction

Document as a list of words(with duplicates)

9

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

10

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0) Linearization

11

Term-document model

Term-documentbipartite graph

Documents as lists of words(with duplicates)

Simplification

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer

Rule                        Example
(m>0) ENCI    -> ENCE       valenci        -> valence
(m>0) ANCI    -> ANCE       hesitanci      -> hesitance
(m>0) IZER    -> IZE        digitizer      -> digitize
(m>0) ABLI    -> ABLE       conformabli    -> conformable
(m>0) ALLI    -> AL         radicalli      -> radical
(m>0) ENTLI   -> ENT        differentli    -> different
(m>0) ELI     -> E          vileli         -> vile
(m>0) OUSLI   -> OUS        analogousli    -> analogous
(m>0) IZATION -> IZE        vietnamization -> vietnamize
(m>0) ATION   -> ATE        predication    -> predicate
(m>0) ATOR    -> ATE        operator       -> operate
(m>0) ALISM   -> AL         feudalism      -> feudal
(m>0) IVENESS -> IVE        decisiveness   -> decisive
(m>0) FULNESS -> FUL        hopefulness    -> hopeful
(m>0) OUSNESS -> OUS        callousness    -> callous
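The rules above can be tried out with a short sketch. It covers only the rules listed on this slide, not the full Porter stemmer. The measure m counts vowel-consonant sequences in the stem, with y treated as a vowel when it follows a consonant, which is how Porter's definition is usually stated. The names `measure` and `apply_rules` are made up for this example.

```python
import re

# (suffix, replacement) pairs copied from the slide; every rule requires m > 0.
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]

def measure(stem):
    """Number m of vowel-consonant sequences in a stem of the form [C](VC)^m[V]."""
    kinds = []
    for i, ch in enumerate(stem):
        if ch in "aeiou":
            kinds.append("V")
        elif ch == "y" and i > 0 and kinds[i - 1] == "C":
            kinds.append("V")  # y after a consonant counts as a vowel
        else:
            kinds.append("C")
    return len(re.findall(r"V+C+", "".join(kinds)))

def apply_rules(word):
    """Apply the rule with the longest matching suffix, if its stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

for w in ["valenci", "vietnamization", "operator", "callousness", "ration"]:
    print(w, "->", apply_rules(w))
```

The last word shows the role of the condition: the stem of "ration" would be "r", whose measure is 0, so the word is left unchanged.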

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

[Figure: the Vauquois triangle. Levels on each side, from bottom to top: Words, Syntactic Structure, Semantic Structure; Interlingua at the top. Paths between the two sides: Direct, Syntactic Transfer, Semantic Transfer.]

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term       Document frequency   Postings
you        1                    1
come       3                    1 3 4
most       1                    1
carefully  1                    1
upon       1                    1
your       1                    1
hour       3                    1 2 3
thine      1                    2
be         1                    2
time       1                    2
Laertes    1                    2
thy        1                    2
fair       1                    2
take       1                    2
my         1                    3
is         1                    3
almost     1                    3
possess    1                    4
merely     1                    4
that       1                    4
it         1                    4
should     1                    4
to         1                    4
this       1                    4
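For reference, a standard inverted index over the four example documents of this lecture can be built in a few lines. This is a minimal sketch that assumes whitespace tokenization and lowercasing of every token (so Laertes is stored as laertes); the function name is illustrative.

```python
from collections import defaultdict

# The four example documents of this lecture.
DOCS = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}

def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            postings[token.lower()].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_inverted_index(DOCS)
print(index["come"])   # postings list of "come"
print(index["hour"])   # postings list of "hour"
```

The document frequency of a term is simply the length of its postings list.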

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
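The intersection of two sorted postings lists can be sketched as a linear merge with one pointer per list. The function name `intersect` is chosen for this example.

```python
def intersect(a, b):
    """Intersect two sorted postings lists by advancing two pointers."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])   # document appears in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                # advance the pointer on the smaller value
        else:
            j += 1
    return result

list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.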

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: after 3, jump in List A directly to 10 instead of stepping through 4 to 9

Intersection of A and B after the jump: 1 3 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip pointers to 4, 7 and 10

Short skips: many comparisons, waste of space

Long skips: few comparisons, not many real opportunities to skip

In practice: √p skip pointers, where p is the number of postings
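A sketch of the intersection with skip pointers follows. It assumes the postings are stored in arrays and that skip pointers are implicit: a position carries a skip when it is a multiple of the skip length, which is set to roughly the square root of the list length. The function name and this layout are assumptions made for the example.

```python
import math

def intersect_with_skips(a, b):
    """Intersect two sorted lists, following a skip only when it cannot miss a match."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Follow the skip if its target is still not beyond b[j].
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result

list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

On these lists the pointer in List A moves from 4 to 7 to 10 in two skips instead of six single steps.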

133

Phrase search

134

Phrase queries

"ETH Zürich"

Quotes = phrase search

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to
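Building such an index can be sketched as follows: every pair of consecutive tokens becomes one index entry. The sketch assumes whitespace tokenization and lowercasing, and the document ID 1 is made up for the example.

```python
from collections import defaultdict

def biwords(text):
    """List the pairs of consecutive tokens of a text, lowercased."""
    tokens = text.lower().split()
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]

def build_biword_index(docs):
    """Map each biword to the sorted list of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return {biword: sorted(ids) for biword, ids in index.items()}

docs = {1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future"}
index = build_biword_index(docs)
print(index["eth zurich"])   # a two-word phrase query is a single lookup
```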

151

Bi-word indices

Query: ETH Zurich (looked up directly as one index entry)

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Drawback

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive (every biword of the query occurs in the document, but the phrase itself does not)
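The false positive can be reproduced with a short check that compares the n-gram test against a true contiguous-phrase test. The function names are chosen for this sketch, and n = 2 corresponds to biwords.

```python
def ngrams(text, n):
    """List the sequences of n consecutive tokens of a text, lowercased."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_match(query, document, n=2):
    """What an n-gram index can tell: every n-gram of the query occurs in the document."""
    available = set(ngrams(document, n))
    return all(gram in available for gram in ngrams(query, n))

def phrase_match(query, document):
    """Ground truth: the query tokens occur contiguously in the document."""
    q, d = query.lower().split(), document.lower().split()
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"
print(ngram_match(query, document, n=2))   # biword test
print(phrase_match(query, document))       # actual phrase test
```

With n = 3 the same check rejects the document, because "zurich to flexibly" never occurs in it.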

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NX*N

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
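Given the N and X tags, the extended biwords can be derived mechanically by reading the pattern as: an N, any number of X, then an N. The sketch below takes the tags as hand-written input copied from the slide; it does not run an actual part-of-speech tagger, and the function name is illustrative.

```python
def extended_biwords(tagged):
    """Pair each N-tagged word with the next N-tagged word, skipping the X-tagged words between them."""
    content_words = [word for word, tag in tagged if tag == "N"]
    return list(zip(content_words, content_words[1:]))

# Tags copied by hand from the slide.
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
for pair in extended_biwords(tagged):
    print(" ".join(pair))
```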

172

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Drawback

Size of the vocabulary increases exponentially with the phrase length n: up to (number of terms)^n phrases

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Term       Document ID   Term frequency   Positions
Help       C             1                1
ETH        C             1                2
Zurich     C             1                3
to         C             3                4 7 11
flexibly   C             1                5
react      C             1                6

Positional posting = document ID (as before) + term frequency + positions
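A positional index of this shape can be built as sketched below. Positions are counted from 1 as on the slide, tokens are split on whitespace and kept in their original case, and the term frequency is the length of the position list. The nested-dictionary layout and the function name are choices made for this example.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {document ID: list of positions}, positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.split(), start=1):
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)

docs = {"C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future"}
index = build_positional_index(docs)
print(index["to"]["C"])        # positions of "to" in document C
print(len(index["to"]["C"]))   # term frequency of "to" in document C
```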

190

Positional index

Query: ETH Zurich

ETH       C   1   2

Zurich    C   1   3

+1: Zurich occurs in document C at the position right after ETH (3 = 2 + 1), so C matches the phrase
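The +1 check generalizes to phrases of any length: a document matches when some start position p exists such that the k-th term of the phrase occurs at position p + k. A minimal sketch, using the positional postings from the slide and a hypothetical function name:

```python
def phrase_query(index, phrase):
    """Return the IDs of documents in which the terms of the phrase occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    matches = []
    for doc_id in sorted(candidates):
        starts = index[terms[0]][doc_id]
        # The k-th term has to occur exactly k positions after the start.
        if any(all(p + k in index[t][doc_id] for k, t in enumerate(terms)) for p in starts):
            matches.append(doc_id)
    return matches

# Positional postings copied from the slide: term -> {document ID: positions}.
index = {
    "Help": {"C": [1]}, "ETH": {"C": [2]}, "Zurich": {"C": [3]},
    "to": {"C": [4, 7, 11]}, "flexibly": {"C": [5]}, "react": {"C": [6]},
}
print(phrase_query(index, "ETH Zurich"))
print(phrase_query(index, "Zurich ETH"))
```

Unlike the biword approach, this never returns a false positive, at the price of storing the positions in every posting.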

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect
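The bi-word lookup can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function names, the document IDs 1 and 2, and whitespace tokenization with lowercasing are assumptions. The second document is the sentence used on the following slides to show the false positive.

```python
from collections import defaultdict

def biwords(text):
    """Return the consecutive word pairs of a text, lowercased."""
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def build_biword_index(docs):
    """Map each bi-word to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index

def phrase_query(index, phrase):
    """Split the phrase into bi-words and intersect their postings (AND)."""
    postings = [index.get(bw, set()) for bw in biwords(phrase)]
    return set.intersection(*postings) if postings else set()

# Hypothetical two-document collection.
docs = {
    1: "Help ETH Zurich to flexibly react to new challenges",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
print(phrase_query(index, "ETH Zurich"))
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

Document 2 is returned for the long query even though it does not contain the phrase, because every bi-word of the query occurs somewhere in it.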

157

Drawback

Query: "Help ETH Zurich to flexibly react"

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NX*N

Extended bi-words: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
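A minimal sketch of how the extended bi-words could be derived once every word carries a tag. It assumes the tagging is already given (no tagger is run here) and that N marks the words to keep and X the words to skip; the function name is hypothetical.

```python
def extended_biwords(tagged):
    """Pair each N word with the next N word, skipping any X words in between."""
    content = [word for word, tag in tagged if tag == "N"]
    return [f"{a} {b}" for a, b in zip(content, content[1:])]

# Tag assignment assumed from the slide: seven N words, four X words.
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
for bw in extended_biwords(tagged):
    print(bw)
```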

172

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

True negative

176

Drawback

Size of the vocabulary increases exponentially

(#Terms)^n
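To make the growth concrete, the sketch below counts the distinct n-word terms of the example sentence and prints the worst-case bound next to them. It is only an illustration: the helper name is hypothetical, and a single sentence stays far below the bound, which only a large collection could approach.

```python
def nword_terms(text, n):
    """Return the set of distinct n-word terms (consecutive word sequences)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

text = ("Help ETH Zurich to flexibly react to new challenges "
        "and to set new accents in the future")
vocabulary_size = len(nword_terms(text, 1))

for n in (1, 2, 3, 4):
    observed = len(nword_terms(text, n))
    bound = vocabulary_size ** n  # (#Terms)^n possible n-word terms
    print(n, observed, bound)
```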

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Positional posting = document ID (as before), term frequency, positions

Help → C, 1: 1

ETH → C, 1: 2

Zurich → C, 1: 3

to → C, 3: 4, 7, 11

flexibly → C, 1: 5

react → C, 1: 6

Query: "ETH Zurich"

ETH → C, 1: 2

Zurich → C, 1: 3

Match in document C: Zurich occurs at the position of ETH + 1
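A positional index and the "+1" check can be sketched as follows. This is a minimal illustration under assumptions: positions are counted from 1, tokenization is a lowercase whitespace split, the term frequency is simply the length of the position list, and the second document "D" is added here only to show a non-match.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions]}, positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def phrase_query(index, phrase):
    """Return the documents in which the phrase words occur at consecutive positions."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    candidates = set.intersection(*(set(index[w]) for w in words))
    result = set()
    for doc_id in candidates:
        for start in index[words[0]][doc_id]:
            # The i-th word of the phrase must sit at position start + i.
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                result.add(doc_id)
                break
    return result

docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",  # hypothetical
}
index = build_positional_index(docs)
print(index["to"]["C"])
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

Unlike the bi-word index, this lookup does not return document D for the long query, at the cost of storing every position.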

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

9

Model and abstraction

Document as a list of words (with duplicates)

Simplification

Document as a set of words

Linearization

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0)

11

Term-document model

Documents as lists of words (with duplicates)

Simplification

Term-document bipartite graph

Adjacency matrix

Postings

Concepts: Document, Term, Posting

18

Index construction

19

Last time

Plenty of simplifying assumptions + white magic

20

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents: first challenge

Possess it merely That it should come to this

28

Collecting documents: sequence of characters (character set)

P o s s e s s   i t   m e r e l y   T h a t   i t   s h o u l d   c o m e   t o   t h i s

30

Collecting documents: encoding

P o s s → 01010000 01101111 01110011 01110011

Here: ASCII

31

ASCII Table

     0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0    Control characters
1    Control characters
2    SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3    0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4    @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5    P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6    `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7    p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL

P = 50 (hex) = 0101 0000
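The table lookup can be reproduced with Python's built-in ASCII codec. A small sketch; the helper name is hypothetical.

```python
def ascii_bits(text):
    """Encode text as ASCII and return one 8-bit string per character."""
    return [format(byte, "08b") for byte in text.encode("ascii")]

print(hex(ord("P")))       # row 5, column 0 of the table
print(ascii_bits("Poss"))  # first four characters of "Possess"
```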

34

UTF-8

Character   Codepoint   Codepoint in binary    Size               UTF-8 (variable length)
P           U+0050      101 0000               at most 7 bits     01010000
π           U+03C0      11 1100 0000           at most 11 bits    11001111 10000000
€           U+20AC      10 0000 1010 1100      at most 16 bits    11100010 10000010 10101100
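The three cases of the table (1, 2 and 3 bytes) can be sketched by hand and compared with Python's own encoder. The function below is an illustrative sketch only: it does not handle code points above 16 bits or surrogates, which a real encoder must.

```python
def utf8_encode(codepoint):
    """Encode one code point of at most 16 bits into UTF-8 bytes."""
    if codepoint < 0x80:        # at most 7 bits: 0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:       # at most 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (codepoint >> 6),
                      0b10000000 | (codepoint & 0b111111)])
    if codepoint < 0x10000:     # at most 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (codepoint >> 12),
                      0b10000000 | ((codepoint >> 6) & 0b111111),
                      0b10000000 | (codepoint & 0b111111)])
    raise ValueError("code points above 16 bits are not handled in this sketch")

for ch in "Pπ€":
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")  # cross-check with the built-in codec
    print(ch, f"U+{ord(ch):04X}", " ".join(format(b, "08b") for b in encoded))
```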

44

Collecting documents: first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents: first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents: first challenge

Software-specific encoding

47

Collecting documents: first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents: first challenge

Binary encoding

49

Collecting documents: first challenge

Classification (e.g., ML)

Language

Encoding

Type

UTF-8

50

Collecting documents: second challenge

Fine-grained documents (Example: e-mail archive)

Grouping into a single document (Example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation
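A minimal sketch of building the inverted index for the four documents. The punctuation in the strings below is an assumption (the extracted slide text carries none), and the tokenizer simply lowercases and keeps runs of letters, which is far cruder than what the following slides discuss.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and keep runs of letters, throwing away punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(tokenize(docs[doc_id]))):
            index[term].append(doc_id)
    return dict(index)

docs = {
    1: "You come most carefully upon your hour.",
    2: "Take thy fair hour, Laertes; time be thine.",
    3: "My hour is almost come.",
    4: "Possess it merely. That it should come to this!",
}
index = build_index(docs)
print(index["come"], index["hour"])
```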

58

Building an inverted index

Throw away punctuation

This is not trivial

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

(Chinese example sentence; the characters are not recoverable from the extracted text)

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Parts, in order: eat, go to, want, Past, <Frustration>, inferential/evidential (turns out), 3rd/3rd, also

Also, it turns out she/he wanted to go eat it, but...

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class (same sequences)
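The distinction between tokens and types can be checked on the fourth document. A small sketch, assuming whitespace tokenization and no normalization (so "That" and "that" would be different types).

```python
from collections import Counter

tokens = "Possess it merely That it should come to this".split()
types = Counter(tokens)  # one entry per distinct character sequence

print(len(tokens), "tokens")
print(len(types), "types")
print(types["it"], "tokens of type 'it'")
```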

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters's list

a an and are as at be by for from has he in is it its of on that the to was were will with
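Stop word removal is a simple filter. The sketch below uses the 25 words of the list above as read from the slide; the function name is hypothetical and matching is case-insensitive by assumption.

```python
STOP_WORDS = set(
    "a an and are as at be by for from has he in is it its of on "
    "that the to was were will with".split()
)

def remove_stop_words(tokens):
    """Drop every token whose lowercased form is in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("Possess it merely That it should come to this".split()))
print(remove_stop_words("My hour is almost come".split()))
```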

86

Trend

Large list (200-300) → No list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. → USA

to-day → today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows → Windows

windows → windows, Windows, window

window → window, windows

Introduces asymmetry

90

When to expand?

Upon indexing: the postings lists of Lift and Elevator are expanded so that each also contains the other's documents. The query Lift is then looked up as is.

Upon querying: the postings lists of Lift and Elevator stay as they are. The query Lift is expanded to Lift OR Elevator.
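Both options can be sketched side by side. The postings and the synonym table below are hypothetical values chosen for illustration, and so are the function names.

```python
# Hypothetical postings and synonym table.
index = {"lift": {1, 5, 6}, "elevator": {4}}
SYNONYMS = {"lift": {"elevator"}, "elevator": {"lift"}}

def expand(term):
    """Return the term together with its synonyms."""
    return {term} | SYNONYMS.get(term, set())

def search_expanding_query(index, term):
    """Expansion upon querying: term becomes (term OR synonym OR ...)."""
    result = set()
    for t in expand(term):
        result |= index.get(t, set())
    return result

def expand_index(index):
    """Expansion upon indexing: each postings list absorbs those of its synonyms."""
    return {term: set().union(*(index.get(t, set()) for t in expand(term)))
            for term in index}

print(search_expanding_query(index, "lift"))
print(expand_index(index)["lift"])
```

Both routes return the same documents; the first pays at query time, the second pays in index size.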

94

Normalization: accents and diacritics

cliché → cliche

Zürich → Zuerich

95

Normalization: lowercasing

CAT → cat

Schweiz → schweiz
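A minimal sketch of both normalizations using Python's standard `unicodedata` module. Decomposing and dropping combining marks turns "Zürich" into "Zurich"; the German transliteration to "Zuerich" would need an extra language-specific rule, which is not shown here. The function names are hypothetical.

```python
import unicodedata

def strip_diacritics(token):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(token):
    """Remove diacritics, then lowercase."""
    return strip_diacritics(token).lower()

for token in ["cliché", "Zürich", "CAT", "Schweiz"]:
    print(token, "->", normalize(token))
```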

96

Normalization: truecasing

The company Apple has launched a new iProduct → Apple

Apples fall from trees → apples

Machine learning inside

97

Stemming

analysis → analysi
feature → featur
are → ar
easily → easili
visible → visibl
variations → variat
individual → individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE         valenci -> valence
(m>0) ANCI -> ANCE         hesitanci -> hesitance
(m>0) IZER -> IZE          digitizer -> digitize
(m>0) ABLI -> ABLE         conformabli -> conformable
(m>0) ALLI -> AL           radicalli -> radical
(m>0) ENTLI -> ENT         differentli -> different
(m>0) ELI -> E             vileli -> vile
(m>0) OUSLI -> OUS         analogousli -> analogous
(m>0) IZATION -> IZE       vietnamization -> vietnamize
(m>0) ATION -> ATE         predication -> predicate
(m>0) ATOR -> ATE          operator -> operate
(m>0) ALISM -> AL          feudalism -> feudal
(m>0) IVENESS -> IVE       decisiveness -> decisive
(m>0) FULNESS -> FUL       hopefulness -> hopeful
(m>0) OUSNESS -> OUS       callousness -> callous
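The rules listed above can be sketched as follows. This covers only these fifteen rules, not the full Porter stemmer; the function names are hypothetical. The measure m counts the vowel-consonant sequences of the stem that remains after removing the suffix, following Porter's definition of a consonant (any letter other than a, e, i, o, u, and other than y preceded by a consonant).

```python
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]

def is_consonant(word, i):
    """Porter's definition: y counts as a vowel when preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number m of vowel-consonant sequences in a stem of the form [C](VC)^m[V]."""
    m = 0
    previous_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m

def apply_rules(word):
    """Apply the rule with the longest matching suffix, if its stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

for word in ["valenci", "digitizer", "vietnamization", "hopefulness", "nation"]:
    print(word, "->", apply_rules(word))
```

The last word is left unchanged: removing "ation" from "nation" leaves the stem "n", whose measure is 0.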

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

Levels on each side, from bottom to top: Words, Syntactic Structure, Semantic Structure, Interlingua

Transfers between the two sides: Direct (words), Syntactic Transfer, Semantic Transfer

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term        Document frequency   Postings
you         1                    1
come        3                    1 3 4
most        1                    1
carefully   1                    1
upon        1                    1
your        1                    1
thine       1                    2
be          1                    2
time        1                    2
Laertes     1                    2
thy         1                    2
fair        1                    2
take        1                    2
my          1                    3
hour        3                    1 2 3
is          1                    3
almost      1                    3
possess     1                    4
merely      1                    4
that        1                    4
it          1                    4
should      1                    4
to          1                    4
this        1                    4

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
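As a reminder, the merge-style intersection of two sorted postings lists can be sketched like this (the function name is an assumption):

```python
def intersect(a, b):
    """Intersect two sorted postings lists by walking them in parallel."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer on the smaller value
        else:
            j += 1
    return result

list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each pointer only moves forward, so the number of steps is at most the sum of the two list lengths.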

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B so far: 1 3

Magical pointer in List A: jump directly to 10

Intersection of A and B: 1 3 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip pointers to 4, 7, 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skips, where p = number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
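A sketch of intersection with skip pointers. The skips are implicit here: a pointer is assumed at every index that is a multiple of the integer square root of the list length, and it leads that many postings ahead. A real index would store the pointers with the postings; the function name is hypothetical.

```python
import math

def intersect_with_skips(a, b):
    """Intersect two sorted lists, following skip pointers when they do not overshoot."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Take the skip only if one starts here and its target is still <= b[j].
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result

list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print(intersect_with_skips(list_a, list_b))
```

With 11 postings, List A gets skips of length 3, so after matching 3 the pointer moves from 4 to 7 to 10 instead of visiting every posting in between.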

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

10

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0) Linearization

11

Term-document model

Term-documentbipartite graph

Documents as lists of words(with duplicates)

Simplification

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

     0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0    Control characters
1    Control characters
2    SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3    0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4    @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5    P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6    `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7    p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL

P = 50 (hex) = 0101 0000
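The table lookup can be reproduced in a few lines. This is only an illustrative sketch; the helper name `to_ascii_bits` is made up for this example.

```python
def to_ascii_bits(text: str) -> str:
    """Return the 8-bit binary form of each ASCII character, space-separated."""
    data = text.encode("ascii")  # raises UnicodeEncodeError on non-ASCII input
    return " ".join(format(byte, "08b") for byte in data)


# Row 5, column 0 of the table: 'P' is 50 (hex)
print(format(ord("P"), "02X"))
# The first four characters of "Possess ..."
print(to_ascii_bits("Poss"))
```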

34

UTF-8

Character   Codepoint   Codepoint in binary    UTF-8 (variable length)
P           U+0050      101 0000               01010000
π           U+03C0      11 1100 0000           11001111 10000000
€           U+20AC      10 0000 1010 1100      11100010 10000010 10101100

At most 7 bits: 1 byte

At most 11 bits: 2 bytes

At most 16 bits: 3 bytes
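A small sketch of how the codepoint bits are packed into UTF-8 bytes for the three size classes shown on the slide. The function `encode_codepoint` is a hypothetical helper written for this illustration (it ignores 4-byte sequences and surrogates); Python's built-in `str.encode("utf-8")` is used to cross-check it.

```python
def encode_codepoint(cp: int) -> bytes:
    """Hand-rolled UTF-8 for codepoints of at most 16 bits (illustration only)."""
    if cp < 0x80:      # at most 7 bits:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:     # at most 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:   # at most 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("codepoints above U+FFFF are not handled in this sketch")


def bits(data: bytes) -> str:
    return " ".join(format(b, "08b") for b in data)


for ch in ("P", "π", "€"):
    print(ch, f"U+{ord(ch):04X}", bits(encode_codepoint(ord(ch))))
```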

44

Collecting documents: first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents: first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents: first challenge

Software-specific encoding

47

Collecting documents: first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents: first challenge

Binary encoding

49

Collecting documents: first challenge

Classification (e.g. ML)

Language

Encoding

Type

UTF-8

50

Collecting documents: second challenge

51

Collecting documents: second challenge

Fine-grained documents (example: e-mail archive)

54

Collecting documents: second challenge

Grouping to a single document (example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

57

Building an inverted index

Throw away punctuation

58

Building an inverted index

Throw away punctuation: this is not trivial

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt
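To see why throwing away punctuation is not trivial, here is a deliberately naive tokenizer. It is a sketch for illustration only (the name `naive_tokenize` is made up), not a recommended approach.

```python
import re


def naive_tokenize(text: str) -> list[str]:
    """Delete every character that is not a word character or whitespace, then split."""
    return re.sub(r"[^\w\s]", "", text).split()


for example in ["name@example.com", "isn't", "Jake O'Neill",
                "the learn-it-all-by-heart methodology"]:
    print(example, "->", naive_tokenize(example))
```

The e-mail address collapses into one meaningless token, the apostrophes vanish, and the hyphenated modifier becomes a single long token that no query is likely to match.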

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Chinese example sentence: not recoverable from the extracted text]

68

Hindi

मुझे इन्फ़र्मेशन रिट्रीवल बहुत पसंद है

मैं इस टर्म अच्छा पढ़ रहा हूँ

मैं इस टर्म अच्छा पढ़ रही हूँ

69

Hindi

मुझे इन्फ़र्मेशन रिट्रीवल बहुत पसंद है

मैं इस टर्म अच्छा पढ़ रहा हूँ

मैं इस टर्म अच्छा पढ़ रही हूँ

To me

Oblique case

Male speaker: raha

Female speaker: rahee

3rd person

1st person

70

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morpheme by morpheme: eat, go to, want, past, <frustration>, inferential/evidential (turns out), 3rd/3rd, also

Also, it turns out she/he wanted to go eat it, but

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

81

Token

Token = grouped sequence of characters

82

Type

Type = equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this
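The difference between tokens and types can be counted directly on the four example documents. This sketch assumes whitespace tokenization and case-sensitive comparison, which is a simplification.

```python
docs = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# Tokens: every occurrence counts
tokens = [word for doc in docs for word in doc.split()]
# Types: identical character sequences are grouped into one class
types = set(tokens)

print(len(tokens), "tokens,", len(types), "types")
```

Words such as "hour", "come" and "it" occur several times as tokens but contribute a single type each.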

85

Reuters' list

a an and are as at be by for from has he in

is it its of on that the to was were will with

86

Trend

Large list (200-300) -> No list at all
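A minimal sketch of stop word removal with the 25-word list above. Lowercasing before the lookup is an assumption made for this example.

```python
STOP_WORDS = set(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)


def remove_stop_words(text: str) -> list[str]:
    """Lowercase, split on whitespace, and drop words found in the stop list."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


print(remove_stop_words("Possess it merely That it should come to this"))
```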

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Query term     Terms that should match
Windows        Windows
windows        Windows, windows, window
window         window, windows

Introduces asymmetry
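The asymmetry can be made explicit with a plain lookup table. This is a sketch: the dictionary mirrors the Windows/windows/window example, and the function name `expand` is made up for illustration.

```python
# Query term -> terms that should match (not symmetric, unlike equivalence classes)
EXPANSIONS = {
    "Windows": ["Windows"],
    "windows": ["Windows", "windows", "window"],
    "window": ["window", "windows"],
}


def expand(term: str) -> list[str]:
    """Return the expansion list of a query term, or the term itself if none is defined."""
    return EXPANSIONS.get(term, [term])


print(expand("windows"))  # also matches the operating system
print(expand("Windows"))  # matches only the capitalized form
```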

90

When to expand

Upon indexing

Query: Lift

Postings before expansion
Lift: 1 5
Elevator: 1 4 6

Postings after expansion
Lift: 1 4 5 6
Elevator: 1 4 5 6

Upon querying

Query: Lift

Expansion: Lift OR Elevator

Postings (unchanged)
Lift: 1 5
Elevator: 1 4 6
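Both options can be sketched on a toy index. The postings and the synonym table below are illustrative assumptions, and both function names are hypothetical.

```python
index = {"lift": [1, 5], "elevator": [1, 4, 6]}
synonyms = {"lift": ["elevator"], "elevator": ["lift"]}


def expand_index(index, synonyms):
    """Index-time expansion: merge the postings of synonyms into each list."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for syn in synonyms.get(term, []):
            merged.update(index.get(syn, []))
        expanded[term] = sorted(merged)
    return expanded


def query_with_expansion(index, synonyms, term):
    """Query-time expansion: evaluate term OR synonym_1 OR ... on the original index."""
    result = set(index.get(term, []))
    for syn in synonyms.get(term, []):
        result.update(index.get(syn, []))
    return sorted(result)


print(expand_index(index, synonyms)["lift"])
print(query_with_expansion(index, synonyms, "lift"))
```

Both routes return the same documents. Index-time expansion pays with longer postings lists, query-time expansion pays with more work per query.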

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside
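Accent removal and lowercasing can be sketched with the standard library. The German umlaut table (ü -> ue and so on) is an assumption added so that Zürich maps to Zuerich; plain accent stripping would give Zurich. Truecasing is not covered, since it needs a trained model.

```python
import unicodedata

# Assumed language-specific rule for German umlauts
GERMAN_UMLAUTS = str.maketrans(
    {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}
)


def strip_accents(text: str) -> str:
    """Decompose characters and drop the combining marks (é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text: str, german: bool = False) -> str:
    text = unicodedata.normalize("NFC", text)  # make sure umlauts are single characters
    if german:
        text = text.translate(GERMAN_UMLAUTS)
    return strip_accents(text).lower()


print(normalize("cliché"), normalize("Zürich", german=True), normalize("CAT"))
```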

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer

(m>0) ENCI -> ENCE         valenci -> valence
(m>0) ANCI -> ANCE         hesitanci -> hesitance
(m>0) IZER -> IZE          digitizer -> digitize
(m>0) ABLI -> ABLE         conformabli -> conformable
(m>0) ALLI -> AL           radicalli -> radical
(m>0) ENTLI -> ENT         differentli -> different
(m>0) ELI -> E             vileli -> vile
(m>0) OUSLI -> OUS         analogousli -> analogous
(m>0) IZATION -> IZE       vietnamization -> vietnamize
(m>0) ATION -> ATE         predication -> predicate
(m>0) ATOR -> ATE          operator -> operate
(m>0) ALISM -> AL          feudalism -> feudal
(m>0) IVENESS -> IVE       decisiveness -> decisive
(m>0) FULNESS -> FUL       hopefulness -> hopeful
(m>0) OUSNESS -> OUS       callousness -> callous
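The rules above can be applied with a short sketch. It covers only this one group of suffix rules, not the full Porter stemmer, and the function names are made up. The condition (m>0) is read as: the stem before the suffix contains at least one vowel-consonant sequence.

```python
VOWELS = set("aeiou")

# (suffix, replacement) pairs taken from the rule list above
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word: str, i: int) -> bool:
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y is a consonant at the start of the word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """m = number of times a vowel run is followed by a consonant run."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1
        prev_is_vowel = not cons
    return m


def apply_rules(word: str) -> str:
    # Only the longest matching suffix is considered
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


print(apply_rules("vietnamization"), apply_rules("hopefulness"))
```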

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois Triangle

Levels on each side, from bottom to top: Words, Syntactic Structure, Semantic Structure

Top: Interlingua

Transfer between the two sides: Direct (words), Syntactic, Semantic

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term         Frequency   Postings
you          1           1
come         3           1 3 4
most         1           1
carefully    1           1
upon         1           1
your         1           1
thine        1           2
be           1           2
time         1           2
Laertes      1           2
thy          1           2
fair         1           2
take         1           2
my           1           3
hour         3           1 2 3
is           1           3
almost       1           3
possess      1           4
merely       1           4
that         1           4
it           1           4
should       1           4
to           1           4
this         1           4
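The standard inverted index for the four example documents can be rebuilt with a few lines. This is a sketch that assumes whitespace tokenization and lowercasing as the only preprocessing, so "Laertes" is stored as "laertes".

```python
from collections import defaultdict

docs = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_index(docs):
    """Map each term to the sorted list of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


index = build_index(docs)
for term in ("come", "hour", "it"):
    print(term, len(index[term]), index[term])
```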

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
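As a reminder, the linear merge over two sorted postings lists can be sketched as follows. The function name `intersect` is chosen for this example.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted postings lists in O(len(a) + len(b))."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer that sits on the smaller document ID
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```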

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B, built step by step: 1, then 1 3, then 1 3 10, then 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: jump directly to 10 in List A

Intersection of A and B: 1 3 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips: many comparisons, waste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips: few comparisons, not many real opportunities to skip

132

In practice

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

In practice: √p skip pointers, where p = number of postings
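A sketch of intersection with skip pointers. It assumes the common heuristic of evenly spaced skips of length about √p for a list of p postings, and it stores the skips implicitly (every `step`-th position has a pointer to the position `step` further on). The function name is made up for this example.

```python
import math


def intersect_with_skips(a: list[int], b: list[int]) -> list[int]:
    """Merge-intersect two sorted lists, following skip pointers when they do not overshoot."""
    step_a = max(1, math.isqrt(len(a)))
    step_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # A skip is taken only if its target is still <= the other list's current ID
            if i % step_a == 0 and i + step_a < len(a) and a[i + step_a] <= b[j]:
                i += step_a
            else:
                i += 1
        else:
            if j % step_b == 0 and j + step_b < len(b) and b[j + step_b] <= a[i]:
                j += step_b
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

With 12 postings the skip length is 3, so after matching 3 the pointer in List A jumps from 4 to 7 to 10 instead of visiting every posting in between.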

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

151

Bi-word indices

Query: "ETH Zurich"

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

153

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

157

Drawback

Query: "Help ETH Zurich to flexibly react"

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains all five biwords but not the phrase

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Help  ETH  Zurich  to  introduce  techniques  so  as  to  flexibly  react
N     N    N       X   N          N           X   X   X   N         N

Pattern: N X* N

Help ETH
ETH Zurich
Zurich introduce
introduce techniques
techniques flexibly
flexibly react
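Given the tags, the extended biwords matching N X* N can be extracted by pairing every N word with the next N word. The tags below are typed in by hand for this sketch; no real part-of-speech tagger is involved, and the function name is hypothetical.

```python
def extended_biwords(tagged: list[tuple[str, str]]) -> list[str]:
    """Pair each word tagged N with the next word tagged N, skipping X words in between."""
    result = []
    previous = None
    for word, tag in tagged:
        if tag == "N":
            if previous is not None:
                result.append(f"{previous} {word}")
            previous = word
    return result


words = "Help ETH Zurich to introduce techniques so as to flexibly react".split()
tags = ["N", "N", "N", "X", "N", "N", "X", "X", "X", "N", "N"]  # assumed tags
print(extended_biwords(list(zip(words, tags))))
```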

172

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index

Help ETH Zurich
ETH Zurich to
Zurich to flexibly
to flexibly react
flexibly react to

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index

Help ETH Zurich to
ETH Zurich to flexibly
Zurich to flexibly react
to flexibly react to

"Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative
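The false positive with biwords and the true negative with 3-word phrases can be checked with a small sketch. It works on a single document and plain sets rather than a real index, and the function names are made up.

```python
def ngrams(text: str, n: int) -> list[str]:
    """All sequences of n consecutive words in the text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def matches(document: str, phrase: str, n: int) -> bool:
    """True if every n-word sequence of the phrase occurs somewhere in the document."""
    doc_terms = set(ngrams(document, n))
    return all(term in doc_terms for term in ngrams(phrase, n))


query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(matches(document, query, 2))  # biwords: reported as a match
print(matches(document, query, 3))  # 3-word phrases: correctly rejected
print(query in document)            # ground truth
```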

176

Drawback

Size of the vocabulary increases exponentially with the phrase length n: (#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Term        Document ID (as before)   Term frequency   Positions
Help        C                         1                1
ETH         C                         1                2
Zurich      C                         1                3
to          C                         3                4 7 11
flexibly    C                         1                5
react       C                         1                6

Document ID, term frequency and positions together form a positional posting

190

Positional index

Query: "ETH Zurich"

ETH         C                         1                2
Zurich      C                         1                3

Position of Zurich = position of ETH + 1
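A positional index and the "+1" check can be sketched as follows. Positions start at 1 as on the slides. The second document D and the function names are assumptions added for illustration, and the term frequency is simply the length of each position list.

```python
from collections import defaultdict


def build_positional_index(docs: dict[str, str]) -> dict[str, dict[str, list[int]]]:
    """term -> {document ID -> list of positions}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase: str) -> list[str]:
    """Documents in which the k-th word of the phrase occurs at start position + k."""
    words = phrase.split()
    if not words:
        return []
    hits = []
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(start + k in index.get(word, {}).get(doc_id, [])
                   for k, word in enumerate(words[1:], start=1)):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",  # hypothetical second document
}
index = build_positional_index(docs)
print(index["to"]["C"])
print(phrase_query(index, "ETH Zurich"))
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

Unlike the biword index, the positional index rejects document D for the long phrase, because "flexibly" is not at the position right after "to" that follows "Zurich".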

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

11

Term-document model

Term-documentbipartite graph

Documents as lists of words(with duplicates)

Simplification

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings lists of Lift and Elevator are expanded so that each term also covers the documents of the other. The query Lift is left unchanged.

Upon querying: the index is left unchanged. The query Lift is expanded to Lift OR Elevator.

[Figure: postings lists of Lift and Elevator before and after expansion]
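The two expansion strategies can be sketched as follows. This is only an illustration under stated assumptions: the postings for "lift" and "elevator" are made-up document IDs, the synonym table is hypothetical, and the helper names `expand_index` and `expand_query` are not from the lecture.

```python
def union(a, b):
    """Union of two postings lists, returned sorted."""
    return sorted(set(a) | set(b))

def expand_index(index, synonyms):
    """Expansion upon indexing: every term also gets its synonyms' postings."""
    expanded = {}
    for term, postings in index.items():
        merged = list(postings)
        for other in synonyms.get(term, []):
            merged = union(merged, index.get(other, []))
        expanded[term] = merged
    return expanded

def expand_query(index, term, synonyms):
    """Expansion upon querying: evaluate term OR synonym_1 OR synonym_2 ..."""
    result = list(index.get(term, []))
    for other in synonyms.get(term, []):
        result = union(result, index.get(other, []))
    return result

# Illustrative postings (assumed document IDs, not taken from the slide).
index = {"lift": [1, 5], "elevator": [4, 6]}
synonyms = {"lift": ["elevator"], "elevator": ["lift"]}

print(expand_index(index, synonyms)["lift"])
print(expand_query(index, "lift", synonyms))
```

Both strategies return the same documents here. Expanding the index costs space, while expanding the query costs time at each lookup.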

94

Normalization: accents and diacritics

cliché → cliche

Zürich → Zuerich
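A minimal sketch of this normalization in Python, using the standard `unicodedata` module. Stripping combining marks turns "ü" into "u", so the German transliteration to "ue" needs its own table; the table `GERMAN` and the `german` flag are assumptions made for this illustration.

```python
import unicodedata

# Hypothetical transliteration table for German umlauts and sharp s.
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}

def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(text, german=False):
    """Remove accents; optionally transliterate German umlauts first."""
    text = unicodedata.normalize("NFC", text)
    if german:
        text = "".join(GERMAN.get(ch, ch) for ch in text)
    return strip_diacritics(text)

print(normalize("cliché"))
print(normalize("Zürich", german=True))
print(normalize("Zürich"))
```

The last two calls show why the rule is language-dependent: the same input maps to "Zuerich" or "Zurich" depending on the convention chosen.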

95

Normalization: lowercasing

CAT → cat

Schweiz → schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct → Apple

Apples fall from trees → apples

Machine learning inside

97

Stemming

analysis → analysi
feature → featur
are → ar
easily → easili
visible → visibl
variations → variat
individual → individu

Chop letters of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE          valenci -> valence
(m>0) ANCI -> ANCE          hesitanci -> hesitance
(m>0) IZER -> IZE           digitizer -> digitize
(m>0) ABLI -> ABLE          conformabli -> conformable
(m>0) ALLI -> AL            radicalli -> radical
(m>0) ENTLI -> ENT          differentli -> different
(m>0) ELI -> E              vileli -> vile
(m>0) OUSLI -> OUS          analogousli -> analogous
(m>0) IZATION -> IZE        vietnamization -> vietnamize
(m>0) ATION -> ATE          predication -> predicate
(m>0) ATOR -> ATE           operator -> operate
(m>0) ALISM -> AL           feudalism -> feudal
(m>0) IVENESS -> IVE        decisiveness -> decisive
(m>0) FULNESS -> FUL        hopefulness -> hopeful
(m>0) OUSNESS -> OUS        callousness -> callous
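The rules above can be tried out with a small sketch. It covers only the rules listed on this slide, not the full Porter algorithm, and it assumes that the longest matching suffix wins and that m counts the vowel-to-consonant transitions in the remaining stem. The function names are illustrative.

```python
VOWELS = "aeiou"

# Only the rules shown on the slide (one step of the Porter stemmer).
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]
# Try longer suffixes first so that "ization" wins over "ation".
RULES_BY_LENGTH = sorted(RULES, key=lambda rule: len(rule[0]), reverse=True)

def is_consonant(word, i):
    """A letter is a consonant unless it is a vowel, or a 'y' after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m = number of vowel-run followed by consonant-run sequences in the stem."""
    pattern = ["c" if is_consonant(stem, i) else "v" for i in range(len(stem))]
    return sum(1 for prev, cur in zip(pattern, pattern[1:]) if prev == "v" and cur == "c")

def apply_rules(word):
    """Rewrite the longest matching suffix if the stem before it has m > 0."""
    for suffix, replacement in RULES_BY_LENGTH:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

for example in ["valenci", "digitizer", "vietnamization", "operator", "ration"]:
    print(example, "->", apply_rules(example))
```

The condition (m>0) is what keeps a short word such as "ration" unchanged: the stem "r" left after removing "ation" contains no vowel-consonant sequence.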

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois Triangle

[Figure: Vauquois triangle. Both sides rise from Words through Syntactic Structure and Semantic Structure to the Interlingua at the top. The horizontal arrows are Direct, Syntactic Transfer and Semantic Transfer.]

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

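Reminder: Postings (Standard Inverted Index)

Term (document frequency) → postings

you (1) → 1
come (3) → 1 3 4
most (1) → 1
carefully (1) → 1
upon (1) → 1
your (1) → 1
thine (1) → 2
be (1) → 2
time (1) → 2
Laertes (1) → 2
thy (1) → 2
fair (1) → 2
take (1) → 2
my (1) → 3
hour (3) → 1 2 3
is (1) → 3
almost (1) → 3
possess (1) → 4
merely (1) → 4
that (1) → 4
it (1) → 4
should (1) → 4
to (1) → 4
this (1) → 4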

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
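The intersection of two sorted postings lists can be sketched as a linear merge with one pointer per list. This is a minimal illustration; the function name `intersect` is chosen here and the lists are the ones from the slide.

```python
def intersect(a, b):
    """Intersect two sorted postings lists with one pointer per list."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])   # document occurs in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                # advance the pointer on the smaller value
        else:
            j += 1
    return result                 # stop as soon as one list reaches its end

list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each step advances at least one pointer, so the running time is linear in the total length of the two lists.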

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: jump ahead in List A directly to 10

Intersection of A and B: 1 3 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12 (skip pointers to 4, 7, 10)

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skips, where p = number of postings
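The skip idea can be sketched on top of the linear merge. This is an illustration under assumptions: the lists are stored as Python lists, a skip pointer is simulated at every position that is a multiple of √p (rounded down), and the helper names `advance` and `intersect_with_skips` are chosen here.

```python
from math import isqrt

def advance(postings, pos, skip, target):
    """Move past postings[pos], following skip pointers that do not overshoot target."""
    if pos % skip == 0 and pos + skip < len(postings) and postings[pos + skip] <= target:
        while pos % skip == 0 and pos + skip < len(postings) and postings[pos + skip] <= target:
            pos += skip           # the skip target is still <= target: take the skip
        return pos
    return pos + 1                # no usable skip pointer: step to the next posting

def intersect_with_skips(a, b):
    """Intersect two sorted postings lists, using about sqrt(p) skips per list."""
    skip_a = max(1, isqrt(len(a)))
    skip_b = max(1, isqrt(len(b)))
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, skip_a, b[j])
        else:
            j = advance(b, j, skip_b, a[i])
    return result

list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

A skip is only taken when its target does not exceed the value being searched for, so no common document can be jumped over.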

133

Phrase search

134

Phrase queries

With a posting list

"ETH Zürich" (quotes = phrase search)

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to, ...

Query: "ETH Zurich" → look up the biword ETH Zurich

Query: "Help ETH Zurich to flexibly react" → intersect:

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

157

Inconvenient

Query: "Help ETH Zurich to flexibly react"

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
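The biword approach and its false positive can be reproduced with a short sketch. It assumes whitespace tokenization and made-up document IDs 1 and 2; the function names are illustrative.

```python
from collections import defaultdict

def biwords(text):
    """All pairs of consecutive words in the text."""
    words = text.split()
    return [f"{first} {second}" for first, second in zip(words, words[1:])]

def build_biword_index(docs):
    """Map every biword to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index

def phrase_candidates(index, phrase):
    """Intersect the postings of all biwords of the phrase."""
    result = None
    for biword in biwords(phrase):
        postings = index.get(biword, set())
        result = postings if result is None else result & postings
    return sorted(result or [])

docs = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
query = "Help ETH Zurich to flexibly react"
print(phrase_candidates(index, query))
print([doc_id for doc_id, text in docs.items() if query in text])
```

Document 2 contains every biword of the query but not the phrase itself, so the first line includes it while the exact substring check on the second line does not.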

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

N X* N

Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to, ...

Intersect: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to, ...

Intersect: Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Inconvenient

Size of the vocabulary increases exponentially: (#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting = document ID (as before), term frequency, positions

Index:

Help → C, 1: 1
ETH → C, 1: 2
Zurich → C, 1: 3
to → C, 3: 4 7 11
flexibly → C, 1: 5
react → C, 1: 6

Query: "ETH Zurich"

ETH → C, 1: 2
Zurich → C, 1: 3

Match: the position of Zurich is the position of ETH + 1
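A positional index and the "+1" check can be sketched as follows. The sketch assumes whitespace tokenization and 1-based positions as on the slides; document D is the false-positive sentence from the biword slides, and the function names are illustrative.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {document ID -> list of positions}; term frequency is the list length."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index

def phrase_search(index, phrase):
    """Documents in which the k-th word of the phrase occurs at position start + k."""
    words = phrase.split()
    if not words:
        return []
    hits = []
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(start + k in index.get(word, {}).get(doc_id, [])
                   for k, word in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)

docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)
print(index["to"]["C"])
print(phrase_search(index, "ETH Zurich"))
print(phrase_search(index, "Help ETH Zurich to flexibly react"))
```

Unlike the biword index, the positional index rejects document D for the long phrase, because "flexibly" does not occur right after the first "to" there.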

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

12

Term-document model

Term-document bipartite graph

Adjacency matrix

Postings

15

Document

Term-document bipartite graph

Adjacency matrix

Postings

16

Term

Term-document bipartite graph

Adjacency matrix

Postings

17

Posting

Term-document bipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions + white magic

20

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents: first challenge

Possess it merely That it should come to this

28

Collecting documents: sequence of characters

Possess it merely That it should come to this

P o s s e s s   i t   m e r e l y   T h a t   i t   s h o u l d   c o m e   t o   t h i s

29

Character set

P o s s e s s   i t   m e r e l y   T h a t   i t   s h o u l d   c o m e   t o   t h i s

30

Collecting documents: encoding

Possess it merely That it should come to this

P o s s ...

01010000 01101111 01110011 01110011 ...

Here: ASCII

31

ASCII Table

     0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0    Control characters
1    Control characters
2    SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3    0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4    @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5    P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6    `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7    p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL

P = 50 (hex) = 0101 0000

34

UTF-8

Character   Codepoint   Codepoint in binary    UTF-8 (variable length)

P           U+0050      101 0000               01010000                      (up to 7 bits: 1 byte)

π           U+03C0      11 1100 0000           11001111 10000000             (up to 11 bits: 2 bytes)

€           U+20AC      10 0000 1010 1100      11100010 10000010 10101100    (up to 16 bits: 3 bytes)
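The three rows of the table can be reproduced with Python's built-in string encoding. This is a small sketch; the helper name `bits` is chosen for this illustration.

```python
def bits(data):
    """Render a bytes object as space-separated groups of 8 bits."""
    return " ".join(f"{byte:08b}" for byte in data)

for character in ["P", "π", "€"]:
    codepoint = ord(character)
    encoded = character.encode("utf-8")
    print(character, f"U+{codepoint:04X}", f"{codepoint:b}", bits(encoded), len(encoded), "byte(s)")

# For characters in the ASCII range, UTF-8 and ASCII give the same byte.
print(bits("P".encode("ascii")) == bits("P".encode("utf-8")))
```

The leading bits of the first byte (0, 110, 1110) announce how many bytes the character uses, and every continuation byte starts with 10.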

44

Collecting documents: first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents: first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents: first challenge

Software-specific encoding

47

Collecting documents: first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents: first challenge

Binary encoding

49

Collecting documents: first challenge

Classification (e.g. ML)

Language

Encoding

Type

UTF-8

50

Collecting documents: second challenge

Fine-grained documents (example: e-mail archive)

Grouping to a single document (example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

58

Building an inverted index

Throw away punctuation

This is not trivial

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt
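Building the index for the four documents can be sketched as follows. The tokenizer is a deliberately naive assumption (lowercase, keep runs of the letters a to z), chosen to show both the basic construction and why throwing away punctuation is not trivial; the function names are illustrative.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Naive tokenizer: lowercase the text and keep runs of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(tokenize(docs[doc_id]))):
            index[term].append(doc_id)
    return dict(index)

docs = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}
index = build_inverted_index(docs)
print(index["come"], index["hour"])

# The same tokenizer breaks on the corner cases listed above.
print(tokenize("isn't"), tokenize("name@example.com"))
```

On the second line of output, "isn't" falls apart into two fragments and the e-mail address into three, which is the kind of case a real tokenizer has to handle explicitly.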

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

eat / go to / want / Past / <Frustration> / inferential evidential (turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Query: Help ETH Zurich to flexibly react

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

The query is rewritten as a conjunction of its biwords:

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Drawback

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive: the document contains all five biwords, but not the phrase itself
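The false positive above can be reproduced with a small sketch. The function names and the two-document collection are assumptions made for illustration only; lowercasing and whitespace splitting stand in for real tokenization.

```python
from collections import defaultdict

def biwords(text):
    """Return the consecutive word pairs of a text, lowercased."""
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def build_biword_index(docs):
    """Map each biword to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index

def phrase_candidates(index, phrase):
    """AND together the postings of every biword of the phrase."""
    postings = [index.get(bw, set()) for bw in biwords(phrase)]
    return set.intersection(*postings) if postings else set()

# Hypothetical collection: document 1 contains the phrase, document 2 does not.
docs = {
    1: "Help ETH Zurich to flexibly react to new challenges",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
hits = phrase_candidates(index, "Help ETH Zurich to flexibly react")
print(sorted(hits))  # document 2 is returned although it lacks the phrase
```

Document 2 is returned because every biword of the query occurs somewhere in it, even though the biwords are not adjacent to each other.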

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Resulting biwords:

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react

172

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Drawback

Size of the vocabulary increases exponentially with the phrase length n

(#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting: document ID (as before), term frequency, positions

Index

Help: C, 1: 1

ETH: C, 1: 2

Zurich: C, 1: 3

to: C, 3: 4 7 11

flexibly: C, 1: 5

react: C, 1: 6

190

Positional index

Query: ETH Zurich

ETH: C, 1: 2

Zurich: C, 1: 3

Zurich occurs at the position of ETH +1, so document C matches the phrase
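The "+1" check can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the function names, the dictionary layout and the second document D are assumptions, and positions are counted from 1 as on the slides.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions]}, with positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.split(), start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def phrase_query(index, phrase):
    """Return the IDs of documents where the words occur at consecutive positions."""
    words = phrase.split()
    if not words or any(w not in index for w in words):
        return set()
    result = set()
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # The k-th following word must sit exactly k positions further.
            if all(start + k in index[w].get(doc_id, ())
                   for k, w in enumerate(words[1:], start=1)):
                result.add(doc_id)
                break
    return result

docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",  # hypothetical
}
index = build_positional_index(docs)
print(index["to"]["C"])                                          # positions of "to" in C
print(phrase_query(index, "Help ETH Zurich to flexibly react"))  # only C matches
```

Unlike the biword approach, document D is rejected here, because "flexibly" does not directly follow "to" at position 4.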

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

13

Term-document model

Term-document bipartite graph

Adjacency matrix

Postings

Key notions: document, term, posting

18

Index construction

19

Last time

Plenty of simplifying assumptions + white magic

20

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents: first challenge

Possess it merely That it should come to this

28

Collecting documents: sequence of characters

Possess it merely That it should come to this

Character set: the document is a sequence of characters, P o s s e s s (space) i t (space) m e r e l y ...

Collecting documents: encoding

P o s s: 01010000 01101111 01110011 01110011 (here: ASCII)

31

ASCII Table

    0 1 2 3 4 5 6 7 8 9 A B C D E F
0   Control characters
1   Control characters
2   SP ! " # $ % & ' ( ) * + , - . /
3   0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4   @ A B C D E F G H I J K L M N O
5   P Q R S T U V W X Y Z [ \ ] ^ _
6   ` a b c d e f g h i j k l m n o
7   p q r s t u v w x y z { | } ~ DEL

P = 50 (hex) = 0 1 0 1 0 0 0 0

34

UTF-8

Character  Codepoint  Codepoint in binary                  UTF-8 (variable length)
P          U+0050     101 0000 (at most 7 bits)            01010000
π          U+03C0     11 1100 0000 (at most 11 bits)       11001111 10000000
€          U+20AC     10 0000 1010 1100 (at most 16 bits)  11100010 10000010 10101100
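The three encodings can be checked with Python's built-in `str.encode`. The helper name `utf8_bits` is an assumption made for this sketch.

```python
def utf8_bits(ch):
    """Return the UTF-8 encoding of a character as space-separated bytes in binary."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))

for ch in ("P", "\u03c0", "\u20ac"):  # P, Greek small letter pi, euro sign
    # Character, codepoint, codepoint in binary, UTF-8 bytes in binary
    print(f"{ch}  U+{ord(ch):04X}  {ord(ch):b}  {utf8_bits(ch)}")
```

A codepoint that fits in 7 bits takes one byte, one that fits in 11 bits takes two bytes, and one that fits in 16 bits takes three bytes.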

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents: first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents first challenge

Binary encoding

49

Collecting documents: first challenge

Classification (e.g., ML)

Language

Encoding

Type

UTF-8

50

Collecting documents: second challenge

Fine-grained documents (example: e-mail archive)

Grouping to a single document (example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

58

Building an inverted index

Throw away punctuation

This is not trivial

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt
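A deliberately naive sketch shows why throwing away punctuation is not trivial. The function name and the regular expression are assumptions for illustration; real tokenizers use more careful rules.

```python
import re

def naive_tokenize(text):
    """Delete everything that is not a word character or whitespace, then split."""
    return re.sub(r"[^\w\s]", "", text).split()

for example in ("name@example.com",
                "isn't",
                "Jake O'Neill",
                "the learn-it-all-by-heart methodology"):
    print(example, "->", naive_tokenize(example))
```

The e-mail address collapses into one meaningless token, the apostrophes vanish, and the hyphenated compound becomes a single long token that no query is likely to match.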

59

Corner cases

Hewlett-Packard

State-of-the-art

co-education

the hold-him-back-and-drag-him-away maneuver

data base

San Francisco

Los Angeles-based company

cheap San Francisco-Los Angeles fares

York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases: numbers

3/20/91

20/3/91

Mar 20, 1991

B-52

100.2.86.144

(800) 234-2333

800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

(Chinese example sentence, not recoverable from the extraction)

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morpheme by morpheme: eat, go to, want, Past, <Frustration>, inferential evidential (turns out), 3rd/3rd, also

Also, it turns out she/he wanted to go eat it, but ...

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class of tokens (same sequence of characters)
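The difference between tokens and types can be made concrete on the four example documents. This sketch assumes whitespace tokenization and case-sensitive comparison, which is a simplification.

```python
from collections import Counter

docs = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# Every occurrence is a token; identical character sequences form one type.
tokens = [word for doc in docs for word in doc.split()]
types = set(tokens)
counts = Counter(tokens)

print(len(tokens), "tokens,", len(types), "types")
print(counts.most_common(3))
```

The words "hour" and "come" each occur three times and "it" twice, so the collection has fewer types than tokens.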

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters list

a an and are as at be by for from has he in is it its of on that the to was were will with
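Filtering with the list above takes only a few lines. The function name is an assumption, and matching case-insensitively is a design choice of this sketch rather than something the slides prescribe.

```python
STOP_WORDS = set(
    "a an and are as at be by for from has he in is it its "
    "of on that the to was were will with".split()
)

def remove_stop_words(tokens):
    """Keep only the tokens whose lowercased form is not in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

doc4 = "Possess it merely That it should come to this".split()
print(remove_stop_words(doc4))
```

Note the cost: after filtering, a phrase query such as "to be or not to be" can no longer be answered, which is one reason for the trend towards smaller lists.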

86

Trend

Large list (200-300)

No list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. becomes USA

to-day becomes today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows expands to: Windows

windows expands to: windows, Windows, window

window expands to: window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings lists of Lift and Elevator are expanded so that both terms point to the union of the two lists. The query Lift is then answered from the expanded list of Lift alone.

Upon querying: the index keeps the original postings lists, and the query Lift is expanded to Lift OR Elevator.
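Both options can be sketched side by side. The postings lists, the synonym table and the function names below are hypothetical and chosen only for illustration.

```python
# Hypothetical original index and synonym table
index = {"lift": [1, 5], "elevator": [4, 6]}
SYNONYMS = {"lift": ["elevator"], "elevator": ["lift"]}

def expand_index(index, synonyms):
    """Expansion upon indexing: each term stores the union of its synonyms' postings."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for syn in synonyms.get(term, []):
            merged.update(index.get(syn, []))
        expanded[term] = sorted(merged)
    return expanded

def query_with_expansion(index, synonyms, term):
    """Expansion upon querying: evaluate term OR synonym_1 OR ... on the original index."""
    result = set(index.get(term, []))
    for syn in synonyms.get(term, []):
        result.update(index.get(syn, []))
    return sorted(result)

print(expand_index(index, SYNONYMS)["lift"])           # longer postings list, simple query
print(query_with_expansion(index, SYNONYMS, "lift"))   # unchanged index, longer query
```

Both return the same documents. Expanding the index costs space, while expanding the query costs time at every search.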

94

Normalization: accents and diacritics

cliché becomes cliche

Zürich becomes Zuerich

95

Normalization: lowercasing

CAT becomes cat

Schweiz becomes schweiz
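These rules can be sketched with the standard `unicodedata` module. The function names are assumptions. Note that generic accent stripping turns Zürich into Zurich, so the mapping to Zuerich needs a separate, language-specific table.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# German-specific folding: a-umlaut -> ae, o-umlaut -> oe, u-umlaut -> ue
GERMAN_FOLDING = str.maketrans({"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue"})

def german_fold(text):
    """Replace lowercase umlauts by their two-letter transcription."""
    return unicodedata.normalize("NFC", text).translate(GERMAN_FOLDING)

def lowercase(token):
    return token.lower()

print(strip_accents("clich\u00e9"))   # cliche
print(german_fold("Z\u00fcrich"))     # Zuerich
print(lowercase("Schweiz"))           # schweiz
```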

96

Normalization: truecasing

The company Apple has launched a new iProduct: Apple stays Apple

Apples fall from trees: Apples becomes apples

Machine learning inside

97

Stemming

analysis becomes analysi

feature becomes featur

are becomes ar

easily becomes easili

visible becomes visibl

variations becomes variat

individual becomes individu

Chop letters off the end of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE      valenci -> valence
(m>0) ANCI -> ANCE      hesitanci -> hesitance
(m>0) IZER -> IZE       digitizer -> digitize
(m>0) ABLI -> ABLE      conformabli -> conformable
(m>0) ALLI -> AL        radicalli -> radical
(m>0) ENTLI -> ENT      differentli -> different
(m>0) ELI -> E          vileli -> vile
(m>0) OUSLI -> OUS      analogousli -> analogous
(m>0) IZATION -> IZE    vietnamization -> vietnamize
(m>0) ATION -> ATE      predication -> predicate
(m>0) ATOR -> ATE       operator -> operate
(m>0) ALISM -> AL       feudalism -> feudal
(m>0) IVENESS -> IVE    decisiveness -> decisive
(m>0) FULNESS -> FUL    hopefulness -> hopeful
(m>0) OUSNESS -> OUS    callousness -> callous
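The rules above can be sketched in a few lines. This covers only the rules listed on the slide, not the full Porter stemmer; the function names are assumptions, and `measure` follows Porter's definition of m as the number of vowel-consonant sequences in the stem.

```python
import re

def measure(stem):
    """Porter's m: number of vowel-run/consonant-run (VC) pairs in the stem."""
    kinds = []
    for i, ch in enumerate(stem):
        if ch in "aeiou":
            kinds.append("V")
        elif ch == "y" and i > 0 and kinds[-1] == "C":
            kinds.append("V")  # y counts as a vowel after a consonant
        else:
            kinds.append("C")
    collapsed = re.sub(r"(.)\1+", r"\1", "".join(kinds))  # merge runs
    return collapsed.count("VC")

# Rules in slide order; IZATION must be tried before ATION.
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]

def apply_rules(word):
    """Rewrite the first matching suffix if the remaining stem has m > 0."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

for w in ("valenci", "vietnamization", "operator", "hopefulness", "ration"):
    print(w, "->", apply_rules(w))
```

The word "ration" is left unchanged: its stem "r" has m = 0, so the condition (m>0) blocks the ATION rule.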

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

Levels, from bottom to top, on both the source and the target side: Words, Syntactic Structure, Semantic Structure, with Interlingua at the apex

Links between the two sides: Direct (word level), Syntactic Transfer, Semantic Transfer

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

Lemmatization can help with languages that have a richer morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term       Document frequency   Postings
you        1                    1
come       3                    1 3 4
most       1                    1
carefully  1                    1
upon       1                    1
your       1                    1
thine      1                    2
be         1                    2
time       1                    2
Laertes    1                    2
thy        1                    2
fair       1                    2
take       1                    2
my         1                    3
hour       3                    1 2 3
is         1                    3
almost     1                    3
possess    1                    4
merely     1                    4
that       1                    4
it         1                    4
should     1                    4
to         1                    4
this       1                    4

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Both pointers advance until one list reaches its end

Intersection of A and B: 1 4 8 12
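As a reminder, the merge-style intersection of two sorted postings lists can be sketched as follows. The function name is an assumption; the lists are the ones from the slide.

```python
def intersect(a, b):
    """Intersect two sorted postings lists with one pointer per list."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer behind the smaller value
        else:
            j += 1
    return result

list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each pointer only moves forward, so the running time is linear in the total length of the two lists.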

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B, built step by step: 1, then 1 3, then 1 3 10, then 1 3 10 12

After the match on 3, the pointer in List A has to walk through 4 5 6 7 8 9 one posting at a time before it reaches 10

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: from the position after 3 in List A, jump directly to 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12, with skip pointers to 4, 7 and 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skip pointers, where p is the number of postings
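A sketch of intersection with skip pointers, under the assumption that a list of p postings carries skip pointers every ⌊√p⌋ positions. The skips are simulated with index arithmetic on a Python list instead of stored pointers, and the function name is hypothetical.

```python
import math

def intersect_with_skips(a, b):
    """Intersect two sorted lists, following a skip when it does not overshoot."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Skip pointers live at multiples of skip_a; follow one only if
            # its target is still <= the value we are looking for.
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result

list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

On this example the pointer in List A jumps from 4 to 7 and from 7 to 10 instead of visiting every posting in between.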

133

Phrase search

134

Phrase queries

"ETH Zürich"

Quotes = phrase search

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11


14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character   Codepoint   Codepoint in binary   UTF-8 (variable length)
P           U+0050      101 0000              01010000
π           U+03C0      11 1100 0000          11001111 10000000
€           U+20AC      10 0000 1010 1100     11100010 10000010 10101100

At most 7 bits: 1 byte

At most 11 bits: 2 bytes

At most 16 bits: 3 bytes
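
The three rows of the UTF-8 table can be checked with Python's built-in encoder. This is a small sketch; `utf8_bits` is a hypothetical helper name, not something from the lecture.

```python
def utf8_bits(character):
    """Return the UTF-8 bytes of a character as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in character.encode("utf-8"))


for character in ["P", "π", "€"]:
    codepoint = ord(character)
    # Columns: character, codepoint, codepoint in binary, UTF-8 bytes.
    print(character, f"U+{codepoint:04X}", format(codepoint, "b"), utf8_bits(character))
```

The number of bytes grows with the number of significant bits in the codepoint: one byte for P, two for π, three for €.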

44

Collecting documents: first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents: first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents: first challenge

Software-specific encoding

47

Collecting documents: first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents: first challenge

Binary encoding

49

Collecting documents: first challenge

Classification (e.g., ML)

Language

Encoding

Type

UTF-8

50

Collecting documents: second challenge

51

Collecting documents: second challenge

Fine-grained documents

(Example: e-mail archive)

54

Collecting documents: second challenge

Grouping to a single document (Example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

57

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt
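
To see why throwing away punctuation is not trivial, here is a deliberately naive sketch. The function name `naive_tokenize` and the choice of examples are illustrative assumptions.

```python
import re


def naive_tokenize(text):
    """Delete every character that is neither a word character nor whitespace, then split."""
    return re.sub(r"[^\w\s]", "", text).split()


for example in ["name@example.com", "isn't", "Jake O'Neill", "learn-it-all-by-heart"]:
    print(example, "->", naive_tokenize(example))
```

The e-mail address collapses into one meaningless token, the apostrophes vanish, and the hyphenated compound becomes a single long string. A real tokenizer needs rules or a model for each of these cases.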

59

Corner cases

Hewlett-Packard

State-of-the-art

co-education

the hold-him-back-and-drag-him-away maneuver

data base

San Francisco

Los Angeles-based company

cheap San Francisco-Los Angeles fares

York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases: numbers

3/20/91

20/3/91

Mar 20, 1991

B-52

100.2.86.144

(800) 234-2333

800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

(Chinese example sentence; the characters did not survive extraction)

68

Hindi

मुझे इन्फ़र्मेशन रिट्रीवोल बहुत पसंद है

मैं इस टर्म अच्छा पढ़ रहा हूँ

मैं इस टर्म अच्छा पढ़ रही हूँ

69

Hindi

मुझे इन्फ़र्मेशन रिट्रीवोल बहुत पसंद है

मुझे: "to me" (oblique case); है: 3rd person

मैं इस टर्म अच्छा पढ़ रहा हूँ

Male speaker: raha; हूँ: 1st person

मैं इस टर्म अच्छा पढ़ रही हूँ

Female speaker: rahee

70

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morpheme by morpheme: eat, go to, want, past, <frustration>, inferential/evidential (turns out), 3rd/3rd, also

Also, it turns out she/he wanted to go eat it, but ...

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

82

Type

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Type = equivalence class of tokens (same sequence of characters)
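
The difference between tokens and types can be made concrete by counting both on the four example documents. This is a minimal sketch that splits on whitespace only and applies no normalization; the variable names are illustrative.

```python
from collections import Counter

documents = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}

# Every occurrence is a token; identical character sequences share one type.
tokens = [token for text in documents.values() for token in text.split()]
types = Counter(tokens)

print(len(tokens), "tokens,", len(types), "types")
print(types["hour"], types["come"], types["it"])
```

Here "hour" and "come" each occur three times and "it" twice, so the collection has fewer types than tokens.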

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters's list

a an and are as at be by for from has he in is it its of on that the to was were will with
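
Filtering with such a stop list takes only a set lookup. The sketch below assumes case-insensitive matching, which the slide does not specify; the names are illustrative.

```python
REUTERS_STOP_WORDS = set(
    "a an and are as at be by for from has he in is it its of on "
    "that the to was were will with".split()
)


def remove_stop_words(tokens):
    """Keep only tokens whose lowercase form is not in the stop list."""
    return [token for token in tokens if token.lower() not in REUTERS_STOP_WORDS]


print(remove_stop_words("Possess it merely That it should come to this".split()))
# ['Possess', 'merely', 'should', 'come', 'this']
```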

86

Trend: from a large list (200-300 terms) to no list at all

87

Equivalence classes of terms (types)

to-day, today

USA, U.S.A.

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing

Postings before expansion: Lift: 1 5, Elevator: 4 6

Postings after expansion: Lift: 1 4 5 6, Elevator: 1 4 5 6

The query Lift is looked up as is

Upon querying

Postings stay unexpanded: Lift: 1 5, Elevator: 4 6

The query Lift is expanded to Lift OR Elevator
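
The two options can be compared in a short sketch. The postings and the synonym table below are illustrative assumptions, and both function names are made up for this example.

```python
index = {"lift": [1, 5], "elevator": [4, 6]}        # illustrative postings
synonyms = {"lift": ["elevator"], "elevator": ["lift"]}  # hypothetical expansion table


def expand_index(index, synonyms):
    """Expansion upon indexing: merge the postings of related terms once."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for other in synonyms.get(term, []):
            merged.update(index.get(other, []))
        expanded[term] = sorted(merged)
    return expanded


def query_with_expansion(index, synonyms, term):
    """Expansion upon querying: evaluate `term OR related terms` on the plain index."""
    result = set(index.get(term, []))
    for other in synonyms.get(term, []):
        result.update(index.get(other, []))
    return sorted(result)


print(expand_index(index, synonyms)["lift"])          # [1, 4, 5, 6]
print(query_with_expansion(index, synonyms, "lift"))  # [1, 4, 5, 6]
```

Both return the same documents. Expanding upon indexing costs space for the longer postings lists, while expanding upon querying costs an extra union for every query.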

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz
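
Diacritic removal and lowercasing can be sketched with the standard `unicodedata` module. The German transliteration table (ü to ue, and so on) is an assumption added so that Zürich maps to Zuerich; plain Unicode decomposition would give Zurich instead.

```python
import unicodedata

# Assumed transliteration table for German; not part of the lecture.
GERMAN_TRANSLITERATION = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def strip_diacritics(term):
    """Decompose characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(term, german=False):
    """Lowercase, optionally transliterate German umlauts, then strip diacritics."""
    term = unicodedata.normalize("NFC", term).lower()
    if german:
        for source, target in GERMAN_TRANSLITERATION.items():
            term = term.replace(source, target)
    return strip_diacritics(term)


print(normalize("cliché"))               # cliche
print(normalize("Zürich", german=True))  # zuerich
print(normalize("CAT"))                  # cat
```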

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi

feature -> featur

are -> ar

easily -> easili

visible -> visibl

variations -> variat

individual -> individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE         valenci -> valence
(m>0) ANCI -> ANCE         hesitanci -> hesitance
(m>0) IZER -> IZE          digitizer -> digitize
(m>0) ABLI -> ABLE         conformabli -> conformable
(m>0) ALLI -> AL           radicalli -> radical
(m>0) ENTLI -> ENT         differentli -> different
(m>0) ELI -> E             vileli -> vile
(m>0) OUSLI -> OUS         analogousli -> analogous
(m>0) IZATION -> IZE       vietnamization -> vietnamize
(m>0) ATION -> ATE         predication -> predicate
(m>0) ATOR -> ATE          operator -> operate
(m>0) ALISM -> AL          feudalism -> feudal
(m>0) IVENESS -> IVE       decisiveness -> decisive
(m>0) FULNESS -> FUL       hopefulness -> hopeful
(m>0) OUSNESS -> OUS       callousness -> callous
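
The rules listed on this slide can be tried out with a small sketch. It covers only these suffix rules, not the full Porter stemmer, and the function names are illustrative. The measure m counts vowel-to-consonant transitions in the stem that remains after removing the suffix.

```python
# Only the rules shown on the slide; the real stemmer has several more steps.
STEP_RULES = {
    "enci": "ence", "anci": "ance", "izer": "ize", "abli": "able",
    "alli": "al", "entli": "ent", "eli": "e", "ousli": "ous",
    "ization": "ize", "ation": "ate", "ator": "ate", "alism": "al",
    "iveness": "ive", "fulness": "ful", "ousness": "ous",
}


def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u, or a y preceded by a consonant."""
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions (the m of the (m>0) condition)."""
    m, previous_was_vowel = 0, False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m


def apply_rules(word):
    """Apply the rule with the longest matching suffix, if its stem has m > 0."""
    for suffix in sorted(STEP_RULES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + STEP_RULES[suffix] if measure(stem) > 0 else word
    return word


for word in ["valenci", "vietnamization", "predication", "hopefulness", "ration"]:
    print(word, "->", apply_rules(word))
```

The word "ration" is left unchanged: its stem "r" has measure 0, so the (m>0) condition blocks the ATION rule.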

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

Levels on each side, from bottom to top: Words, Syntactic Structure, Semantic Structure

Apex: Interlingua

Paths between the two sides: Direct, Syntactic Transfer, Semantic Transfer

102

Lemmatization

Full morphological analysis

computer

compute

computes

computed

computation

computing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term        Doc. freq.   Postings
you         1            1
come        3            1 3 4
most        1            1
carefully   1            1
upon        1            1
your        1            1
thine       1            2
be          1            2
time        1            2
Laertes     1            2
thy         1            2
fair        1            2
take        1            2
my          1            3
hour        3            1 2 3
is          1            3
almost      1            3
possess     1            4
merely      1            4
that        1            4
it          1            4
should      1            4
to          1            4
this        1            4
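
A standard inverted index over the four example documents can be built in a few lines. This sketch lowercases every token and splits on whitespace, which is a simplification of the preprocessing discussed earlier; the function name is illustrative.

```python
from collections import defaultdict

documents = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_inverted_index(documents):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        # set() drops repeated terms, so each document appears once per term.
        for term in set(documents[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(documents)
print(index["come"], index["hour"], index["it"])
```

The document frequency of a term is simply the length of its postings list, for instance 3 for "come" and 1 for "it", even though "it" occurs twice in document 4.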

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
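
The intersection walks both sorted lists with one pointer each and always advances the pointer at the smaller document ID. A minimal sketch, with an illustrative function name:

```python
def intersect(a, b):
    """Intersect two sorted postings lists in O(len(a) + len(b))."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


print(intersect([1, 2, 4, 5, 8, 9, 10, 12], [1, 3, 4, 6, 7, 8, 11, 12]))  # [1, 4, 8, 12]
```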

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B, built step by step as the two pointers advance: 1, then 1 3, then 1 3 10, then 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: skip ahead in List A directly to 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Too short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Too short skips: many comparisons, waste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Too long skips: few comparisons, not many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Too long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

In practice: about √P evenly spaced skip pointers, where P is the number of postings
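
Skip pointers can be added to the two-pointer intersection as sketched below. The sketch assumes roughly √P evenly spaced skips for a list of P postings and stores them in a dictionary from position to target position; a real index would store them inside the postings list. All names are illustrative.

```python
import math


def skip_targets(postings):
    """Place skips every isqrt(P) positions: position -> position to jump to."""
    step = math.isqrt(len(postings))
    if step < 2:
        return {}
    return {i: i + step for i in range(0, len(postings) - step, step)}


def intersect_with_skips(a, b):
    """Two-pointer intersection that follows a skip when its target does not overshoot."""
    skips_a, skips_b = skip_targets(a), skip_targets(b)
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Jump only if the skip target is still <= the other list's current ID.
            if i in skips_a and a[skips_a[i]] <= b[j]:
                i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                j = skips_b[j]
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))  # [1, 3, 10, 12]
```

With these lists, the pointer in the long list jumps from 4 to 7 to 10 instead of visiting every posting in between.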

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

138

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Query: "ETH Zurich" is looked up directly as one biword

Query: "Help ETH Zurich to flexibly react"

Rewritten as: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect the postings lists of these biwords

157

Drawback

Query: "Help ETH Zurich to flexibly react"

Rewritten as: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive: the document contains every biword of the query, but not the phrase itself

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react
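
Once every token carries an N or X tag, the extended biwords follow from pairing consecutive N tokens and ignoring the X tokens between them. In this sketch the tags are written by hand to match the slide; a real system would get them from a part-of-speech tagger. The function name is illustrative.

```python
# Hand-assigned tags (assumption): N = content word, X = function word.
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]


def extended_biwords(tagged_tokens):
    """Pair each N token with the next N token, skipping any X tokens in between."""
    content_words = [token for token, tag in tagged_tokens if tag == "N"]
    return [f"{first} {second}" for first, second in zip(content_words, content_words[1:])]


print(extended_biwords(tagged))
```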

172

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Rewritten as: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Rewritten as: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Rewritten as: "Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

Rewritten as: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative
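
The effect of the phrase length on false positives can be reproduced with a small n-word index. This sketch splits on whitespace and is case-sensitive; the document IDs 1 and 2 and all function names are illustrative assumptions.

```python
from collections import defaultdict

documents = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}


def ngrams(text, n):
    """All sequences of n consecutive words in the text."""
    words = text.split()
    return [" ".join(words[k:k + n]) for k in range(len(words) - n + 1)]


def build_phrase_index(documents, n):
    """Map every n-word sequence to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for gram in ngrams(text, n):
            index[gram].add(doc_id)
    return index


def phrase_candidates(index, phrase, n):
    """Intersect the postings of all n-word sequences of the query phrase."""
    postings = [index.get(gram, set()) for gram in ngrams(phrase, n)]
    return sorted(set.intersection(*postings)) if postings else []


query = "Help ETH Zurich to flexibly react"
print(phrase_candidates(build_phrase_index(documents, 2), query, 2))  # [1, 2]
print(phrase_candidates(build_phrase_index(documents, 3), query, 3))  # [1]
```

With biwords, document 2 is a false positive. With three-word phrases it is correctly rejected, because "Zurich to flexibly" never occurs in it.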

176

Drawback

Size of the vocabulary increases exponentially with the phrase length n: (number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Each positional posting holds the document ID (as before), the term frequency, and the positions:

Term       Document ID   Term frequency   Positions
Help       C             1                1
ETH        C             1                2
Zurich     C             1                3
to         C             3                4 7 11
flexibly   C             1                5
react      C             1                6

Query: "ETH Zurich"

ETH is at position 2 and Zurich at position 3 in document C: the positions differ by +1, so the phrase matches
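
A positional index and the "+1" check for phrase queries can be sketched as follows. Positions start at 1 as on the slides; the second document D and all function names are assumptions added for illustration.

```python
from collections import defaultdict

documents = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",  # assumed second document
}


def build_positional_index(documents):
    """Map term -> {document ID -> list of positions}; term frequency is the list length."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_search(index, phrase):
    """Return documents where the phrase terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return []
    candidates = set.intersection(*(set(index[term]) for term in terms))
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            # The k-th term of the phrase must sit at position start + k.
            if all(start + k in index[term][doc_id] for k, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


index = build_positional_index(documents)
print(index["to"]["C"])                                           # [4, 7, 11]
print(phrase_search(index, "ETH Zurich"))                         # ['C', 'D']
print(phrase_search(index, "Help ETH Zurich to flexibly react"))  # ['C']
```

Unlike the biword index, the positional index rejects document D for the long phrase without needing a larger vocabulary.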

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

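The Vauquois Triangle

[Figure: the Vauquois triangle. On each side, the levels rise from Words through Syntactic Structure and Semantic Structure to the Interlingua at the apex. The horizontal paths between the two sides are Direct (word level), Syntactic Transfer and Semantic Transfer.]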

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help, and can even degrade performance, for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term (document frequency): postings

you (1): 1
come (3): 1 3 4
most (1): 1
carefully (1): 1
upon (1): 1
your (1): 1
thine (1): 2
be (1): 2
time (1): 2
Laertes (1): 2
thy (1): 2
fair (1): 2
take (1): 2
my (1): 3
hour (3): 1 2 3
is (1): 3
almost (1): 3
possess (1): 4
merely (1): 4
that (1): 4
it (1): 4
should (1): 4
to (1): 4
this (1): 4

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B 1 4 8 12
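As a reminder of how the merge works, here is a minimal sketch of the linear intersection of two sorted postings lists (the function name is chosen for this example):

```python
def intersect(a, b):
    """Intersect two sorted postings lists in O(len(a) + len(b))."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:      # document in both lists: keep it
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:     # advance the pointer on the smaller value
            i += 1
        else:
            j += 1
    return result

list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```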

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: jump ahead in List A directly to 10

Next match: 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip pointers to 4, 7, 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skip pointers, where p is the number of postings
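A sketch of intersection with skip pointers, assuming skips of length ⌊√p⌋ placed at every ⌊√p⌋-th posting. The skip pointers are simulated with index arithmetic rather than stored explicitly, which is a simplification made for this example:

```python
import math

def intersect_with_skips(a, b):
    """Intersect sorted lists, following a skip when it does not overshoot."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Position i carries a skip pointer if it is a multiple of skip_a.
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result

list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

On this input, after matching 3 the pointer in List A jumps from 4 to 7 and from 7 to 10 instead of visiting every posting in between.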

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Query: ETH Zurich

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Inconvenient

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
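The false positive can be reproduced with a small biword index. This is a sketch: the document IDs and the first document's text are chosen for illustration, and the function names are hypothetical:

```python
from collections import defaultdict

def biwords(text):
    """All pairs of consecutive words, e.g. 'Help ETH', 'ETH Zurich', ..."""
    words = text.split()
    return [f"{w1} {w2}" for w1, w2 in zip(words, words[1:])]

def build_biword_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index

def phrase_candidates(index, phrase):
    """AND together the postings of every biword of the phrase."""
    postings = [index.get(bw, set()) for bw in biwords(phrase)]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "Help ETH Zurich to flexibly react to new challenges",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
query = "Help ETH Zurich to flexibly react"
print(phrase_candidates(index, query))
print([d for d in docs if query in docs[d]])
```

Document 2 contains every biword of the query, so it is returned as a candidate even though the phrase itself never occurs in it.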

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Inconvenient

Size of the vocabulary increases exponentially: (number of terms)^n for phrases of n words

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Each positional posting holds the document ID (as before), the term frequency, and the positions.

Help: C, 1: 1

ETH: C, 1: 2

Zurich: C, 1: 3

to: C, 3: 4 7 11

flexibly: C, 1: 5

react: C, 1: 6

190

Positional index

Query: ETH Zurich

ETH: C, 1: 2

Zurich: C, 1: 3

The position of Zurich is the position of ETH +1
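A minimal sketch of a positional index and the +1 check. Positions are counted from 1 as on the slides. Document "D" and the function names are assumptions added for this example:

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions]}; term frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index

def phrase_query(index, phrase):
    """Documents in which the k-th query term occurs at start position + k."""
    terms = phrase.split()
    hits = set()
    if not terms:
        return hits
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            if all(start + k in index.get(term, {}).get(doc_id, [])
                   for k, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",  # hypothetical second document
}
index = build_positional_index(docs)
print(index["to"]["C"])
print(phrase_query(index, "ETH Zurich"))
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

Unlike the biword index, the positional index rejects document D for the long phrase, because flexibly does not follow to at the required position.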

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents: first challenge

Possess it merely That it should come to this

28

Collecting documents: sequence of characters

P o s s e s s   i t   m e r e l y   T h a t   i t   s h o u l d   c o m e   t o   t h i s

Character set

30

Collecting documents: encoding

Possess it merely That it should come to this

P o s s ...

01010000 01101111 01110011 01110011 ...

Here ASCII
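The bit pattern for the first four characters can be checked with a few lines of Python (a small sketch, with a helper name chosen here):

```python
def ascii_bits(text):
    """One 8-bit string per character, using the ASCII encoding."""
    return [format(byte, "08b") for byte in text.encode("ascii")]

print(ascii_bits("Poss"))
```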

31

ASCII Table

    0 1 2 3 4 5 6 7 8 9 A B C D E F
0   Control characters
1   Control characters
2   SP ! " # $ % & ' ( ) * + , - . /
3   0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4   @ A B C D E F G H I J K L M N O
5   P Q R S T U V W X Y Z [ \ ] ^ _
6   ` a b c d e f g h i j k l m n o
7   p q r s t u v w x y z { | } ~ DEL

P = 50 (hex) = 0 1 0 1 0 0 0 0

34

UTF-8

Character | Codepoint | Codepoint in binary | UTF-8 (variable length)
P | U+0050 | 101 0000 (up to 7 bits) | 01010000
π | U+03C0 | 11 1100 0000 (up to 11 bits) | 11001111 10000000
€ | U+20AC | 10 0000 1010 1100 (up to 16 bits) | 11100010 10000010 10101100
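The variable-length scheme can be sketched by hand and compared with Python's built-in encoder. This is an illustrative implementation only: it does not reject surrogates or out-of-range values, and the function name is chosen here:

```python
def utf8_encode(codepoint: int) -> bytes:
    if codepoint < 0x80:       # up to 7 bits:  0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:      # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:    # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])

for ch in "Pπ€":
    encoded = utf8_encode(ord(ch))
    print(ch, f"U+{ord(ch):04X}", " ".join(format(b, "08b") for b in encoded))
```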

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification (e.g., ML)

Language

Encoding

Type

UTF-8

50

Collecting documents: second challenge

Fine-grained documents (example: e-mail archive)

Grouping to a single document (example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation
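A minimal sketch of this step followed by index construction. The punctuation in the sample strings is added here for illustration, and splitting on whitespace stands in for a real tokenizer:

```python
import re
from collections import defaultdict

# The four example documents; punctuation placement is an assumption.
docs = {
    1: "You come most carefully upon your hour.",
    2: "Take thy fair hour, Laertes; time be thine.",
    3: "My hour is almost come.",
    4: "Possess it merely. That it should come to this!",
}

def build_index(docs):
    """term -> sorted list of IDs of the documents containing the term."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # Throw away punctuation, then split on whitespace.
        words = re.sub(r"[^\w\s]", "", docs[doc_id]).split()
        for term in set(words):
            index[term].append(doc_id)
    return dict(index)

index = build_index(docs)
print(index["come"])
print(index["hour"])
```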

58

Building an inverted index

This is not trivial

Throw away punctuation

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt

59

Corner cases

Hewlett-Packard

State-of-the-art

co-education

the hold-him-back-and-drag-him-away maneuver

data base

San Francisco

Los Angeles-based company

cheap San Francisco-Los Angeles fares

York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases numbers

3/20/91

20/3/91

Mar 20, 1991

B-52

100.2.86.144

(800) 234-2333

800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Chinese example sentence; the characters were lost in extraction]

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

Parts of the word, in order: eat, go to, want, Past, <Frustration>, inferential/evidential (turns out), 3rd/3rd, also

Also, it turns out she/he wanted to go eat it, but...

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

Type=equivalence class (same sequences)
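The difference between tokens and types can be seen by counting both in Document 4. In this small sketch, whitespace splitting stands in for a real tokenizer:

```python
from collections import Counter

text = "Possess it merely That it should come to this"

tokens = text.split()    # every occurrence counts
types = Counter(tokens)  # identical character sequences collapse into one type

print(len(tokens), "tokens")
print(len(types), "types")
print(types["it"], "occurrences of 'it'")
```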

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters' list

a an and are as at be by for from has he in

is it its of on that the to was were will with
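Filtering with this list could look as follows. This is a sketch in which tokens are compared case-insensitively, which is an assumption made here:

```python
STOP_WORDS = set(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)

def remove_stop_words(tokens):
    """Keep only the tokens that are not on the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "Possess it merely That it should come to this".split()
print(remove_stop_words(tokens))
```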

86

Trend

Large list (200-300) → No list at all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to


17

Posting

Term-document bipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions + white magic

20

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s   i t   m e r e l y   T h a t   i t   s h o u l d   c o m e   t o   t h i s

29

Character set

P o s s e s s   i t   m e r e l y   T h a t   i t   s h o u l d   c o m e   t o   t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

01010000 01101111 01110011 01110011 (the bytes of P, o, s, s)

Here: ASCII

31

ASCII Table

     0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
0    Control characters
1    Control characters
2    SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3    0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4    @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5    P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6    `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7    p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL

Example: P sits in row 5, column 0.

50 (hex) 0 1 0 1 0 0 0 0
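The table lookup above can be reproduced in a few lines. This is a minimal sketch that relies on Python's built-in `str.encode`; the helper name `ascii_bits` is an illustrative choice, not something from the slides.

```python
def ascii_bits(text: str) -> list[str]:
    """Return the 8-bit binary string of each character, encoded as ASCII."""
    return [format(byte, "08b") for byte in text.encode("ascii")]


print(hex(ord("P")))       # 0x50
print(ascii_bits("Poss"))  # ['01010000', '01101111', '01110011', '01110011']
```

Characters outside the ASCII range make `encode("ascii")` raise a `UnicodeEncodeError`, which is one reason the encoding of a collected document has to be known.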

34

UTF-8

Character   Codepoint   Codepoint in binary    UTF-8 (variable length)
P           U+0050      101 0000               01010000
π           U+03C0      11 1100 0000           11001111 10000000
€           U+20AC      10 0000 1010 1100      11100010 10000010 10101100

Up to 7 bits: 1 byte

Up to 11 bits: 2 bytes

Up to 16 bits: 3 bytes

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1
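The variable-length behaviour of UTF-8 is easy to observe directly. A minimal sketch using Python's built-in `str.encode`; the helper name `utf8_bits` is illustrative only.

```python
def utf8_bits(char: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated bit strings."""
    return " ".join(format(byte, "08b") for byte in char.encode("utf-8"))


for ch in ["P", "π", "€"]:
    # Codepoint in hexadecimal, followed by the encoded bytes.
    print(f"{ch}  U+{ord(ch):04X}  {utf8_bits(ch)}")
```

The three characters come out as one, two and three bytes respectively, so byte offsets and character offsets differ as soon as a document contains non-ASCII text.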

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification (e.g., ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example: e-mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping into a single document (example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt

59

Corner cases

Hewlett-Packard

State-of-the-art

co-education

the hold-him-back-and-drag-him-away maneuver

data base

San Francisco

Los Angeles-based company

cheap San Francisco-Los Angeles fares

York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases numbers

3/20/91

20/3/91

Mar 20, 1991

B-52

100.2.86.144

(800) 234-2333

800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Chinese example sentence; the characters did not survive text extraction]

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

Morpheme glosses: eat, go to, want, past, <frustration>, inferential/evidential ("turns out"), 3rd/3rd, also

Translation: "Also, it turns out she/he wanted to go eat it, but..."

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

Type=equivalence class (same sequences)
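The distinction between tokens and types can be checked on Document 4. This is a minimal sketch with a deliberately naive whitespace tokenizer; the function name `tokenize` is illustrative, and real tokenizers need to handle the corner cases discussed earlier.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive tokenizer: split on whitespace, no punctuation handling."""
    return text.split()


doc4 = "Possess it merely That it should come to this"
tokens = tokenize(doc4)
types = Counter(tokens)  # one entry per distinct character sequence

print(len(tokens))  # 9 tokens
print(len(types))   # 8 types, because "it" occurs twice
```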

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters's list

a an and are as at be by for from has he in

is it its of on that the to was were will with
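Applying such a list is a simple filter over the token stream. A minimal sketch, assuming the 25 words listed above and a case-insensitive comparison (the case folding is an assumption of this example, not something the slide specifies).

```python
STOP_WORDS = frozenset(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only the tokens whose lowercased form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


doc4 = "Possess it merely That it should come to this".split()
print(remove_stop_words(doc4))
# ['Possess', 'merely', 'should', 'come', 'this']
```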

86

Trend

Large list (200-300)

No list at all

87

Equivalence classes of terms (types)

U.S.A., USA

to-day, today

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part
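A character-removal rule of this kind fits in one regular expression. This is a minimal sketch that drops periods and hyphens only; the function name `normalize` and the choice of characters are assumptions made for illustration.

```python
import re


def normalize(token: str) -> str:
    """Drop periods and hyphens so that spelling variants map to one term."""
    return re.sub(r"[.\-]", "", token)


print(normalize("U.S.A."))  # USA
print(normalize("to-day"))  # today
```

The same rule would also merge tokens that should stay apart (for example a hyphenated name), which is why only the removal rules count as the easy part.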

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings of Lift (1 5) and Elevator (1 4 6) are both expanded to 1 4 5 6, and the query Lift runs unchanged.

Upon querying: the index stays as it is (Lift: 1 5, Elevator: 1 4 6), and the query Lift is expanded to Lift OR Elevator.
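Expansion at query time amounts to a union of postings lists. A minimal sketch, assuming a hand-written synonym table; the dictionary `SYNONYMS`, the function names and the posting values are illustrative and not taken from a real system.

```python
# Illustrative index: term -> sorted list of document IDs.
index = {"lift": [1, 5], "elevator": [1, 4, 6]}

# Hypothetical synonym table used to expand a query term.
SYNONYMS = {"lift": ["elevator"], "elevator": ["lift"]}


def union(a: list[int], b: list[int]) -> list[int]:
    """Merge two postings lists into one sorted list without duplicates."""
    return sorted(set(a) | set(b))


def search_expanded(term: str) -> list[int]:
    """Evaluate `term OR synonym1 OR synonym2 ...` against the index."""
    result = index.get(term, [])
    for synonym in SYNONYMS.get(term, []):
        result = union(result, index.get(synonym, []))
    return result


print(search_expanded("lift"))  # [1, 4, 5, 6]
```

Expanding at query time keeps the index small and lets the synonym table change without reindexing, at the cost of more work per query.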

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich
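Both conventions can be sketched with the standard `unicodedata` module. Plain diacritic stripping turns ü into u, so the German ue spelling needs an explicit mapping; the table `GERMAN` and the function names below are assumptions for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Hypothetical language-specific rule: German umlauts become vowel + e.
GERMAN = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def normalize_german(text: str) -> str:
    """Apply the umlaut mapping first, then strip remaining diacritics."""
    composed = unicodedata.normalize("NFC", text)
    return strip_diacritics(composed.translate(GERMAN))


print(strip_diacritics("cliché"))  # cliche
print(strip_diacritics("Zürich"))  # Zurich
print(normalize_german("Zürich"))  # Zuerich
```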

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi

feature -> featur

are -> ar

easily -> easili

visible -> visibl

variations -> variat

individual -> individu

Chop letters off the end of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

Rule                         Example
(m>0) ENCI    -> ENCE        valenci -> valence
(m>0) ANCI    -> ANCE        hesitanci -> hesitance
(m>0) IZER    -> IZE         digitizer -> digitize
(m>0) ABLI    -> ABLE        conformabli -> conformable
(m>0) ALLI    -> AL          radicalli -> radical
(m>0) ENTLI   -> ENT         differentli -> different
(m>0) ELI     -> E           vileli -> vile
(m>0) OUSLI   -> OUS         analogousli -> analogous
(m>0) IZATION -> IZE         vietnamization -> vietnamize
(m>0) ATION   -> ATE         predication -> predicate
(m>0) ATOR    -> ATE         operator -> operate
(m>0) ALISM   -> AL          feudalism -> feudal
(m>0) IVENESS -> IVE         decisiveness -> decisive
(m>0) FULNESS -> FUL         hopefulness -> hopeful
(m>0) OUSNESS -> OUS         callousness -> callous
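The rules listed above can be sketched as a suffix table plus the measure m, the number of vowel-consonant sequences in the remaining stem. This is a minimal sketch of that single group of rules only, not the full Porter stemmer; the names `measure`, `apply_rules` and `RULES` are illustrative.

```python
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word: str, i: int) -> bool:
    """A letter is a consonant unless it is a, e, i, o, u, or a y after a consonant."""
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the vowel-consonant sequences in the stem (the m of the rules)."""
    m, previous_was_vowel = 0, False
    for i in range(len(stem)):
        if is_consonant(stem, i):
            if previous_was_vowel:
                m += 1
            previous_was_vowel = False
        else:
            previous_was_vowel = True
    return m


def apply_rules(word: str) -> str:
    """Rewrite the first matching suffix, provided the stem satisfies m > 0."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


print(apply_rules("valenci"))         # valence
print(apply_rules("vietnamization"))  # vietnamize
print(apply_rules("nation"))          # nation (stem "n" has m = 0)
```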

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

[Figure: on each side of the triangle, the levels Words, Syntactic Structure, Semantic Structure, with Interlingua at the top; the two sides are connected by Direct, Syntactic Transfer and Semantic Transfer paths]

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term        Document frequency   Postings
you         1                    1
come        3                    1 3 4
most        1                    1
carefully   1                    1
upon        1                    1
your        1                    1
hour        3                    1 2 3
take        1                    2
thy         1                    2
fair        1                    2
Laertes     1                    2
time        1                    2
be          1                    2
thine       1                    2
my          1                    3
is          1                    3
almost      1                    3
possess     1                    4
it          1                    4
merely      1                    4
that        1                    4
should      1                    4
to          1                    4
this        1                    4

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
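The linear merge behind this reminder can be sketched as follows. It assumes both postings lists are sorted in increasing order; the function name `intersect` is an illustrative choice.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted postings lists with one pointer per list."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])  # document occurs in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1               # advance the pointer on the smaller ID
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))  # [1, 4, 8, 12]
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.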

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B so far: 1 3

Magical pointer: after the match on 3, jump in List A straight to 10, skipping 4 through 9. The next match is 10.

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip pointers: 4 7 10

Two short skips: many comparisons, waste of space

Two long skips: few comparisons, not many real opportunities to skip

In practice: √p skip pointers, where p is the number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
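The "magical pointer" can be approximated with skip pointers placed every √p postings. This is a minimal sketch in which the skips are implicit (a posting at an index divisible by the skip length points that many positions ahead); the function name and this layout are assumptions made for illustration.

```python
import math


def intersect_with_skips(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted lists, following skip pointers when it is safe."""
    skip_a = max(1, math.isqrt(len(a)))  # roughly sqrt(p) postings per skip
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Take the skip only if its target does not overshoot b[j].
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print(intersect_with_skips(list_a, list_b))  # [1, 3, 10, 12]
```

On these lists the pointer in A jumps from 4 to 7 and from 7 to 10 instead of visiting every posting in between.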

133

Phrase search

134

Phrase queries

135

With a posting list

"ETH Zürich" (quotes = phrase search)

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Query: ETH Zurich

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Rewritten query: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Inconvenient

Query: Help ETH Zurich to flexibly react

Rewritten query: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

Every biword of the query occurs in the document, although the phrase itself does not.

False positive
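The false positive can be reproduced with a small biword index. A minimal sketch, assuming whitespace tokenization and lowercasing; the function names and the document ID 1 are illustrative.

```python
from collections import defaultdict


def biwords(text: str) -> list[str]:
    """Return every pair of consecutive words as a single biword term."""
    words = text.lower().split()
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


def build_biword_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each biword to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index


def phrase_candidates(index: dict[str, set[int]], query: str) -> set[int]:
    """AND together the postings of all biwords of the query."""
    postings = [index.get(biword, set()) for biword in biwords(query)]
    return set.intersection(*postings) if postings else set()


docs = {1: "Help ETH Zurich to introduce techniques to flexibly react"}
index = build_biword_index(docs)
print(phrase_candidates(index, "Help ETH Zurich to flexibly react"))
# {1}, although the document does not contain the phrase
```

The results of a biword query are therefore only candidates; confirming them requires a check against the document text or positional information.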

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react
N    N   N      X  N         N          X  X  X  N        N

Pattern: NX*N

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Rewritten query: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Rewritten query: Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

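Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Rewritten query: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

The document does not contain "Zurich to flexibly", so it is no longer returned.

True negative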

176

Inconvenient

Size of the vocabulary increases exponentially: |Terms|^n
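To make the growth concrete, the bound can be written out for phrases of n words. The vocabulary size of 10^5 terms below is an illustrative assumption, and the bound counts all possible phrases, not only those that occur in a collection.

```latex
\[
  \#\{\text{possible } n\text{-word phrases}\} \;\le\; |\mathrm{Terms}|^{n},
  \qquad
  |\mathrm{Terms}| = 10^{5},\ n = 3
  \;\Longrightarrow\;
  |\mathrm{Terms}|^{3} = 10^{15}.
\]
```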

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Each positional posting holds the document ID (as before), the term frequency, and the positions.

Term       Document   Term frequency   Positions
Help       C          1                1
ETH        C          1                2
Zurich     C          1                3
to         C          3                4 7 11
flexibly   C          1                5
react      C          1                6

190

Positional index

Query: ETH Zurich

Term       Document   Term frequency   Positions
ETH        C          1                2
Zurich     C          1                3

Match: Zurich occurs at the position of ETH + 1 (2 + 1 = 3)
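The "+1" check can be sketched with a small positional index. This minimal sketch assumes whitespace tokenization, lowercasing and positions counted from 1; the function names are illustrative, and the term frequency is simply the length of each position list.

```python
from collections import defaultdict


def build_positional_index(docs: dict[str, str]) -> dict[str, dict[str, list[int]]]:
    """Map term -> {document ID: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_search(index, phrase: str) -> set[str]:
    """Return documents where the k-th phrase word sits at start position + k."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            if all(start + k in index[term].get(doc_id, [])
                   for k, term in enumerate(terms[1:], start=1)):
                hits.add(doc_id)
                break
    return hits


docs = {"C": "Help ETH Zurich to flexibly react to new challenges "
             "and to set new accents in the future"}
index = build_positional_index(docs)
print(index["to"])                          # {'C': [4, 7, 11]}
print(phrase_search(index, "ETH Zurich"))   # {'C'}
print(phrase_search(index, "Zurich ETH"))   # set()
```

Unlike the biword approach, this check produces no false positives, at the cost of storing one position per token occurrence.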

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

58

Building an inverted index

Throw away punctuation: this is not trivial

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt

59

Corner cases

Hewlett-Packard

State-of-the-art

co-education

the hold-him-back-and-drag-him-away maneuver

data base

San Francisco

Los Angeles-based company

cheap San Francisco-Los Angeles fares

York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases: numbers

3/20/91

20/3/91

Mar 20, 1991

B-52

100.2.86.144

(800) 234-2333

800.234.2333

Credits: H. Schütze, LMU
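To see why throwing away punctuation is not trivial, here is a minimal sketch of a regular-expression tokenizer. The pattern and the name `tokenize` are assumptions made for illustration: the pattern keeps e-mail addresses whole and keeps hyphens, apostrophes and dots when they sit inside a token, which is only one of many defensible policies.

```python
import re

# Hypothetical pattern: one possible tokenization policy, not the lecture's.
TOKEN_PATTERN = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"  # e-mail addresses are kept whole
    r"|\w+(?:[-'.]\w+)*"             # words and numbers with inner - ' or .
)


def tokenize(text: str) -> list[str]:
    """Return the tokens of `text`; punctuation outside tokens is dropped."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Mail name@example.com, isn't it?"))
print(tokenize("Jake O'Neill, the learn-it-all-by-heart methodology"))
print(tokenize("100.2.86.144 on 3/20/91"))
```

Under this policy the IP address survives as one token while the date is split at the slashes, which illustrates that every rule fixes some corner cases and breaks others.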

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

(Chinese example sentence; the characters did not survive extraction.)

68

Hindi

मुझे इन्फ़र्मेशन रिट्रीवल बहुत पसंद है

To me (oblique case), 3rd person

मैं इस टर्म अच्छा पढ़ रहा हूँ

Male speaker (raha), 1st person

मैं इस टर्म अच्छा पढ़ रही हूँ

Female speaker (rahee), 1st person

70

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

eat

go to

want

Past

<Frustration>

inferential/evidential (turns out)

3rd / 3rd

also

Also, it turns out she/he wanted to go eat it, but...

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

81

Token

Token = grouped sequence of characters

82

Type

Type = equivalence class (same sequences)

84

Stop words
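The difference between tokens and types can be checked on the four example documents. This is a minimal sketch that assumes whitespace tokenization of the already punctuation-free lines; the names `DOCUMENTS`, `tokens` and `types` are illustrative.

```python
DOCUMENTS = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# A token is one occurrence of a character sequence in the text.
tokens = [token for document in DOCUMENTS for token in document.split()]

# A type is the equivalence class of all tokens with the same sequence.
types = set(tokens)

print(len(tokens), "tokens")
print(len(types), "types")
```

The counts differ because "come" and "hour" occur three times each and "it" occurs twice.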

85

Reuters' list

a an and are as at be by for from has he in

is it its of on that the to was were will with
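Filtering with such a list is a one-liner. The sketch below assumes tokens are compared case-insensitively; the names `REUTERS_STOP_WORDS` and `remove_stop_words` are chosen here for illustration.

```python
REUTERS_STOP_WORDS = frozenset(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop every token whose lowercase form is in the stop list."""
    return [token for token in tokens if token.lower() not in REUTERS_STOP_WORDS]


print(remove_stop_words("Possess it merely That it should come to this".split()))
```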

86

Trend

From a large list (200-300) to no list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings lists of Lift and Elevator are expanded into each other, so the query Lift is looked up as is.

Upon querying: the postings lists are left unexpanded, and the query Lift is expanded to Lift OR Elevator.

(Figure: postings lists for Lift and Elevator over documents 1, 5, 6 and 41, before and after expansion.)
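The two expansion strategies can be compared on a toy index. This is a sketch under stated assumptions: the postings in `INDEX` are hypothetical (the split of documents between the two terms is not taken from the slides), and `SYNONYMS`, `expand_index` and `expand_query` are illustrative names.

```python
# Hypothetical postings and synonym table, for illustration only.
INDEX = {"lift": [1, 5], "elevator": [6, 41]}
SYNONYMS = {"lift": {"elevator"}, "elevator": {"lift"}}


def expand_index(index, synonyms):
    """Expansion upon indexing: each term also stores its synonyms' postings."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for synonym in synonyms.get(term, ()):
            merged.update(index.get(synonym, ()))
        expanded[term] = sorted(merged)
    return expanded


def expand_query(term, index, synonyms):
    """Expansion upon querying: evaluate `term OR synonym_1 OR ...`."""
    result = set(index.get(term, ()))
    for synonym in synonyms.get(term, ()):
        result.update(index.get(synonym, ()))
    return sorted(result)


print(expand_index(INDEX, SYNONYMS)["lift"])
print(expand_query("lift", INDEX, SYNONYMS))
```

Both strategies return the same documents; index-time expansion costs space, query-time expansion costs one extra union per query.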

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside
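Lowercasing and diacritic removal can be sketched with the standard library. The transliteration table and the names `strip_diacritics` and `normalize` are assumptions for illustration; truecasing is left out because it needs a trained model.

```python
import unicodedata

# Hypothetical German transliteration table (Zürich -> Zuerich).
GERMAN_TRANSLITERATION = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def strip_diacritics(token: str) -> str:
    """Decompose characters and drop the combining marks (é -> e)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(token: str, german: bool = False) -> str:
    """Lowercase, optionally transliterate German umlauts, then strip accents."""
    token = unicodedata.normalize("NFC", token).lower()
    if german:
        token = token.translate(GERMAN_TRANSLITERATION)
    return strip_diacritics(token)


print(normalize("cliché"))
print(normalize("Zürich", german=True))
print(normalize("CAT"), normalize("Schweiz"))
```

Without the German table, Zürich would become zurich rather than zuerich, so the two spellings would land in different equivalence classes.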

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE          valenci -> valence
(m>0) ANCI -> ANCE          hesitanci -> hesitance
(m>0) IZER -> IZE           digitizer -> digitize
(m>0) ABLI -> ABLE          conformabli -> conformable
(m>0) ALLI -> AL            radicalli -> radical
(m>0) ENTLI -> ENT          differentli -> different
(m>0) ELI -> E              vileli -> vile
(m>0) OUSLI -> OUS          analogousli -> analogous
(m>0) IZATION -> IZE        vietnamization -> vietnamize
(m>0) ATION -> ATE          predication -> predicate
(m>0) ATOR -> ATE           operator -> operate
(m>0) ALISM -> AL           feudalism -> feudal
(m>0) IVENESS -> IVE        decisiveness -> decisive
(m>0) FULNESS -> FUL        hopefulness -> hopeful
(m>0) OUSNESS -> OUS        callousness -> callous
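The rules above are suffix rewrites guarded by the condition m>0, where m counts vowel-consonant sequences in the remaining stem. The following sketch applies only the rules listed on this slide, not the full Porter algorithm; `RULES`, `measure` and `apply_rules` are illustrative names.

```python
import re

# Only the rules shown on the slide (a subset of one Porter step).
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def measure(stem: str) -> int:
    """Return m, the number of vowel-run/consonant-run pairs in the stem."""
    shape = ""
    for i, letter in enumerate(stem):
        # 'y' counts as a vowel only when it follows a consonant.
        vowel = letter in "aeiou" or (letter == "y" and i > 0 and shape[-1] == "c")
        shape += "v" if vowel else "c"
    return len(re.findall(r"v+c+", shape))


def apply_rules(word: str) -> str:
    """Rewrite the longest matching suffix if the remaining stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for example in ["valenci", "vietnamization", "predication", "hopefulness"]:
    print(example, "->", apply_rules(example))
```

Trying the longest suffix first matters for words such as vietnamization, which ends in both IZATION and ATION.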

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

(Figure: the Vauquois triangle. On both sides, from bottom to top: Words, Syntactic Structure, Semantic Structure, with Interlingua at the apex. Translation levels: Direct, Syntactic Transfer, Semantic Transfer.)

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help, and can even degrade performance, for English documents

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term        Document frequency   Postings
you         1                    1
come        3                    1 3 4
most        1                    1
carefully   1                    1
upon        1                    1
your        1                    1
thine       1                    2
be          1                    2
time        1                    2
Laertes     1                    2
thy         1                    2
fair        1                    2
take        1                    2
my          1                    3
hour        3                    1 2 3
is          1                    3
almost      1                    3
possess     1                    4
merely      1                    4
that        1                    4
it          1                    4
should      1                    4
to          1                    4
this        1                    4
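The postings above follow mechanically from the four example documents. A minimal sketch, assuming lowercasing and whitespace tokenization; `DOCS` and `build_inverted_index` are illustrative names.

```python
from collections import defaultdict

DOCS = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each lowercased term to the sorted list of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}


index = build_inverted_index(DOCS)
print(index["come"], index["hour"], index["it"])
```

The document frequency of a term is simply the length of its postings list, for example `len(index["come"])`.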

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
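The intersection walks both sorted lists with one pointer each and always advances the pointer at the smaller document ID. A minimal sketch; the function name `intersect` is chosen here for illustration.

```python
def intersect(list_a: list[int], list_b: list[int]) -> list[int]:
    """Intersect two sorted postings lists in O(len(a) + len(b)) steps."""
    i = j = 0
    result = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])  # document contains both terms
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1                    # advance the pointer at the smaller ID
        else:
            j += 1
    return result


print(intersect([1, 2, 4, 5, 8, 9, 10, 12], [1, 3, 4, 6, 7, 8, 11, 12]))
```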

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

(The figure shows one further posting, 14, at the end of a list.)

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: jump in List A straight to 10

Intersection of A and B: 1 3 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip pointers to 4, 7, 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

132

In practice: √p skips, where p = number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
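Skip pointers can be simulated on plain arrays. This is a sketch under stated assumptions: a skip pointer is assumed at every index that is a multiple of floor(√p), pointing floor(√p) positions ahead, and `intersect_with_skips` is an illustrative name.

```python
import math


def intersect_with_skips(list_a: list[int], list_b: list[int]) -> list[int]:
    """Intersect two sorted postings lists, following skip pointers when safe."""
    skip_a = max(1, math.isqrt(len(list_a)))  # skip length for list A
    skip_b = max(1, math.isqrt(len(list_b)))  # skip length for list B
    i = j = 0
    result = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            # Take the skip only if its target does not overshoot list_b[j].
            if i % skip_a == 0 and i + skip_a < len(list_a) and list_a[i + skip_a] <= list_b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(list_b) and list_b[j + skip_b] <= list_a[i]:
                j += skip_b
            else:
                j += 1
    return result


print(intersect_with_skips([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12], [1, 3, 10, 11, 12]))
```

On these lists the pointer into the long list jumps from 4 to 7 to 10 instead of visiting every posting in between.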

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Query: "ETH Zurich"

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Intersect: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

157

Inconvenient

Query: "Help ETH Zurich to flexibly react"

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
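The false positive can be reproduced with a small biword index. A minimal sketch, assuming lowercasing and whitespace tokenization; the document IDs and the names `biwords`, `build_biword_index` and `phrase_candidates` are illustrative.

```python
from collections import defaultdict

DOCS = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}


def biwords(text: str) -> list[tuple[str, str]]:
    """Return every pair of consecutive lowercased words."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


def build_biword_index(docs: dict[int, str]) -> dict[tuple[str, str], set[int]]:
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pair in biwords(text):
            index[pair].add(doc_id)
    return index


def phrase_candidates(index, phrase: str) -> set[int]:
    """AND together the postings of all biwords of the phrase."""
    pairs = biwords(phrase)
    if not pairs:
        return set()
    result = set(index.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= index.get(pair, set())
    return result


index = build_biword_index(DOCS)
print(phrase_candidates(index, "ETH Zurich"))
print(phrase_candidates(index, "Help ETH Zurich to flexibly react"))
```

Document 2 is returned for the long query although it does not contain the phrase, because each of the five biwords occurs somewhere in it.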

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NX*N

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
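Once every word carries an N or X tag, the extended biwords follow from pairing each N word with the next N word and ignoring the X words in between. The sketch below takes the tags as given input; producing them needs a part-of-speech tagger, which is not shown, and `extended_biwords` is an illustrative name.

```python
def extended_biwords(tagged_tokens: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Pair consecutive N-tagged words, skipping any X-tagged words (N X* N)."""
    content_words = [word for word, tag in tagged_tokens if tag == "N"]
    return list(zip(content_words, content_words[1:]))


TAGGED = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]

for pair in extended_biwords(TAGGED):
    print(" ".join(pair))
```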


173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Intersect: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Intersect: "Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Inconvenient

Size of the vocabulary increases exponentially: (#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting: document ID (as before), term frequency, positions

Index

Help: document C, term frequency 1, positions 1

ETH: document C, term frequency 1, positions 2

Zurich: document C, term frequency 1, positions 3

to: document C, term frequency 3, positions 4 7 11

flexibly: document C, term frequency 1, positions 5

react: document C, term frequency 1, positions 6

190

Positional index

Query: "ETH Zurich"

ETH: document C, term frequency 1, positions 2

Zurich: document C, term frequency 1, positions 3

The position of Zurich is the position of ETH +1: phrase match
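A phrase query on a positional index checks that the terms occur at consecutive positions in the same document. A minimal sketch, assuming lowercasing, whitespace tokenization and 1-based positions; the document IDs and the names `build_positional_index` and `phrase_search` are illustrative.

```python
from collections import defaultdict

DOCS = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}


def build_positional_index(docs: dict[int, str]) -> dict[str, dict[int, list[int]]]:
    """Map term -> document ID -> sorted list of 1-based positions."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split(), start=1):
            index[token].setdefault(doc_id, []).append(position)
    return index


def phrase_search(index, phrase: str) -> list[int]:
    """Return the documents where the phrase terms occur at positions p, p+1, ..."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            if all(start + k in index[term][doc_id] for k, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


index = build_positional_index(DOCS)
print(index["to"][1])
print(phrase_search(index, "ETH Zurich"))
print(phrase_search(index, "Help ETH Zurich to flexibly react"))
```

The term frequency is the length of each position list, and the document that was a false positive for the biword index is no longer returned for the long phrase.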

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists


38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12

Trailing posting shown on the slide: 14

Intersection of A and B, built up step by step: 1, then 1 3, then 1 3 10, then 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

Intersection of A and B so far: 1 3

Magical pointer: with List B now at 10, jump directly to 10 in List A

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip pointers: 4 7 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, but not many real opportunities to skip

In practice: √p skip pointers, where p is the number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
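
A possible way to put the √p rule into code is sketched below. It assumes skip pointers are stored as a dictionary from a position in the list to the position it skips to; `build_skips` and `intersect_with_skips` are hypothetical names, and real indexes store skips differently.

```python
import math


def build_skips(postings):
    """Place roughly sqrt(p) evenly spaced skip pointers (index -> target index)."""
    p = len(postings)
    step = max(1, int(math.sqrt(p)))
    return {i: i + step for i in range(0, p - step, step)}


def intersect_with_skips(a, b):
    """Intersect two sorted postings lists, following skips when they do not overshoot."""
    skips_a, skips_b = build_skips(a), build_skips(b)
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Take the skip only if its target is still <= the other list's value.
            if i in skips_a and a[skips_a[i]] <= b[j]:
                i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                j = skips_b[j]
            else:
                j += 1
    return result


long_list = list(range(1, 17))    # 16 postings, so skips of length 4
short_list = [1, 3, 10, 11, 12]
print(intersect_with_skips(long_list, short_list))  # [1, 3, 10, 11, 12]
```

A skip is safe whenever its target does not exceed the current value in the other list, because everything it jumps over is then strictly smaller than that value.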

133

Phrase search

134

Phrase queries

With a posting list

"ETH Zürich" (quotes = phrase search)

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

151

Bi-word indices

Query: ETH Zurich

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Drawback

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help (N), ETH (N), Zurich (N), to (X), introduce (N), techniques (N), so (X), as (X), to (X), flexibly (N), react (N)

Pattern: NX*N

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index

Help ETH Zurich
ETH Zurich to
Zurich to flexibly
to flexibly react
flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index

Help ETH Zurich to
ETH Zurich to flexibly
Zurich to flexibly react
to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative
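
The false positive with biwords, and its disappearance with three-word phrases, can be checked with a small sketch. It assumes whitespace tokenization and represents each document simply by the set of its word n-grams; `ngrams` and `ngram_match` are illustrative names, not part of the slides.

```python
def ngrams(text, n):
    """Return the set of n-word phrases occurring in the text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_match(query, document, n):
    """True if every n-word phrase of the query occurs somewhere in the document."""
    return ngrams(query, n) <= ngrams(document, n)


query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(ngram_match(query, document, 2))  # True: false positive with biwords
print(ngram_match(query, document, 3))  # False: "Zurich to flexibly" is missing
print(query in document)                # False: the phrase itself does not occur
```

With biwords, each pair of the query occurs somewhere in the document, just not contiguously, which is exactly why the conjunction of biwords over-matches.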

176

Drawback

Size of the vocabulary increases exponentially with the phrase length n: (number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting: document ID (as before), term frequency, positions

Index

Help -> C, 1: 1
ETH -> C, 1: 2
Zurich -> C, 1: 3
to -> C, 3: 4 7 11
flexibly -> C, 1: 5
react -> C, 1: 6

190

Positional index

Query: ETH Zurich

ETH -> C, 1: 2

Zurich -> C, 1: 3

The position of Zurich is the position of ETH + 1, so the phrase matches
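
As a rough sketch of how a positional index answers a phrase query, the code below builds the index for document C and checks positions that differ by +1. It assumes whitespace tokenization and 1-based positions as on the slides; the function names and the nested-dictionary layout are choices made for this illustration only.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Return the documents in which the terms of the phrase occur at consecutive positions."""
    terms = phrase.split()
    hits = []
    for doc_id, positions in index.get(terms[0], {}).items():
        for start in positions:
            # The k-th term of the phrase must sit at position start + k.
            if all(start + k in index.get(term, {}).get(doc_id, [])
                   for k, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


docs = {"C": "Help ETH Zurich to flexibly react to new challenges "
             "and to set new accents in the future"}
index = build_positional_index(docs)

print(index["to"]["C"])                    # [4, 7, 11], term frequency 3
print(phrase_query(index, "ETH Zurich"))   # ['C']
print(phrase_query(index, "Zurich ETH"))   # []
```

The term frequency does not need to be stored separately here, since it is the length of the position list.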

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

20

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents: first challenge

Possess it merely That it should come to this

Collecting documents: sequence of characters

P o s s e s s   i t   m e r e l y   T h a t   i t   s h o u l d   c o m e   t o   t h i s

Character set

Collecting documents: encoding

P o s s -> 01010000 01101111 01110011 01110011 (here ASCII)

31

ASCII Table

     0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
0    Control characters
1    Control characters
2    SP !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
3    0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
4    @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
5    P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
6    `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
7    p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~  DEL

P = 50 (hex) = 0 1 0 1 0 0 0 0

34

UTF-8

Character   Codepoint   Codepoint in binary                     UTF-8 (variable length)
P           U+0050      101 0000 (at most 7 bits)               01010000
π           U+03C0      11 1100 0000 (at most 11 bits)          11001111 10000000
€           U+20AC      10 0000 1010 1100 (at most 16 bits)     11100010 10000010 10101100
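
The byte patterns for these three characters can be reproduced with Python's built-in `str.encode`. The helper name `utf8_bits` is only an illustrative choice.

```python
def utf8_bits(ch):
    """Return the UTF-8 encoding of a character as space-separated bytes in binary."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))


for ch in "Pπ€":
    print(ch, f"U+{ord(ch):04X}", utf8_bits(ch))
# P U+0050 01010000
# π U+03C0 11001111 10000000
# € U+20AC 11100010 10000010 10101100
```

The leading bits of the first byte (0, 110, 1110) tell a decoder how many bytes the character occupies, and every continuation byte starts with 10.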

44

Collecting documents: first challenge

ASCII, UTF-8, ISO-Latin-1

User-defined, annotated in document, machine learning

Software-specific encoding

Data-specific encoding: &amp; &lt;

Binary encoding

Classification (e.g. ML): language, encoding (UTF-8), type

50

Collecting documents: second challenge

Fine-grained documents (example: e-mail archive)

Grouping to a single document (example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

This is not trivial:

name@example.com
isn't
Jake O'Neill
the learn-it-all-by-heart methodology
website vs web site
Kanton Basel Stadt

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs New York University

Credits: H. Schütze, LMU

60

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Chinese example sentence]

68

Hindi

मुझे इन्फ़र्मेशन रिट्रीवोल बहुत पसंद है

मैं इस टर्म अच्छा पढ़ रहा हूँ (male speaker: raha)

मैं इस टर्म अच्छा पढ़ रही हूँ (female speaker: rahee)

मुझे: to me (oblique case); मैं: I; है: 3rd person; हूँ: 1st person

70

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

Morpheme by morpheme: eat, go to, want, past, <frustration>, inferential evidential (turns out), 3rd -> 3rd, also

Also, it turns out she/he wanted to go eat it, but...

80

Tokenize

You come most carefully upon your hour
Take thy fair hour Laertes time be thine
My hour is almost come
Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class (same sequences)

Stop words

85

Reuters's list

a an and are as at be by for from has he in is it its of on that the to was were will with
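
The distinction between tokens, types and stop words can be made concrete on the four example documents. This is a minimal sketch, assuming whitespace tokenization with punctuation already removed, and lowercasing only for the comparison against the stop list; the variable names are illustrative.

```python
docs = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}

# The 25-word stop list shown on the slide.
stop_words = set(
    "a an and are as at be by for from has he in is it its of on "
    "that the to was were will with".split()
)

# Tokens: every occurrence counts. Types: identical sequences collapse into one.
tokens = [word for text in docs.values() for word in text.split()]
types = set(tokens)

# Stop-word removal, comparing in lowercase so that "That" matches "that".
content_tokens = [t for t in tokens if t.lower() not in stop_words]

print(len(tokens))          # 29
print(len(types))           # 24
print(len(content_tokens))  # 23
```

The five duplicate occurrences (come and hour three times each, it twice) account for the difference between 29 tokens and 24 types.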

86

Trend: from a large list (200-300) to no list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings lists of Lift and Elevator are expanded, so that each term also covers the documents of the other; the query Lift is then a single lookup

Upon querying: the postings lists of Lift and Elevator stay as they are; the query Lift is expanded to Lift OR Elevator

[Figure: postings lists of Lift and Elevator before and after expansion]
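
The two options can be compared on a toy index. The postings below are made up for illustration and are not the values from the slide; `expand_index` and `query_with_expansion` are hypothetical helper names.

```python
# Illustrative postings only (assumed values, not taken from the slide).
index = {"lift": [1, 5], "elevator": [4, 6]}
synonyms = {"lift": ["elevator"], "elevator": ["lift"]}


def expand_index(index, synonyms):
    """Expansion upon indexing: merge each term's postings with those of its synonyms."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for other in synonyms.get(term, []):
            merged.update(index.get(other, []))
        expanded[term] = sorted(merged)
    return expanded


def query_with_expansion(index, synonyms, term):
    """Expansion upon querying: evaluate term OR synonym_1 OR ... on the plain index."""
    result = set(index.get(term, []))
    for other in synonyms.get(term, []):
        result.update(index.get(other, []))
    return sorted(result)


print(expand_index(index, synonyms)["lift"])          # [1, 4, 5, 6]
print(query_with_expansion(index, synonyms, "lift"))  # [1, 4, 5, 6]
```

Both routes return the same documents. Expanding the index costs space and requires re-indexing when the synonym table changes, while expanding the query costs an extra union at query time.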

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz
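
Both normalization steps can be sketched with Python's standard `unicodedata` module. Note that generic accent stripping turns Zürich into Zurich; the Zuerich form shown above needs a language-specific rule, which is added here as an optional German transliteration table. The function names and the `german` flag are assumptions made for this illustration.

```python
import unicodedata

# Language-specific transliteration for German (assumed rule set for this sketch).
GERMAN = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue",
                        "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"})


def strip_accents(text):
    """Decompose characters and drop the combining marks (é -> e, ü -> u)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(token, german=False):
    """Remove accents and diacritics, then lowercase."""
    token = unicodedata.normalize("NFC", token)
    if german:
        token = token.translate(GERMAN)
    return strip_accents(token).lower()


print(normalize("cliché"))               # cliche
print(normalize("Zürich"))               # zurich
print(normalize("Zürich", german=True))  # zuerich
print(normalize("CAT"))                  # cat
```

Whichever rule is chosen, it has to be applied identically to documents and queries, otherwise the two sides end up in different equivalence classes.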

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer

(m>0) ENCI -> ENCE          valenci -> valence
(m>0) ANCI -> ANCE          hesitanci -> hesitance
(m>0) IZER -> IZE           digitizer -> digitize
(m>0) ABLI -> ABLE          conformabli -> conformable
(m>0) ALLI -> AL            radicalli -> radical
(m>0) ENTLI -> ENT          differentli -> different
(m>0) ELI -> E              vileli -> vile
(m>0) OUSLI -> OUS          analogousli -> analogous
(m>0) IZATION -> IZE        vietnamization -> vietnamize
(m>0) ATION -> ATE          predication -> predicate
(m>0) ATOR -> ATE           operator -> operate
(m>0) ALISM -> AL           feudalism -> feudal
(m>0) IVENESS -> IVE        decisiveness -> decisive
(m>0) FULNESS -> FUL        hopefulness -> hopeful
(m>0) OUSNESS -> OUS        callousness -> callous
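
To make the (m>0) condition concrete, here is a sketch that applies only the rules listed above. It is not the full Porter stemmer: the measure m counts vowel-consonant sequences in the remaining stem, the longest matching suffix decides, and the names `is_consonant`, `measure` and `apply_rules` are chosen for this illustration.

```python
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u, or a y after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the vowel-consonant sequences m in the form [C](VC)^m[V]."""
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m


def apply_rules(word):
    """Apply the longest matching suffix rule, provided the stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for word in ["valenci", "vietnamization", "feudalism", "hopefulness"]:
    print(word, "->", apply_rules(word))
# valenci -> valence
# vietnamization -> vietnamize
# feudalism -> feudal
# hopefulness -> hopeful
```

Trying the longest suffix first is what makes vietnamization match IZATION rather than ATION.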

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (term: document ID (as before), term frequency: positions)

Help: C, 1: 1
ETH: C, 1: 2
Zurich: C, 1: 3
to: C, 3: 4 7 11
flexibly: C, 1: 5
react: C, 1: 6

Each entry (document ID, term frequency, positions) is a positional posting.

190

Positional index

Query: "ETH Zurich"

ETH: C, 1: 2
Zurich: C, 1: 3

Match: in document C, Zurich occurs at the position of ETH + 1 (2 + 1 = 3).
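A positional index and the "+1" check for a two-word phrase can be sketched as below. This is a minimal illustration, assuming whitespace tokenization and positions starting at 1 as on the slides; `build_positional_index` and `phrase_query` are hypothetical names.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map each term to {document ID: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, first, second):
    """Documents in which `second` occurs directly after `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return hits


docs = {"C": "Help ETH Zurich to flexibly react to new challenges "
             "and to set new accents in the future"}
index = build_positional_index(docs)
print(index["to"])                           # positions of "to" in document C
print(phrase_query(index, "ETH", "Zurich"))  # documents containing the phrase
```

The term frequency is not stored separately here; it is the length of each position list.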

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents: first challenge

27

Collecting documents: first challenge

Possess it merely That it should come to this

28

Collecting documents: sequence of characters

Possess it merely That it should come to this

P o s s e s s   i t   m e r e l y   T h a t   i t   s h o u l d   c o m e   t o   t h i s

29

Character set

P o s s e s s   i t   m e r e l y   T h a t   i t   s h o u l d   c o m e   t o   t h i s

30

Collecting documents: encoding

Possess it merely That it should come to this

P        o        s        s        ...
01010000 01101111 01110011 01110011 ...

Here: ASCII

31

ASCII Table

     0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0    Control characters
1    Control characters
2    SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3    0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4    @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5    P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6    `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7    p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL

Example: P is in row 5, column 0: 50 (hex) = 0101 0000

34

UTF-8

Character   Codepoint   Codepoint in binary    UTF-8 (variable length)
P           U+0050      101 0000               01010000
π           U+03C0      11 1100 0000           11001111 10000000
€           U+20AC      10 0000 1010 1100      11100010 10000010 10101100

At most 7 bits: 1 byte

At most 11 bits: 2 bytes

At most 16 bits: 3 bytes
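The byte sequences in the table can be reproduced with Python's built-in `str.encode`. The helper name `utf8_bits` is hypothetical; the snippet is only a quick check, not part of the original slides.

```python
def utf8_bits(text):
    """UTF-8 encoding of the text, one 8-bit group per byte."""
    return " ".join(format(byte, "08b") for byte in text.encode("utf-8"))


for char in ["P", "π", "€"]:
    # Codepoint in hexadecimal, followed by the UTF-8 bytes in binary.
    print(f"U+{ord(char):04X}", utf8_bits(char))

# For characters in the ASCII range, UTF-8 and ASCII produce the same bytes.
print(utf8_bits("Poss"))
```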

44

Collecting documents: first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents: first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents: first challenge

Software-specific encoding

47

Collecting documents: first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents: first challenge

Binary encoding

49

Collecting documents: first challenge

Classification (e.g., ML)

Language

Encoding

Type

UTF-8

50

Collecting documents: second challenge

51

Collecting documents: second challenge

Fine-grained documents

(Example: e-mail archive)

52

Collecting documents: second challenge

53

Collecting documents: second challenge

54

Collecting documents: second challenge

Grouping into a single document (Example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

57

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt
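Why throwing away punctuation is not trivial can be seen with a deliberately naive tokenizer. This is a sketch for illustration only, assuming ASCII letters and digits are the only token characters; the name `naive_tokenize` is hypothetical.

```python
import re


def naive_tokenize(text):
    """Split on every run of characters that is not an ASCII letter or digit."""
    return [token for token in re.split(r"[^A-Za-z0-9]+", text) if token]


# Each of these is arguably a single unit, yet the naive rule breaks it apart.
print(naive_tokenize("name@example.com"))
print(naive_tokenize("isn't"))
print(naive_tokenize("Jake O'Neill"))
print(naive_tokenize("the learn-it-all-by-heart methodology"))
```

A user searching for the e-mail address or for O'Neill would then have to match fragments such as `com` or `O`.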

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Parts of the word, as revealed step by step: eat, go to, want, past, <frustration>, inferential/evidential (turns out), 3rd/3rd, also

"Also, it turns out she/he wanted to go eat it, but ..."

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

82

Type

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Type = equivalence class (same sequences)

83

Type

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Type = equivalence class (same sequences)
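The distinction between tokens and types can be made concrete by counting both on the four example documents. This sketch assumes whitespace tokenization and no normalization, so two tokens belong to the same type only if their character sequences are identical.

```python
documents = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# Every occurrence counts as a token.
tokens = [token for document in documents for token in document.split()]

# A type is a class of identical tokens, so a set removes the repeats.
types = set(tokens)

print(len(tokens), "tokens,", len(types), "types")
```

The gap between the two counts comes from `hour` and `come` (three occurrences each) and `it` (two occurrences).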

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters' list

a an and are as at be by for from has he in

is it its of on that the to was were will with
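Filtering with such a stop list takes only a few lines. This is a minimal sketch, assuming whitespace tokenization and case-insensitive matching (the latter is an assumption, not something the slide states); `remove_stop_words` is a hypothetical name.

```python
# The 25 stop words from the list above.
STOP_WORDS = set(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)


def remove_stop_words(text):
    """Keep only the tokens whose lowercased form is not a stop word."""
    return [token for token in text.split() if token.lower() not in STOP_WORDS]


kept = remove_stop_words("Possess it merely That it should come to this")
print(kept)
```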

86

Trend: from a large list (200-300 terms) to no list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. → USA

to-day → today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows → Windows

windows → windows, Windows, window

window → window, windows

Introduces asymmetry

90

When to expand

Upon indexing: expansion is applied to the postings lists of Lift and Elevator when documents are indexed.

Upon querying: the query Lift is expanded to Lift OR Elevator.

(Figure: postings lists of Lift and Elevator, before and after expansion.)

94

Normalization: accents and diacritics

cliché → cliche

Zürich → Zuerich

95

Normalization: lowercasing

CAT → cat

Schweiz → schweiz
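Both normalization steps can be sketched with the standard library. The approach below (Unicode decomposition, then dropping combining marks) is one common technique and an assumption on my part, not necessarily what the slides have in mind; `strip_diacritics` and `normalize` are hypothetical names.

```python
import unicodedata


def strip_diacritics(term):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def normalize(term):
    """Remove diacritics, then lowercase."""
    return strip_diacritics(term).lower()


print(normalize("cliché"))
print(normalize("CAT"))
print(normalize("Schweiz"))
# Generic mark removal gives "zurich"; the German transliteration "Zuerich"
# needs a language-specific rule (ü → ue) that this sketch does not implement.
print(normalize("Zürich"))
```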

96

Normalization: truecasing

The company Apple has launched a new iProduct → Apple

Apples fall from trees → apples

Machine learning inside

97

Stemming

analysis → analysi
feature → featur
are → ar
easily → easili
visible → visibl
variations → variat
individual → individu

Chop letters off the end of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer

(m>0) ENCI -> ENCE          valenci -> valence
(m>0) ANCI -> ANCE          hesitanci -> hesitance
(m>0) IZER -> IZE           digitizer -> digitize
(m>0) ABLI -> ABLE          conformabli -> conformable
(m>0) ALLI -> AL            radicalli -> radical
(m>0) ENTLI -> ENT          differentli -> different
(m>0) ELI -> E              vileli -> vile
(m>0) OUSLI -> OUS          analogousli -> analogous
(m>0) IZATION -> IZE        vietnamization -> vietnamize
(m>0) ATION -> ATE          predication -> predicate
(m>0) ATOR -> ATE           operator -> operate
(m>0) ALISM -> AL           feudalism -> feudal
(m>0) IVENESS -> IVE        decisiveness -> decisive
(m>0) FULNESS -> FUL        hopefulness -> hopeful
(m>0) OUSNESS -> OUS        callousness -> callous
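The rules listed above can be tried out with a small sketch. It covers only these fifteen rules, not the full Porter algorithm, and it assumes the usual definition of the measure m as the number of vowel-consonant sequences in the stem; the function names are hypothetical.

```python
def is_consonant(word, i):
    """Porter's definition: 'y' is a consonant only at the start or after a vowel."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number m of vowel-consonant sequences in [C](VC)^m[V]."""
    m = 0
    previous_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m


# Longest suffixes first, so that IZATION is tried before ATION.
RULES = [
    ("ization", "ize"), ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
    ("ation", "ate"), ("alism", "al"), ("entli", "ent"), ("ousli", "ous"),
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("ator", "ate"), ("eli", "e"),
]


def apply_rules(word):
    """Apply the first matching rule if the remaining stem has m > 0."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for example in ["valenci", "vietnamization", "operator", "nation"]:
    print(example, "->", apply_rules(example))
```

The condition (m>0) is what keeps a short word such as `nation` unchanged: its stem `n` contains no vowel-consonant sequence.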

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois Triangle

Levels, from bottom to top (on both the source and the target side): Words, Syntactic Structure, Semantic Structure, Interlingua

Transfer at each level: Direct (words), Syntactic, Semantic

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term (document frequency): postings

you (1): 1
come (3): 1 3 4
most (1): 1
carefully (1): 1
upon (1): 1
your (1): 1
hour (3): 1 2 3
take (1): 2
thy (1): 2
fair (1): 2
Laertes (1): 2
time (1): 2
be (1): 2
thine (1): 2
my (1): 3
is (1): 3
almost (1): 3
possess (1): 4
it (1): 4
merely (1): 4
that (1): 4
should (1): 4
to (1): 4
this (1): 4
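A standard inverted index for the four example documents can be built as sketched below. The sketch assumes whitespace tokenization and lowercasing as the only preprocessing, which is a simplification of the pipeline discussed earlier; `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict

documents = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_inverted_index(documents):
    """Map each lowercased term to the sorted list of IDs of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        # A set, because a postings list records each document at most once.
        for term in set(documents[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(documents)
print(index["hour"], index["come"])
# The document frequency of a term is the length of its postings list.
print(len(index["hour"]))
```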

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
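The intersection of two sorted postings lists walks both lists with one pointer each. The following is a minimal sketch of that merge, assuming both lists are sorted and free of duplicates; `intersect` is a hypothetical name.

```python
def intersect(a, b):
    """Intersect two sorted postings lists in O(len(a) + len(b)) comparisons."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])  # same document in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer that lags behind
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```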

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B (built up step by step): 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B so far: 1 3

Magical pointer: after the match on 3, jump in List A directly to 10, so that the next match is 10.

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Too short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Too short skips: many comparisons, waste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Too long skips: few comparisons, but not many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Too long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

In practice: √p skips, where p is the number of postings
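Intersection with skip pointers can be sketched as follows. The sketch assumes evenly spaced skips of length about √p on each list, stored implicitly as "every k-th position has a pointer k positions ahead" instead of as explicit pointers; `intersect_with_skips` is a hypothetical name.

```python
import math


def intersect_with_skips(a, b):
    """Intersect two sorted lists, following a skip when it does not overshoot."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # A skip pointer exists at every skip_a-th position of list a.
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print(intersect_with_skips(list_a, list_b))
```

A skip is only taken when its target is not larger than the current element of the other list, so no match can be jumped over.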

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

138

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (built one biword at a time)

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Query: "ETH Zurich"

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

157

Inconvenient

158

Inconvenient

Query: "Help ETH Zurich to flexibly react"

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morphemes, in order: eat, go to, want, Past, <Frustration>, inferential/evidential (turns out), 3rd/3rd, also

"Also, it turns out she/he wanted to go eat it, but..."

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

82

Type

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Type = equivalence class of tokens (same character sequence)
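The difference between tokens and types can be checked on the four example documents with a short sketch. Whitespace splitting and lowercasing are simplifying assumptions made here for illustration only.

```python
documents = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# Tokens: every occurrence counts (whitespace tokenization is an assumption).
tokens = [token for document in documents for token in document.split()]

# Types: distinct character sequences (lowercasing is an extra assumption).
types = {token.lower() for token in tokens}

print(len(tokens), "tokens,", len(types), "types")
```

For example, `hour` and `come` each occur three times as tokens but contribute only one type each.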

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters's list

a an and are as at be by for from has he in

is it its of on that the to was were will with
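As a minimal sketch, stop-word removal with the list above amounts to a set lookup per token. The function name `remove_stop_words` and the lowercasing before lookup are assumptions made for this illustration.

```python
STOP_WORDS = frozenset(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)

def remove_stop_words(tokens):
    """Drop every token whose lowercased form is in the stop list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

print(remove_stop_words("Possess it merely That it should come to this".split()))
```

Note that phrase queries made up only of stop words can no longer be answered once these terms are dropped from the index.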

86

Trend: from a large list (200-300) to no list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. → USA

to-day → today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows → Windows

windows → windows, Windows, window

window → window, windows

Introduces asymmetry
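The asymmetry can be sketched with a plain lookup table from query term to the document terms it should match. The names `EXPANSIONS`, `expand` and `matches` are hypothetical; only the three entries come from the slide.

```python
# Query term -> document terms that should match it (entries taken from the slide).
EXPANSIONS = {
    "Windows": ["Windows"],
    "windows": ["windows", "Windows", "window"],
    "window": ["window", "windows"],
}

def expand(term):
    """Terms without an entry expand to themselves only."""
    return EXPANSIONS.get(term, [term])

def matches(query_term, document_term):
    return document_term in expand(query_term)

print(matches("windows", "Windows"))  # query "windows" finds documents with "Windows"
print(matches("Windows", "windows"))  # but not the other way round
```

With equivalence classes the relation would be symmetric by construction; an expansion table gives up that property in exchange for finer control.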

90

When to expand?

Upon indexing: the postings lists of Lift and Elevator are expanded so that each of the two terms points to the union of both lists. The query Lift is left unchanged.

Upon querying: the index is left unchanged. The query Lift is expanded to Lift OR Elevator.
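The two options can be compared in a small sketch. The postings lists and the synonym table below are made-up values chosen for illustration; they are not the exact numbers of the slide.

```python
index = {"lift": [1, 5], "elevator": [4, 6]}              # hypothetical postings
SYNONYMS = {"lift": ["elevator"], "elevator": ["lift"]}   # hypothetical thesaurus

def expand_index(index):
    """Expansion upon indexing: every term also receives its synonyms' postings."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for other in SYNONYMS.get(term, []):
            merged.update(index.get(other, []))
        expanded[term] = sorted(merged)
    return expanded

def query_with_expansion(index, term):
    """Expansion upon querying: evaluate `term OR synonym` on the unchanged index."""
    result = set(index.get(term, []))
    for other in SYNONYMS.get(term, []):
        result.update(index.get(other, []))
    return sorted(result)

print(expand_index(index)["lift"])          # larger index, plain lookup at query time
print(query_with_expansion(index, "lift"))  # small index, more work per query
```

Both routes return the same documents; the choice trades index size against query-time cost.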

94

Normalization: accents and diacritics

cliché → cliche

Zürich → Zuerich

95

Normalization: lowercasing

CAT → cat

Schweiz → schweiz
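A possible sketch of both normalization steps uses Python's standard `unicodedata` module. The umlaut table (ü → ue and so on) is an assumption added to reproduce the Zürich → Zuerich example; plain accent stripping alone would give Zurich.

```python
import unicodedata

# Assumption: German umlauts are transliterated before the remaining accents are stripped.
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"})

def strip_accents(token):
    # NFC first, so that umlauts are single characters the table can see.
    token = unicodedata.normalize("NFC", token).translate(UMLAUTS)
    # NFD splits a letter from its combining marks (category Mn), which are then dropped.
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def normalize(token):
    return strip_accents(token).lower()

for token in ["cliché", "Zürich", "CAT", "Schweiz"]:
    print(token, "->", normalize(token))
```

Lowercasing everything is the simplest policy; it also merges words that differ only by case, which is what truecasing tries to avoid.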

96

Normalization truecasing

The company Apple has launched a new iProduct → Apple

Apples fall from trees → apples

Machine learning inside

97

Stemming

analysis → analysi
feature → featur
are → ar
easily → easili
visible → visibl
variations → variat
individual → individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE        valenci -> valence
(m>0) ANCI -> ANCE        hesitanci -> hesitance
(m>0) IZER -> IZE         digitizer -> digitize
(m>0) ABLI -> ABLE        conformabli -> conformable
(m>0) ALLI -> AL          radicalli -> radical
(m>0) ENTLI -> ENT        differentli -> different
(m>0) ELI -> E            vileli -> vile
(m>0) OUSLI -> OUS        analogousli -> analogous
(m>0) IZATION -> IZE      vietnamization -> vietnamize
(m>0) ATION -> ATE        predication -> predicate
(m>0) ATOR -> ATE         operator -> operate
(m>0) ALISM -> AL         feudalism -> feudal
(m>0) IVENESS -> IVE      decisiveness -> decisive
(m>0) FULNESS -> FUL      hopefulness -> hopeful
(m>0) OUSNESS -> OUS      callousness -> callous
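The rules above can be tried out with a small sketch. It covers only the suffix rules listed on this slide, not the full Porter algorithm; the function names are hypothetical, and the measure m is computed as the number of vowel-consonant sequences in the stem, following Porter's definition.

```python
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]

def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":  # y is a consonant only at the start or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number m of vowel-consonant sequences when the stem is written [C](VC)^m[V]."""
    pattern = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    return sum(1 for i in range(len(pattern) - 1) if pattern[i:i + 2] == "VC")

def apply_rules(word):
    # The longest matching suffix wins; its rule fires only if the stem has m > 0.
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

for word in ["valenci", "vietnamization", "operator", "vileli"]:
    print(word, "->", apply_rules(word))
```

Trying the longest suffix first matters for words such as vietnamization, where both IZATION and ATION would match.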

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois Triangle

Levels on each side, from bottom to top: Words, Syntactic Structure, Semantic Structure

Apex: Interlingua

Paths between the two sides: Direct, Syntactic Transfer, Semantic Transfer

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages

that have more morphology

105

Skip lists

106

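Reminder: Postings (Standard Inverted Index)

Term (document frequency) → postings

you (1) → 1
come (3) → 1 3 4
most (1) → 1
carefully (1) → 1
upon (1) → 1
your (1) → 1
thine (1) → 2
be (1) → 2
time (1) → 2
Laertes (1) → 2
thy (1) → 2
fair (1) → 2
take (1) → 2
my (1) → 3
hour (3) → 1 2 3
is (1) → 3
almost (1) → 3
possess (1) → 4
merely (1) → 4
that (1) → 4
it (1) → 4
should (1) → 4
to (1) → 4
this (1) → 4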

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
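As a reminder in code form, here is a minimal sketch of the linear merge over two sorted postings lists; the function name `intersect` is an assumption.

```python
def intersect(a, b):
    """Intersect two sorted postings lists with one pointer per list."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer that sits on the smaller document ID
        else:
            j += 1
    return result

list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

The loop stops as soon as one pointer reaches the end of its list, so the cost is linear in the combined length of the two lists.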

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10

Magical pointer: jump directly to 10 in List A

127

In practice

1 2 3 4 5 6 7 8 9 10 12, with skip pointers to 4, 7, 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skip pointers, where p is the number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
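A possible sketch of intersection with skip pointers follows. It assumes the postings are stored in arrays and that a skip pointer sits at every position that is a multiple of roughly √p; real systems store the pointers explicitly in the list, so this only illustrates the idea.

```python
import math

def intersect_with_skips(a, b):
    """Intersect two sorted lists, following a skip only when it cannot jump over a match."""
    skip_a = max(1, math.isqrt(len(a)))  # skip length of about sqrt(p)
    skip_b = max(1, math.isqrt(len(b)))
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Skip pointers exist only at positions that are multiples of skip_a.
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result

list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

On this example the pointer in List A jumps from 4 to 7 and then to 10 instead of visiting 5, 6, 8 and 9.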

133

Phrase search

134

Phrase queries

"ETH Zürich"

Quotes = phrase search

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index:
Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

151

Bi-word indices

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Query: ETH Zurich
Matches the index entry ETH Zurich directly

Query: Help ETH Zurich to flexibly react
Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react
Intersect the postings lists of these biwords

157

Drawback

Query: Help ETH Zurich to flexibly react

Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains all five biwords but not the phrase itself: false positive
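The false positive can be reproduced with a few lines. This is a sketch assuming whitespace tokenization; the helper names `biwords` and `biword_match` are hypothetical.

```python
def biwords(text):
    """All pairs of consecutive words, joined into one vocabulary term each."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

def biword_match(query, document):
    """True if every biword of the query occurs somewhere in the document."""
    document_biwords = set(biwords(document))
    return all(biword in document_biwords for biword in biwords(query))

query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(biwords(query))
print(biword_match(query, document))  # the biword index reports a match
print(query in document)              # but the phrase does not occur verbatim
```

A biword match is therefore only a necessary condition for a phrase match on queries longer than two words.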

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Resulting biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
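Given the tags, building these extended biwords is mechanical. The sketch below hard-codes the N/X tags from the slide rather than running a tagger, and the function name `extended_biwords` is an assumption.

```python
def extended_biwords(tagged):
    """Pair each N-tagged word with the next N-tagged word, skipping X-tagged words in between."""
    kept = [word for word, tag in tagged if tag == "N"]
    return list(zip(kept, kept[1:]))

words = "Help ETH Zurich to introduce techniques so as to flexibly react".split()
tags = ["N", "N", "N", "X", "N", "N", "X", "X", "X", "N", "N"]  # tags as given on the slide

for first, second in extended_biwords(list(zip(words, tags))):
    print(first, second)
```

Because only two tags are used, dropping the X words and pairing neighbours is equivalent to matching the pattern N X* N.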

172

Bi-word indices

Query: Help ETH Zurich to flexibly react

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Drawback

Size of the vocabulary increases exponentially: (#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting = document ID (as before), term frequency, positions

Index:
Help → C, 1: 1
ETH → C, 1: 2
Zurich → C, 1: 3
to → C, 3: 4 7 11
flexibly → C, 1: 5
react → C, 1: 6

190

Positional index

Query: ETH Zurich

ETH → C, 1: 2

Zurich → C, 1: 3

Same document, and the position of Zurich is the position of ETH +1: match
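A minimal sketch of a positional index and of the +1 check for phrase queries follows. Whitespace tokenization, positions starting at 1 and the function names are assumptions; the term frequency is simply the length of each position list.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {document ID: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index

def phrase_query(index, phrase):
    """Documents in which the k-th query term occurs at position start + k."""
    terms = phrase.split()
    if not terms:
        return []
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            if all(start + k in index[term][doc_id] for k, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits

docs = {"C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future"}
index = build_positional_index(docs)
print(index["to"]["C"])
print(phrase_query(index, "ETH Zurich"))
```

Unlike the biword approach, this handles phrases of any length without growing the vocabulary, at the cost of larger postings.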

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents: first challenge

Possess it merely That it should come to this

28

Collecting documents: sequence of characters

Possess it merely That it should come to this

P o s s e s s   i t   m e r e l y   T h a t   i t   s h o u l d   c o m e   t o   t h i s

29

Character set

P o s s e s s   i t   m e r e l y   T h a t   i t   s h o u l d   c o m e   t o   t h i s

30

Collecting documents: encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

01010000 01101111 01110011 01110011 (P o s s)

Here: ASCII

31

ASCII Table

    0 1 2 3 4 5 6 7 8 9 A B C D E F
0   Control characters
1   Control characters
2   SP ! " # $ % & ' ( ) * + , - . /
3   0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4   @ A B C D E F G H I J K L M N O
5   P Q R S T U V W X Y Z [ \ ] ^ _
6   ` a b c d e f g h i j k l m n o
7   p q r s t u v w x y z { | } ~ DEL

P = 50 (hex) = 0 1 0 1 0 0 0 0

34

UTF-8

Character   Codepoint   Codepoint in binary   UTF-8 (variable length)
P           U+0050      101 0000              01010000
π           U+03C0      11 1100 0000          11001111 10000000
€           U+20AC      10 0000 1010 1100     11100010 10000010 10101100

At most 7 bits: one byte

At most 11 bits: two bytes

At most 16 bits: three bytes
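The three rows of the table can be recomputed by hand. The sketch below covers code points up to 16 bits only, and `utf8_encode` is a hypothetical helper checked against Python's built-in `str.encode`.

```python
def utf8_encode(codepoint):
    """Encode one code point of at most 16 bits as UTF-8 bytes."""
    if codepoint < 0x80:        # at most 7 bits: 0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:       # at most 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6), 0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:     # at most 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (codepoint >> 12),
            0x80 | ((codepoint >> 6) & 0x3F),
            0x80 | (codepoint & 0x3F),
        ])
    raise ValueError("code points above U+FFFF need four bytes (not covered here)")

def bits(data):
    return " ".join(f"{byte:08b}" for byte in data)

for character in "Pπ€":
    encoded = utf8_encode(ord(character))
    assert encoded == character.encode("utf-8")  # cross-check with the built-in encoder
    print(character, f"U+{ord(character):04X}", bits(encoded))
```

The leading bits of the first byte (0, 110, 1110) tell a decoder how many bytes the character occupies, which is what makes the variable length unambiguous.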

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification (e.g., ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example: e-mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

58

Building an inverted index

Throw away punctuation

This is not trivial:

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Query: ETH Zurich

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Inconvenient

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
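The false positive can be reproduced with a small sketch of a biword index. The function names and the two-document collection are assumptions made for illustration; tokenization is plain whitespace splitting.

```python
from collections import defaultdict

def biwords(text):
    """Return the consecutive word pairs of a text, e.g. 'Help ETH', 'ETH Zurich'."""
    words = text.split()
    return [f"{first} {second}" for first, second in zip(words, words[1:])]

def build_biword_index(docs):
    """Map each biword to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index

def phrase_query(index, phrase):
    """AND together the postings of all biwords of the phrase."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "Help ETH Zurich to flexibly react to new challenges",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
hits = phrase_query(index, "Help ETH Zurich to flexibly react")
print(hits)  # document 2 is returned although it does not contain the phrase
```

Every biword of the query occurs somewhere in document 2, so the intersection cannot tell that the biwords are not adjacent there.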

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NX*N

Extended biwords:

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react
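A small sketch of how the extended biwords could be derived once each word carries an N or X tag. The tags are typed in by hand here (a real system would obtain them from a part-of-speech tagger), and `extended_biwords` is a hypothetical helper name.

```python
def extended_biwords(tagged_words):
    """Pair each N word with the next N word, skipping any X words in between."""
    pairs = []
    previous_n = None
    for word, tag in tagged_words:
        if tag == "N":
            if previous_n is not None:
                pairs.append(f"{previous_n} {word}")
            previous_n = word
    return pairs

# Hand-assigned tags, matching the slide: N for content words, X for the rest.
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
print(extended_biwords(tagged))
```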

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Inconvenient

Size of the vocabulary increases exponentially: (#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Positional posting: document ID (as before), term frequency, positions

Help: C, 1: 1

ETH: C, 1: 2

Zurich: C, 1: 3

to: C, 3: 4 7 11

flexibly: C, 1: 5

react: C, 1: 6

190

Positional index

Query: ETH Zurich

ETH: C, 1: 2

Zurich: C, 1: 3

+1 (the position of Zurich is the position of ETH plus one)
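A minimal sketch of a positional index and the "+1" check, assuming whitespace tokenization and 1-based positions as on the slide. The document ID "D" and the function names are illustrative assumptions.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}; the term frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index

def phrase_match(index, phrase):
    """Return the documents in which the phrase terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return set()
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            # The k-th term of the phrase must sit at position start + k.
            if all(start + k in index[term][doc_id] for k, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)
print(index["to"]["C"])                   # positions of "to" in document C
print(phrase_match(index, "ETH Zurich"))
```

Unlike the biword index, the query "Help ETH Zurich to flexibly react" matches only document C here, because positions are checked.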

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

24

Documents

25

Collecting documents

26

Collecting documents: first challenge

Possess it merely That it should come to this

28

Collecting documents: sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u l d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u l d c o m e t o t h i s

30

Collecting documents: encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u l d c o m e t o t h i s

01010000 01101111 01110011 01110011 (P o s s)

Here: ASCII

31

ASCII Table

    0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0   Control characters
1   Control characters
2   SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3   0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4   @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5   P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6   `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7   p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL

P: 50 (hex) = 0 1 0 1 0 0 0 0
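The table lookup can be checked with a few lines of Python, using only the built-in `str.encode` and `format`. The variable names are illustrative.

```python
text = "Poss"

# Each ASCII character becomes exactly one byte.
encoded = text.encode("ascii")

hex_codes = [format(byte, "02X") for byte in encoded]
bits = " ".join(format(byte, "08b") for byte in encoded)

print(hex_codes)
print(bits)
```

The first byte is 50 (hex), i.e. row 5, column 0 of the table, which is the letter P.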

34

UTF-8

Character   Codepoint   Codepoint in binary    UTF-8 (variable length)
P           U+0050      101 0000               01010000
π           U+03C0      11 1100 0000           11001111 10000000
€           U+20AC      10 0000 1010 1100      11100010 10000010 10101100

P: 7 bits or fewer

π: 11 bits or fewer

€: 16 bits or fewer
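The three rows can be reproduced with Python's built-in UTF-8 codec. The helper name `utf8_bits` is an assumption for illustration.

```python
def utf8_bits(character):
    """Return the UTF-8 encoding of one character as space-separated bytes in binary."""
    return " ".join(format(byte, "08b") for byte in character.encode("utf-8"))

for character in "Pπ€":
    codepoint = ord(character)
    print(f"{character}  U+{codepoint:04X}  {codepoint:b}  {utf8_bits(character)}")
```

The number of bytes grows with the number of bits the codepoint needs: one byte for P, two for π, three for €.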

44

Collecting documents: first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents: first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents: first challenge

Software-specific encoding

47

Collecting documents: first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents: first challenge

Binary encoding

49

Collecting documents: first challenge

Classification (e.g. ML)

Language

Encoding

Type

UTF-8

50

Collecting documents: second challenge

Fine-grained documents (Example: E-Mail archive)

54

Collecting documents: second challenge

Grouping to a single document (Example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

58

Building an inverted index

Throw away punctuation: this is not trivial

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel-Stadt

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

(Chinese example sentence; the characters were lost in extraction)

68

Hindi

मुझे इन्फ़र्मेशन रिट्रीवल बहुत पसंद है

मैं इस टर्म अच्छा पढ़ रहा हूँ (raha: male speaker)

मैं इस टर्म अच्छा पढ़ रही हूँ (rahee: female speaker)

मुझे: "to me" (oblique case)

है: 3rd person

हूँ: 1st person

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morpheme by morpheme: eat, go to, want, Past <Frustration>, inferential/evidential (turns out), 3rd/3rd, also

Also, it turns out she/he wanted to go eat it, but

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

81

Token

Token = grouped sequence of characters

82

Type

Type = equivalence class (same sequences)

84

Stop words

85

Reuters's list

a an and are as at be by for from has he in is it its of on that the to was were will with
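A minimal sketch of tokens, types, and stop word removal on one of the example documents. The regular expression (runs of ASCII letters) is a deliberate simplification and an assumption; it would mishandle the corner cases listed earlier. The function names are illustrative.

```python
import re

# The stop list shown on the slide.
STOP_WORDS = set(
    "a an and are as at be by for from has he in is it its of on "
    "that the to was were will with".split()
)

def tokenize(text):
    """Split text into tokens: here, simply maximal runs of ASCII letters."""
    return re.findall(r"[A-Za-z]+", text)

def index_terms(text):
    """Lowercase the tokens and drop those that are on the stop list."""
    return [t.lower() for t in tokenize(text) if t.lower() not in STOP_WORDS]

document = "Possess it merely That it should come to this"
tokens = tokenize(document)
types = set(tokens)

print(len(tokens), len(types))  # "it" occurs twice: two tokens, one type
print(index_terms(document))
```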

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing

Query: Lift

Index before expansion:

Lift: 1 5

Elevator: 1 4 6

Expansion (index after expansion):

Lift: 1 4 5 6

Elevator: 1 4 5 6

Upon querying

Query: Lift

Expansion: Lift OR Elevator

Index (unchanged):

Lift: 1 5

Elevator: 1 4 6
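The two options can be contrasted in a short sketch. The synonym table, the postings, and the function names below are all illustrative assumptions, not data taken from the slides.

```python
# Hypothetical synonym table and postings, for illustration only.
SYNONYMS = {"lift": {"elevator"}, "elevator": {"lift"}}
INDEX = {"lift": {1, 5}, "elevator": {1, 4, 6}}

def expand_index(index, synonyms):
    """Expansion upon indexing: each term also stores the postings of its synonyms."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for synonym in synonyms.get(term, ()):
            merged |= index.get(synonym, set())
        expanded[term] = merged
    return expanded

def query_with_expansion(index, term, synonyms):
    """Expansion upon querying: rewrite 'term' as 'term OR synonym ...'."""
    result = set(index.get(term, set()))
    for synonym in synonyms.get(term, ()):
        result |= index.get(synonym, set())
    return result

expanded_index = expand_index(INDEX, SYNONYMS)
print(sorted(expanded_index["lift"]))                         # larger index, simple lookup
print(sorted(query_with_expansion(INDEX, "lift", SYNONYMS)))  # small index, more work per query
```

Both routes return the same documents; the trade-off is between index size and query-time work.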

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside
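Accent removal and lowercasing can be sketched with the standard `unicodedata` module. The German transliteration table and the function names are assumptions added for illustration. Truecasing is left out, since it needs a trained model.

```python
import unicodedata

# Assumed language-specific rule: German umlauts are written out, not just stripped.
GERMAN_TRANSLITERATION = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def german_fold(text):
    """Apply the German transliteration before generic accent stripping."""
    for source, target in GERMAN_TRANSLITERATION.items():
        text = text.replace(source, target)
    return strip_accents(text)

print(strip_accents("cliché"))
print(german_fold("Zürich"))
print("CAT".lower(), "Schweiz".lower())
```

Generic stripping alone would map Zürich to Zurich; getting Zuerich requires the language-specific table.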

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE          valenci -> valence
(m>0) ANCI -> ANCE          hesitanci -> hesitance
(m>0) IZER -> IZE           digitizer -> digitize
(m>0) ABLI -> ABLE          conformabli -> conformable
(m>0) ALLI -> AL            radicalli -> radical
(m>0) ENTLI -> ENT          differentli -> different
(m>0) ELI -> E              vileli -> vile
(m>0) OUSLI -> OUS          analogousli -> analogous
(m>0) IZATION -> IZE        vietnamization -> vietnamize
(m>0) ATION -> ATE          predication -> predicate
(m>0) ATOR -> ATE           operator -> operate
(m>0) ALISM -> AL           feudalism -> feudal
(m>0) IVENESS -> IVE        decisiveness -> decisive
(m>0) FULNESS -> FUL        hopefulness -> hopeful
(m>0) OUSNESS -> OUS        callousness -> callous

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret
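A sketch of how the suffix rules listed for the Porter stemmer could be applied. It covers only those fifteen rules, not the full algorithm, and the names `measure` and `apply_rules` are illustrative. The measure m counts vowel-run-then-consonant-run sequences in the remaining stem, following Porter's definition.

```python
# The rules from the slide as (suffix, replacement); all require m > 0.
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]

def measure(stem):
    """Porter's m: how many times a vowel run is followed by a consonant run."""
    kinds = []
    for i, ch in enumerate(stem):
        # 'y' counts as a vowel when it follows a consonant.
        is_vowel = ch in "aeiou" or (ch == "y" and i > 0 and kinds[i - 1] == "c")
        kinds.append("v" if is_vowel else "c")
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if prev == "v" and cur == "c")

def apply_rules(word):
    """Rewrite the longest matching suffix, provided the remaining stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

for word in ["valenci", "vietnamization", "operator", "hopefulness"]:
    print(word, "->", apply_rules(word))
```

Trying the longest suffix first matters for words such as vietnamization, where both IZATION and ATION match.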

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

(Figure: on each side, from bottom to top, Words, Syntactic Structure, Semantic Structure, with Interlingua at the apex. Arrows between the two sides: Direct; Transfer, Syntactic and Semantic.)

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4
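A minimal sketch of how such postings could be built from the four example documents, assuming whitespace tokenization and lowercasing of every token (so Laertes becomes laertes, unlike on the slide). The function name is illustrative.

```python
DOCS = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = {}
    for doc_id in sorted(docs):
        for token in docs[doc_id].split():
            postings = index.setdefault(token.lower(), [])
            # Documents are visited in increasing ID order, so checking the
            # last entry is enough to avoid duplicates ("it" occurs twice in 4).
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return index

inverted_index = build_inverted_index(DOCS)
for term in ("come", "hour", "it"):
    postings = inverted_index[term]
    print(term, len(postings), postings)
```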

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
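As a reminder in code form, a sketch of the linear merge over two sorted postings lists. The function name is illustrative.

```python
def intersect(list_a, list_b):
    """Walk both sorted lists in parallel and keep the IDs present in both."""
    i = j = 0
    result = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1  # advance the pointer that sits on the smaller ID
        else:
            j += 1
    return result

print(intersect([1, 2, 4, 5, 8, 9, 10, 12], [1, 3, 4, 6, 7, 8, 11, 12]))
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.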

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B, built up step by step: 1, then 1 3, then 1 3 10, then 1 3 10 12


25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character   Codepoint   Codepoint in binary    UTF-8 (variable length)
P           U+0050      101 0000               01010000
π           U+03C0      11 1100 0000           11001111 10000000
€           U+20AC      10 0000 1010 1100      11100010 10000010 10101100

At most 7 bits: 1 byte
At most 11 bits: 2 bytes
At most 16 bits: 3 bytes
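As a rough illustration of the variable-length scheme, the sketch below packs a code point into UTF-8 bytes by hand and compares the result with Python's built-in encoder. The function names `utf8_encode` and `bits` are assumptions for this example; the sketch ignores the surrogate range, which a real encoder must reject.

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (surrogates not checked)."""
    if codepoint < 0x80:      # at most 7 bits:  0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:     # at most 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:   # at most 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # at most 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])


def bits(data: bytes) -> str:
    """Render bytes as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in data)


for char in "Pπ€":
    print(char, f"U+{ord(char):04X}", bits(utf8_encode(ord(char))))
```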

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification (e.g., ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example: e-mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping into a single document (Example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour
Document 2: Take thy fair hour Laertes time be thine
Document 3: My hour is almost come
Document 4: Possess it merely That it should come to this

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Example sentence in Chinese characters]

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morphemes, in order:
eat
go to
want
Past
<Frustration>
inferential/evidential (turns out)
3rd/3rd
also

"Also, it turns out she/he wanted to go eat it, but ..."

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour
Take thy fair hour Laertes time be thine
My hour is almost come
Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class of tokens (same sequence of characters)
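The distinction between tokens and types can be made concrete with a small sketch. It assumes a deliberately naive tokenizer (runs of word characters, case-sensitive); the names `DOCS` and `tokenize` are hypothetical and only serve this example.

```python
import re
from collections import Counter

DOCS = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def tokenize(text: str) -> list[str]:
    """Naive tokenizer: every maximal run of word characters is a token."""
    return re.findall(r"\w+", text)


# Tokens are occurrences; types are the distinct character sequences.
tokens = [token for text in DOCS.values() for token in tokenize(text)]
types = Counter(tokens)

print(len(tokens), "tokens,", len(types), "types")
print("occurrences of 'hour':", types["hour"])
```

Here "hour" is one type that occurs as several tokens; no normalization is applied yet, so "That" and "that" would count as different types.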

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters' list

a an and are as at be by for from has he in

is it its of on that the to was were will with
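A minimal sketch of stop-word removal with this 25-word list is shown below. The lowercasing step and the function name `remove_stop_words` are assumptions for the example, not something the slides prescribe.

```python
STOP_WORDS = frozenset(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)


def remove_stop_words(text: str) -> list[str]:
    """Lowercase, split on whitespace, and drop tokens on the stop list."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


print(remove_stop_words("Possess it merely That it should come to this"))
print(remove_stop_words("You come most carefully upon your hour"))
```

Note that "this" survives because it is not on the list, whereas "that", "it" and "to" are dropped.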

86

Trend

Large list (200-300) -> No list at all

87

Equivalence classes of terms (types)

{to-day, today}

{U.S.A., USA}

{window, windows, Windows}

{Zurich, Zürich, Zuerich, Züri}

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

90


When to expand

Upon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6
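The two options can be contrasted in a few lines of Python. This is only a sketch: the synonym table, the toy documents, and the function names are assumptions for illustration. Expanding at indexing time stores each document under every synonym; expanding at query time rewrites the query `Lift` into `Lift OR Elevator` and leaves the index untouched.

```python
from collections import defaultdict

# Hypothetical synonym table; a real system would take it from a thesaurus.
SYNONYMS = {"lift": {"elevator"}, "elevator": {"lift"}}


def expand(term: str) -> set[str]:
    """Return the term together with its synonyms."""
    return {term} | SYNONYMS.get(term, set())


def build_index(docs: dict[int, str], expand_at_index_time: bool = False):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            terms = expand(token) if expand_at_index_time else {token}
            for term in terms:
                index[term].add(doc_id)
    return index


def search(index, term: str, expand_at_query_time: bool = False) -> list[int]:
    """Single-term lookup; with expansion it behaves like 'Lift OR Elevator'."""
    terms = expand(term) if expand_at_query_time else {term}
    result = set()
    for t in terms:
        result |= index.get(t, set())
    return sorted(result)


docs = {1: "lift", 4: "elevator", 5: "lift", 6: "lift"}  # toy collection
print(search(build_index(docs, expand_at_index_time=True), "lift"))
print(search(build_index(docs), "lift", expand_at_query_time=True))
```

Both variants return the same documents here; the trade-off is a larger index in the first case and more work per query in the second.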

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside
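A possible sketch of the first two normalization steps (diacritics and lowercasing) in Python follows. The German transliteration table and the function names are assumptions for this example; truecasing is left out because it needs a trained model.

```python
import unicodedata

# Assumed transliteration for German umlauts, matching Zürich -> Zuerich.
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}


def strip_diacritics(term: str) -> str:
    """Transliterate German umlauts, then drop all remaining accents."""
    term = unicodedata.normalize("NFC", term)  # compose, so the table matches
    term = "".join(GERMAN.get(char, char) for char in term)
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(char for char in decomposed if not unicodedata.combining(char))


def normalize(term: str) -> str:
    """Remove diacritics, then lowercase."""
    return strip_diacritics(term).lower()


for word in ["cliché", "Zürich", "CAT", "Schweiz"]:
    print(word, "->", normalize(word))
```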

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the end of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE          valenci -> valence
(m>0) ANCI -> ANCE          hesitanci -> hesitance
(m>0) IZER -> IZE           digitizer -> digitize
(m>0) ABLI -> ABLE          conformabli -> conformable
(m>0) ALLI -> AL            radicalli -> radical
(m>0) ENTLI -> ENT          differentli -> different
(m>0) ELI -> E              vileli -> vile
(m>0) OUSLI -> OUS          analogousli -> analogous
(m>0) IZATION -> IZE        vietnamization -> vietnamize
(m>0) ATION -> ATE          predication -> predicate
(m>0) ATOR -> ATE           operator -> operate
(m>0) ALISM -> AL           feudalism -> feudal
(m>0) IVENESS -> IVE        decisiveness -> decisive
(m>0) FULNESS -> FUL        hopefulness -> hopeful
(m>0) OUSNESS -> OUS        callousness -> callous
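To make the condition `(m>0)` concrete, here is a small sketch that applies only the rules listed on this slide. It is not the full Porter stemmer: the other steps are omitted, input is assumed to be lowercase, and the names `measure`, `RULES` and `apply_rules` are chosen for this example. The measure m counts the vowel-consonant sequences in the stem that remains after removing the suffix.

```python
def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: 'y' is a consonant unless preceded by a consonant."""
    char = word[i]
    if char in "aeiou":
        return False
    if char == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Number of vowel-to-consonant transitions, the m in [C](VC)^m[V]."""
    m = 0
    previous_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m


# Only the rules from the slide; longer suffixes come before their tails.
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def apply_rules(word: str) -> str:
    """Rewrite the first matching suffix if the remaining stem has m > 0."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for word in ["valenci", "vietnamization", "hopefulness", "ration"]:
    print(word, "->", apply_rules(word))
```

The last word is left unchanged: removing ATION leaves the stem "r", whose measure is 0.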

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois Triangle

Levels, from bottom to top (source side and target side): Words, Syntactic Structure, Semantic Structure, Interlingua

Paths between the two sides: Direct, Syntactic Transfer, Semantic Transfer

102

Lemmatization

Full morphological analysis

computer
compute
computes
computed
computation
computing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages

that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term        Document frequency   Postings
you         1                    1
come        3                    1 3 4
most        1                    1
carefully   1                    1
upon        1                    1
your        1                    1
thine       1                    2
be          1                    2
time        1                    2
Laertes     1                    2
thy         1                    2
fair        1                    2
take        1                    2
my          1                    3
hour        3                    1 2 3
is          1                    3
almost      1                    3
possess     1                    4
merely      1                    4
that        1                    4
it          1                    4
should      1                    4
to          1                    4
this        1                    4

107

Reminder Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
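As a reminder of how the two sorted lists are merged, here is a minimal Python sketch of the linear intersection. The function name `intersect` is chosen for this example; both inputs are assumed to be sorted and free of duplicates.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted postings lists in O(len(a) + len(b))."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```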

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B so far: 1 3

Magical pointer: once List B is at 10, jump ahead in List A directly to 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip pointers to 4, 7, 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p evenly spaced skip pointers, where p is the number of postings

Example: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 (p = 16)
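One way to sketch intersection with skip pointers in Python is shown below. It assumes the skip pointers are implicit: every k-th posting, with k the integer square root of the list length, points k postings ahead. A real index would store the pointers inside the postings list; the function name is chosen for this example.

```python
import math


def intersect_with_skips(a: list[int], b: list[int]) -> list[int]:
    """Intersect sorted postings lists, following skip pointers when safe."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Follow the skip only if its target does not overshoot b[j].
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print(intersect_with_skips(list_a, list_b))
```

On these lists the pointer in List A jumps from 4 to 7 and from 7 to 10 instead of visiting every posting in between.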

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

151

Bi-word indices

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Query: "ETH Zurich"
Look up the biword ETH Zurich directly

Query: "Help ETH Zurich to flexibly react"
"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

157

Drawback

Query: "Help ETH Zurich to flexibly react"
"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains every biword of the query, but not the phrase itself

False positive
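The false positive can be reproduced with a small biword index. The sketch below is illustrative only: the whitespace tokenizer, the document IDs, and the function names are assumptions, and a phrase query is answered by intersecting the postings of all consecutive word pairs.

```python
from collections import defaultdict


def biwords(text: str) -> list[tuple[str, str]]:
    """All pairs of consecutive tokens (lowercased, whitespace tokenization)."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def build_biword_index(docs: dict[int, str]):
    """Map each biword to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index


def phrase_query(index, phrase: str) -> list[int]:
    """AND together the postings of every biword of the phrase."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return sorted(set.intersection(*postings)) if postings else []


docs = {
    1: "Help ETH Zurich to flexibly react to new challenges "
       "and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

Document 2 is returned although it does not contain the phrase, because each individual biword occurs somewhere in it.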

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react
N    N   N      X  N         N          X  X  X  N        N

Pattern: N X* N

Help ETH
ETH Zurich
Zurich introduce
introduce techniques
techniques flexibly
flexibly react

172

Bi-word indices

Query: "Help ETH Zurich to flexibly react"
Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to
"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"
Intersect

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"
Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to
"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"
Intersect

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"
Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to
"Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"
Intersect

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"
"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Drawback

Size of the vocabulary increases exponentially with the phrase length n

(#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Term       Document ID   Term frequency   Positions
Help       C             1                1
ETH        C             1                2
Zurich     C             1                3
to         C             3                4, 7, 11
flexibly   C             1                5
react      C             1                6

Positional posting = document ID (as before) + term frequency + positions

190

Positional index

Query: "ETH Zurich"

Term       Document ID   Term frequency   Positions
ETH        C             1                2
Zurich     C             1                3

Position of ETH + 1 = position of Zurich: the phrase occurs in document C
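A positional index and the "+1" check can be sketched as follows. The data layout (term -> document ID -> list of positions), the second document D, and the function names are assumptions for this example; the term frequency is simply the length of the position list.

```python
from collections import defaultdict


def build_positional_index(docs: dict[str, str]):
    """Map term -> document ID -> sorted list of 1-based positions."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split(), start=1):
            index[token].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase: str) -> list[str]:
    """Documents where the k-th query term occurs at start position + k."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        starts = index[terms[0]][doc_id]
        if any(all(start + k in index[term][doc_id]
                   for k, term in enumerate(terms))
               for start in starts):
            hits.append(doc_id)
    return hits


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)
print(index["to"]["C"])
print(phrase_query(index, "ETH Zurich"))
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

Unlike the biword approach, document D is rejected for the long phrase, since "flexibly" does not directly follow the first "to" there.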

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois Triangle

[Figure: the Vauquois triangle. Each side rises from Words through Syntactic Structure to Semantic Structure, with Interlingua at the apex. The two sides are connected by direct, syntactic and semantic transfer.]

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help, and can even degrade performance, for English documents

Lemmatization can help with languages that have more morphology

105

Skip lists

106

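Reminder: Postings (Standard Inverted Index)

Term (document frequency) -> postings

you (1) -> 1
come (3) -> 1 3 4
most (1) -> 1
carefully (1) -> 1
upon (1) -> 1
your (1) -> 1
thine (1) -> 2
be (1) -> 2
time (1) -> 2
Laertes (1) -> 2
thy (1) -> 2
fair (1) -> 2
take (1) -> 2
my (1) -> 3
hour (3) -> 1 2 3
is (1) -> 3
almost (1) -> 3
possess (1) -> 4
merely (1) -> 4
that (1) -> 4
it (1) -> 4
should (1) -> 4
to (1) -> 4
this (1) -> 4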
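As a reminder of how such postings come about, here is a minimal Python sketch that builds the index from the four example documents of the lecture. The tokenizer (lowercase, letters only) is a simplifying assumption, and `build_index` is an illustrative name.

```python
import re
from collections import defaultdict

documents = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_index(documents):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index[token].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


index = build_index(documents)
for term in ("come", "hour", "it"):
    print(term, len(index[term]), index[term])
```

The document frequency of a term is simply the length of its postings list.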

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
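The intersection of two sorted postings lists can be sketched in Python as a linear merge. This is the standard two-pointer formulation, written here as an illustration; `intersect` is an assumed name.

```python
def intersect(a, b):
    """Intersect two sorted postings lists in O(len(a) + len(b)) comparisons."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot occur in b any more
        else:
            j += 1  # b[j] cannot occur in a any more
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```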

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

The next posting in List B is 10. A "magical pointer" from 3 to 10 in List A would let us skip the postings in between.

127

In practice

Postings: 1 2 3 4 5 6 7 8 9 10 12

Skip pointers at: 4 7 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skip pointers, where p is the number of postings
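A possible Python sketch of intersection with skip pointers, assuming evenly spaced skips of length about √p. The representation of skips as a dictionary from index to index, and both function names, are illustrative choices rather than something prescribed by the slides.

```python
import math


def skip_targets(postings):
    """Map a position to the position it may skip to (about sqrt(p) skips)."""
    p = len(postings)
    step = max(1, math.isqrt(p))
    return {i: i + step for i in range(0, p - step, step)}


def intersect_with_skips(a, b):
    """Merge-intersect two sorted lists, following a skip when it cannot miss a match."""
    skips_a, skips_b = skip_targets(a), skip_targets(b)
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Skip only if the target is still <= b[j]: everything jumped over is smaller.
            if i in skips_a and a[skips_a[i]] <= b[j]:
                i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                j = skips_b[j]
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

With these lists, List A has 12 postings and therefore skips of length 3, so after matching 3 the pointer jumps from 4 to 7 to 10 instead of visiting every posting.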

133

Phrase search

134

Phrase queries

"ETH Zürich"

Quotes = phrase search

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index:

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to
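Generating the biword vocabulary for a document is a one-liner over consecutive token pairs. The sketch below assumes whitespace tokenization; `biwords` is an illustrative name.

```python
def biwords(text):
    """Return every pair of consecutive tokens as a single index term."""
    tokens = text.split()
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


sentence = ("Help ETH Zurich to flexibly react to new challenges "
            "and to set new accents in the future")
entries = biwords(sentence)
print(entries[:6])
```

A document with n tokens yields n - 1 biwords, each of which becomes a dictionary entry with its own postings list.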

151

Bi-word indices

Query: "ETH Zurich"

Index entries: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Query: "Help ETH Zurich to flexibly react"

Intersect: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

157

Drawback

Query: "Help ETH Zurich to flexibly react"

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
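The false positive can be reproduced with a few lines of Python. This is a sketch that treats a document as the set of its biwords and a phrase query as a conjunction of biwords; the function names are illustrative.

```python
def biwords(text):
    tokens = text.split()
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


def biword_match(query, document):
    """True if every biword of the query occurs somewhere in the document."""
    return set(biwords(query)) <= set(biwords(document))


query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

matched_by_biwords = biword_match(query, document)
contains_exact_phrase = query in document
print(matched_by_biwords, contains_exact_phrase)
```

The biword conjunction is satisfied although the exact phrase does not occur: "Zurich to" and "to flexibly" are both present, but they come from different places in the document.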

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
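With only the two tags N and X, the pattern N X* N amounts to pairing each N token with the next N token. A minimal Python sketch, using the tags from the slide as given input (no tagger is run here, and `extended_biwords` is an illustrative name):

```python
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]


def extended_biwords(tagged_tokens):
    """Pair consecutive N tokens, ignoring any X tokens in between (N X* N)."""
    content = [word for word, tag in tagged_tokens if tag == "N"]
    return [f"{first} {second}" for first, second in zip(content, content[1:])]


print(extended_biwords(tagged))
```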

172


173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index entries: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

Intersect: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index entries: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

Intersect: Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Drawback

Size of the vocabulary increases exponentially: (number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (positional posting: document ID as before, term frequency, positions)

Help -> C, 1: 1
ETH -> C, 1: 2
Zurich -> C, 1: 3
to -> C, 3: 4 7 11
flexibly -> C, 1: 5
react -> C, 1: 6

190

Positional index

Query: "ETH Zurich"

ETH -> C, 1: 2
Zurich -> C, 1: 3

The position of Zurich is the position of ETH +1, so the phrase matches in document C
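The positional index and the "+1" check can be sketched in Python as follows. Positions start at 1 as on the slides; the second document D and the function names are assumptions added for illustration, and the term frequency is simply the length of each position list.

```python
from collections import defaultdict


def build_positional_index(documents):
    """term -> {doc_id: [positions]}, with positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, token in enumerate(text.split(), start=1):
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)


def phrase_search(index, phrase):
    """Return the documents in which the terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return []
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            # Term number k of the phrase must sit at position start + k.
            if all(start + k in index[term][doc_id] for k, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


documents = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",  # hypothetical
}
index = build_positional_index(documents)
print(index["to"]["C"])
print(phrase_search(index, "ETH Zurich"))
print(phrase_search(index, "Help ETH Zurich to flexibly react"))
```

Unlike the biword index, the positional index rejects document D for the long phrase, because "flexibly" is not at the position right after "to" at position 4.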

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents: sequence of characters

Possess it merely That it should come to this

P o s s e s s  i t  m e r e l y  T h a t  i t  s h o u l d  c o m e  t o  t h i s

Character set

Collecting documents: encoding

P o s s -> 01010000 01101111 01110011 01110011

Here: ASCII
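The bit string on the slide can be reproduced with Python's built-in `str.encode`. A small sketch (`ascii_bits` is an illustrative name):

```python
def ascii_bits(text):
    """Encode text as ASCII and show each byte as eight bits."""
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))


print(ascii_bits("Poss"))
```

Characters outside ASCII make `encode("ascii")` raise a `UnicodeEncodeError`, which is one way the choice of encoding becomes visible when collecting documents.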

31

ASCII Table

    0 1 2 3 4 5 6 7 8 9 A B C D E F
0   Control characters
1   Control characters
2   SP ! " # $ % & ' ( ) * + , - . /
3   0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4   @ A B C D E F G H I J K L M N O
5   P Q R S T U V W X Y Z [ \ ] ^ _
6   ` a b c d e f g h i j k l m n o
7   p q r s t u v w x y z { | } ~ DEL

P = 50 (hex) = 0101 0000

34

UTF-8

Character | Codepoint | Codepoint in binary | UTF-8 (variable length)
P | U+0050 | 101 0000 | 01010000
π | U+03C0 | 11 1100 0000 | 11001111 10000000
€ | U+20AC | 10 0000 1010 1100 | 11100010 10000010 10101100

P: at most 7 bits (1 byte)
π: at most 11 bits (2 bytes)
€: at most 16 bits (3 bytes)
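The variable-length scheme can be written out by hand and checked against Python's built-in encoder. The sketch below is for illustration only: it does not reject surrogate codepoints, and the function names are assumptions.

```python
def encode_codepoint(cp):
    """Encode one Unicode codepoint as UTF-8 bytes (1 to 4 bytes)."""
    if cp < 0x80:        # up to 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    return bytes([0b11110000 | (cp >> 18),       # up to 21 bits: 4 bytes
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])


def bits(data):
    return " ".join(format(byte, "08b") for byte in data)


for char in "Pπ€":
    print(char, f"U+{ord(char):04X}", bits(encode_codepoint(ord(char))))
```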

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification (e.g. ML)

Language

Encoding

Type

UTF-8

50

Collecting documents: second challenge

Fine-grained documents (Example: E-Mail archive)

Grouping to a single document (Example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour
Document 2: Take thy fair hour Laertes time be thine
Document 3: My hour is almost come
Document 4: Possess it merely That it should come to this

Throw away punctuation

This is not trivial:

name@example.com
isn't
Jake O'Neill
the learn-it-all-by-heart methodology
website vs web site
Kanton Basel Stadt

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs New York University

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU
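A deliberately naive tokenizer shows why these cases are hard. The sketch below simply throws away everything that is not a letter or a digit; `naive_tokenize` is an illustrative name and not a recommended approach.

```python
import re


def naive_tokenize(text):
    """Keep maximal runs of ASCII letters and digits, drop everything else."""
    return re.findall(r"[A-Za-z0-9]+", text)


print(naive_tokenize("Jake O'Neill isn't at name@example.com"))
print(naive_tokenize("(800) 234-2333"))
print(naive_tokenize("Hewlett-Packard"))
```

The e-mail address, the contraction, the surname, the phone number and the company name are all split into fragments that no longer carry their original meaning.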

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Chinese example sentence: not recoverable from the extracted text]

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Glosses: eat, go to, want, Past, <Frustration>, inferential/evidential (turns out), 3rd/3rd, also

Also, it turns out she/he wanted to go eat it, but

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class (same sequences)
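The difference between tokens and types can be counted directly on the four example documents. A minimal Python sketch, assuming whitespace tokenization and no normalization (so "That" and "that" would be different types):

```python
lines = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# Tokens: every occurrence counts. Types: identical sequences are grouped.
tokens = [token for line in lines for token in line.split()]
types = set(tokens)

print(len(tokens), "tokens,", len(types), "types")
```

"hour" and "come" occur three times each and "it" twice, which accounts for the difference between the two counts.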

84

Stop words

Reuters' list

a an and are as at be by for from has he in is it its of on that the to was were will with

Trend: from a large list (200-300) to no list at all
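Filtering with such a list is straightforward. The Python sketch below uses the 25 words shown above; lowercasing before the lookup and the name `remove_stop_words` are assumptions made for the example.

```python
STOP_WORDS = set(
    "a an and are as at be by for from has he in is it its of on "
    "that the to was were will with".split()
)


def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop the tokens found in the stop list."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


print(remove_stop_words("Possess it merely That it should come to this"))
print(remove_stop_words("My hour is almost come"))
```

Dropping stop words shrinks the postings, but it also makes phrase queries that consist mostly of stop words impossible to answer, which is one motivation for the trend towards no list at all.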

87

Equivalence classes of terms (types)

U.S.A., USA
to-day, today
window, windows, Windows
Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA
to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows
windows -> Windows, windows, window
window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings lists of Lift and Elevator are expanded, and the query "Lift" is left unchanged.

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

28

Collecting documents: sequence of characters

Possess it merely That it should come to this

P o s s e s s  i t  m e r e l y  T h a t  i t  s h o u l d  c o m e  t o  t h i s

29

Character set

P o s s e s s  i t  m e r e l y  T h a t  i t  s h o u l d  c o m e  t o  t h i s

30

Collecting documents: encoding

Possess it merely That it should come to this

P o s s e s s  i t  m e r e l y  T h a t  i t  s h o u l d ...

01010000 01101111 01110011 01110011 ...   (P o s s ...)

Here: ASCII
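A minimal Python sketch of the encoding step shown above. The helper name `to_bits` is made up for this illustration; it turns each character into its ASCII byte and prints the byte as eight bits.

```python
def to_bits(text, encoding="ascii"):
    """Return the encoded bytes of `text` as 8-character bit strings."""
    return [format(byte, "08b") for byte in text.encode(encoding)]

# First four characters of the example sentence
print(" ".join(to_bits("Poss")))
```

Each printed group of eight bits corresponds to one character of the input.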

31

ASCII Table

     0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0    Control characters
1    Control characters
2    SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3    0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4    @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5    P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6    `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7    p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL

P = 50 (hex) = 0 1 0 1 0 0 0 0

34

UTF-8

Character   Codepoint   Codepoint in binary   UTF-8 (variable length)
P           U+0050      101 0000              01010000                     (at most 7 bits)
π           U+03C0      11 1100 0000          11001111 10000000            (at most 11 bits)
€           U+20AC      10 0000 1010 1100     11100010 10000010 10101100   (at most 16 bits)
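The three rows of the UTF-8 slide can be reproduced with Python's built-in encoder. This is only a sketch; `utf8_bits` is a hypothetical helper name introduced here.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as space-separated bit strings."""
    return " ".join(format(b, "08b") for b in ch.encode("utf-8"))

for ch in ["P", "\u03c0", "\u20ac"]:  # P, Greek small letter pi, euro sign
    codepoint = ord(ch)
    print(ch, f"U+{codepoint:04X}", format(codepoint, "b"), utf8_bits(ch))
```

The number of bytes grows with the number of bits in the codepoint: one byte for P, two for π, three for €.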

44

Collecting documents: first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents: first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents: first challenge

Software-specific encoding

47

Collecting documents: first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents: first challenge

Binary encoding

49

Collecting documents: first challenge

Classification (e.g., ML)

Language

Encoding

Type

UTF-8

50

Collecting documents: second challenge

Fine-grained documents

(Example: e-mail archive)

54

Collecting documents: second challenge

Grouping to a single document (Example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

58

Building an inverted index

Throw away punctuation

This is not trivial

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Chinese example sentence]

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Components: eat, go to, want, Past, <Frustration>, inferential/evidential (turns out), 3rd/3rd, also

Also, it turns out she/he wanted to go eat it, but...

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

81

Token

Token = grouped sequence of characters

82

Type

Type = equivalence class (same sequences)
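To make the difference between tokens and types concrete, here is a small Python sketch over the four example documents. Splitting on whitespace is a simplifying assumption (the corner cases listed earlier show why real tokenizers need more), and `tokenize` is a name chosen for this example.

```python
docs = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}

def tokenize(text):
    """Naive tokenizer: split on whitespace only."""
    return text.split()

# Every occurrence is a token; identical character sequences form one type.
tokens = [tok for text in docs.values() for tok in tokenize(text)]
types = set(tokens)

print(len(tokens), "tokens,", len(types), "types")
```

"hour", "come" and "it" each occur more than once, so there are fewer types than tokens.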

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters's list

a an and are as at be by for from has he in

is it its of on that the to was were will with
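As an illustration, stop-word removal with the list above could look like the following sketch. Lowercasing before the lookup is an assumption made here so that "That" is treated like "that"; `remove_stop_words` is a hypothetical helper name.

```python
STOP_WORDS = set(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)

def remove_stop_words(tokens):
    """Drop every token whose lowercased form is in the stop list."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

doc4 = "Possess it merely That it should come to this".split()
print(remove_stop_words(doc4))
```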

86

Trend: from a large list (200-300) to no list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry
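The asymmetry can be seen in a few lines of Python. The expansion table below copies the three entries from the slide; the function names `expand` and `matches` are assumptions made for this sketch.

```python
# Query term -> terms that are actually searched (from the slide)
EXPANSIONS = {
    "Windows": ["Windows"],
    "windows": ["windows", "Windows", "window"],
    "window": ["window", "windows"],
}

def expand(query_term):
    """Return the list of index terms to search for a query term."""
    return EXPANSIONS.get(query_term, [query_term])

def matches(query_term, indexed_term):
    """True if a document containing `indexed_term` is found by `query_term`."""
    return indexed_term in expand(query_term)

print(matches("windows", "Windows"))  # the lowercase query finds the capitalized term
print(matches("Windows", "windows"))  # but not the other way round
```

With equivalence classes, the relation would be symmetric by construction; with expansion lists it does not have to be.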

90

When to expand

Upon indexing

The postings lists of Lift and Elevator are both expanded to their union; the query Lift| is looked up as is

Upon querying

The postings lists of Lift and Elevator are left as they are; the query Lift| is expanded to Lift OR Elevator
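The two options can be sketched as follows. The postings below are made-up values for illustration (they are not taken from the slide), and the `SYNONYMS` table and both function names are assumptions.

```python
# Hypothetical postings and synonym table, for illustration only
index = {"lift": [1, 5], "elevator": [4, 6]}
SYNONYMS = {"lift": ["elevator"], "elevator": ["lift"]}

def expand_index(index):
    """Expansion upon indexing: store the union under every synonym."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for synonym in SYNONYMS.get(term, []):
            merged.update(index.get(synonym, []))
        expanded[term] = sorted(merged)
    return expanded

def query_with_expansion(index, term):
    """Expansion upon querying: evaluate `term OR synonym1 OR ...`."""
    result = set(index.get(term, []))
    for synonym in SYNONYMS.get(term, []):
        result.update(index.get(synonym, []))
    return sorted(result)

print(expand_index(index)["lift"])          # larger index, plain lookup
print(query_with_expansion(index, "lift"))  # unchanged index, more work per query
```

Both routes return the same documents; the trade-off is index size against query-time work.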

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside
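A possible sketch of the first two normalization steps in Python, using the standard `unicodedata` module. The German umlaut table (ü -> ue and so on) and the `german` flag are assumptions added to reproduce the Zuerich example; truecasing is left out because it needs a trained model.

```python
import unicodedata

# Assumed transliteration table for German umlauts
GERMAN_UMLAUTS = str.maketrans({
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
})

def strip_diacritics(text):
    """Decompose characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(term, german=False):
    """Remove accents and diacritics, then lowercase."""
    term = unicodedata.normalize("NFC", term)
    if german:
        term = term.translate(GERMAN_UMLAUTS)
    return strip_diacritics(term).lower()

print(normalize("clich\u00e9"))
print(normalize("Z\u00fcrich", german=True))
print(normalize("CAT"))
```

Without the `german` flag, Zürich becomes zurich rather than zuerich, which is one reason normalization rules are language dependent.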

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE         valenci -> valence
(m>0) ANCI -> ANCE         hesitanci -> hesitance
(m>0) IZER -> IZE          digitizer -> digitize
(m>0) ABLI -> ABLE         conformabli -> conformable
(m>0) ALLI -> AL           radicalli -> radical
(m>0) ENTLI -> ENT         differentli -> different
(m>0) ELI -> E             vileli -> vile
(m>0) OUSLI -> OUS         analogousli -> analogous
(m>0) IZATION -> IZE       vietnamization -> vietnamize
(m>0) ATION -> ATE         predication -> predicate
(m>0) ATOR -> ATE          operator -> operate
(m>0) ALISM -> AL          feudalism -> feudal
(m>0) IVENESS -> IVE       decisiveness -> decisive
(m>0) FULNESS -> FUL       hopefulness -> hopeful
(m>0) OUSNESS -> OUS       callousness -> callous
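The rules above can be tried out with a short Python sketch. It covers only the rules listed on the slide, not the full Porter stemmer. The measure m counts vowel-consonant sequences in the stem, following Porter's definition; the function names are chosen for this example.

```python
# (suffix, replacement) pairs from the slide; every rule requires m > 0
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]

def is_consonant(word, i):
    """Porter's notion of consonant: y is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number m of vowel-to-consonant transitions in the stem."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_vowel:
            m += 1
        prev_vowel = not consonant
    return m

def apply_rules(word):
    """Apply the rule with the longest matching suffix, if its condition holds."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

for word in ["valenci", "vietnamization", "operator", "ration"]:
    print(word, "->", apply_rules(word))
```

The last word, ration, is left unchanged: its stem "r" has measure 0, so the condition m > 0 blocks the ATION rule.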

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois Triangle

[Figure: on each side, Words, Syntactic Structure and Semantic Structure from bottom to top, with Interlingua at the apex; Direct, Syntactic Transfer and Semantic Transfer link the two sides]

102

Lemmatization

Full morphological analysis

computer
compute
computes
computed
computation
computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term        Document frequency   Postings
you         1                    1
come        3                    1 3 4
most        1                    1
carefully   1                    1
upon        1                    1
your        1                    1
hour        3                    1 2 3
take        1                    2
thy         1                    2
fair        1                    2
Laertes     1                    2
time        1                    2
be          1                    2
thine       1                    2
my          1                    3
is          1                    3
almost      1                    3
possess     1                    4
it          1                    4
merely      1                    4
that        1                    4
should      1                    4
to          1                    4
this        1                    4
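As a reminder of how such postings come about, here is a minimal Python sketch that builds the index from the four example documents. Whitespace tokenization and lowercasing are simplifying assumptions, and `build_index` is a name chosen for this example.

```python
from collections import defaultdict

docs = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}

def build_index(docs):
    """Map each lowercased term to the sorted list of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_index(docs)
for term in ["come", "hour", "it"]:
    print(term, len(index[term]), index[term])
```

The length of each postings list is the document frequency of the term.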

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
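The intersection of two sorted postings lists can be sketched in Python as a linear merge with one pointer per list. The function name `intersect` is an assumption for this example.

```python
def intersect(a, b):
    """Intersect two sorted postings lists by advancing two pointers."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot be in b any more
        else:
            j += 1  # b[j] cannot be in a any more
    return result

list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each step advances at least one pointer, so the running time is linear in the total length of the two lists.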

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: jump in List A directly to 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip pointers to 4, 7, 10

128

In practice

Too short skips: many comparisons, waste of space

130

In practice

Too long skips: few comparisons, not many real opportunities to skip

132

In practice

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

√p skip pointers, where p = number of postings
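A possible Python sketch of intersection with skip pointers. It assumes evenly spaced skips of length about √p that are computed from the list index rather than stored explicitly; this layout and the function names are choices made for this illustration.

```python
import math

def skip_length(postings):
    """Skip distance of roughly sqrt(p) for a list of p postings."""
    return max(1, math.isqrt(len(postings)))

def intersect_with_skips(a, b):
    """Two-pointer intersection that follows a skip when it does not overshoot."""
    skip_a, skip_b = skip_length(a), skip_length(b)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Skip pointers sit at positions 0, skip_a, 2 * skip_a, ...
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result

list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

A skip is taken only if its target is not larger than the current element of the other list, so no common element can be jumped over.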

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Query: "ETH Zurich"

153

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

157

Inconvenient

Query: "Help ETH Zurich to flexibly react"

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
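The false positive can be reproduced with a small bi-word index in Python. The document IDs 1 and 2 and all function names are assumptions made for this sketch.

```python
from collections import defaultdict

def biwords(text):
    """All pairs of consecutive words, e.g. 'Help ETH', 'ETH Zurich', ..."""
    words = text.split()
    return [f"{first} {second}" for first, second in zip(words, words[1:])]

def build_biword_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index

def phrase_query(index, phrase):
    """AND together the postings of all bi-words of the phrase."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

Document 2 is returned although it does not contain the phrase: each of the five bi-words occurs somewhere in it, just not contiguously.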

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react
N    N   N      X  N         N          X  X  X  N        N

Pattern: N X* N (two N words, skipping any X words in between)

Help ETH
ETH Zurich
Zurich introduce
introduce techniques
techniques flexibly
flexibly react
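The extended bi-words above can be derived with a short Python sketch. The tagger here is a stand-in: it simply marks the words to, so and as with X and everything else with N, which matches the tags in the example but is not a real part-of-speech tagger.

```python
# Assumption: a tiny hand-written list replaces a real part-of-speech tagger
FUNCTION_WORDS = {"to", "so", "as"}

def tag(word):
    return "X" if word.lower() in FUNCTION_WORDS else "N"

def extended_biwords(text):
    """Pair each N word with the next N word, skipping the X words in between."""
    content_words = [word for word in text.split() if tag(word) == "N"]
    return [f"{a} {b}" for a, b in zip(content_words, content_words[1:])]

sentence = "Help ETH Zurich to introduce techniques so as to flexibly react"
print(" ".join(tag(word) for word in sentence.split()))
for biword in extended_biwords(sentence):
    print(biword)
```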

172

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

"Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

True negative

176

Inconvenient

Size of the vocabulary increases exponentially: (number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Term       Document ID   Term frequency   Positions
Help       C             1                1
ETH        C             1                2
Zurich     C             1                3
to         C             3                4 7 11
flexibly   C             1                5
react      C             1                6

Each row is a positional posting: document ID (as before), term frequency, positions

190

Positional index

Query: "ETH Zurich"

ETH: document C, term frequency 1, position 2

Zurich: document C, term frequency 1, position 3

Match: same document, and position of Zurich = position of ETH + 1
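A minimal Python sketch of a positional index and of the "+1" check for phrase queries. Positions start at 1 as on the slides; document D (the earlier false-positive example) and the function names are additions made for this illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {document ID -> list of positions}; term frequency = list length."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index

def phrase_query(index, phrase):
    """Documents where the k-th query word occurs at start position + k."""
    words = phrase.split()
    if not words:
        return []
    hits = []
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(start + k in index.get(word, {}).get(doc_id, [])
                   for k, word in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)

docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)
print(index["to"]["C"])
print(phrase_query(index, "ETH Zurich"))
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

Unlike the bi-word index, the positional index does not return document D for the long phrase, because the positions there are not consecutive.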

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morpheme by morpheme:
eat
go to
want
Past
<Frustration>
inferential/evidential (turns out)
3rd/3rd
also

Also, it turns out she/he wanted to go eat it, but ...

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters's list

a an and are as at be by for from has he in

is it its of on that the to was were will with
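The distinction between tokens, types, and stop words can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the regular expression used for tokenizing and the function names are choices made for this sketch, not something prescribed by the slides.

```python
import re

# The 25-word stop list shown on the slide.
STOP_WORDS = set(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)

def tokenize(text):
    """Split text into tokens (assumption: runs of letters, digits, apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", text)

def types(tokens):
    """A type is an equivalence class of identical tokens."""
    return set(tokens)

def remove_stop_words(tokens):
    """Drop tokens whose lowercased form is on the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

doc = "Possess it merely That it should come to this"
tokens = tokenize(doc)
print(len(tokens), "tokens,", len(types(tokens)), "types")
print(remove_stop_words(tokens))
```

The token "it" occurs twice in this document, which is why the number of types is smaller than the number of tokens.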

86

Trend: from a large list (200-300) to no list at all

87

Equivalence classes of terms (types)

to-day
today

U.S.A.
USA

window
windows
Windows

Zurich
Zürich
Zuerich
Züri

88

Normalization

U.S.A. → USA

to-day → today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows → Windows

windows → windows, Windows, window

window → window, windows

Introduces asymmetry

90

When to expand

Upon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expand

Upon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expand

Upon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expand

Upon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization: accents and diacritics

cliché → cliche

Zürich → Zuerich

95

Normalization: lowercasing

CAT → cat

Schweiz → schweiz
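A minimal sketch of these two normalization steps in Python, using the standard `unicodedata` module. The function names and the small German transliteration table are assumptions made for this illustration; real systems use more complete, language-specific rules.

```python
import unicodedata

# Hypothetical transliteration table for German (ü becomes ue, and so on).
GERMAN_TRANSLIT = {"ä": "ae", "ö": "oe", "ü": "ue",
                   "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}

def strip_diacritics(term):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(term, german=False):
    """Optionally transliterate German letters, then strip accents and lowercase."""
    term = unicodedata.normalize("NFC", term)
    if german:
        term = "".join(GERMAN_TRANSLIT.get(ch, ch) for ch in term)
    return strip_diacritics(term).lower()

print(normalize("cliché"))               # accent removed
print(normalize("Zürich", german=True))  # ü transliterated to ue
print(normalize("CAT"), normalize("Schweiz"))
```

Whether "Zürich" should map to "zurich" or "zuerich" depends on the language of the collection, which is why the sketch makes it a parameter.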

96

Normalization: truecasing

The company Apple has launched a new iProduct → Apple

Apples fall from trees → apples

Machine learning inside

97

Stemming

analysis → analysi
feature → featur
are → ar
easily → easili
visible → visibl
variations → variat
individual → individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI    -> ENCE    valenci        -> valence
(m>0) ANCI    -> ANCE    hesitanci      -> hesitance
(m>0) IZER    -> IZE     digitizer      -> digitize
(m>0) ABLI    -> ABLE    conformabli    -> conformable
(m>0) ALLI    -> AL      radicalli      -> radical
(m>0) ENTLI   -> ENT     differentli    -> different
(m>0) ELI     -> E       vileli         -> vile
(m>0) OUSLI   -> OUS     analogousli    -> analogous
(m>0) IZATION -> IZE     vietnamization -> vietnamize
(m>0) ATION   -> ATE     predication    -> predicate
(m>0) ATOR    -> ATE     operator       -> operate
(m>0) ALISM   -> AL      feudalism      -> feudal
(m>0) IVENESS -> IVE     decisiveness   -> decisive
(m>0) FULNESS -> FUL     hopefulness    -> hopeful
(m>0) OUSNESS -> OUS     callousness    -> callous
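The rules above can be tried out with a short Python sketch. It covers only the rules listed on this slide, not the full Porter stemmer, and the helper names (`is_consonant`, `measure`, `apply_rules`) are chosen for this illustration. The measure m counts the vowel-consonant sequences in the stem that remains after removing the suffix, following Porter's definition of a word as [C](VC)^m[V].

```python
# Only the rules shown on the slide (lowercased), not the complete stemmer.
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]

def is_consonant(word, i):
    """Porter's definition: 'y' is a consonant unless preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1
        prev_is_vowel = not cons
    return m

def apply_rules(word):
    """Apply the longest matching suffix rule if the stem satisfies m > 0."""
    matches = [(s, r) for s, r in RULES if word.endswith(s)]
    if not matches:
        return word
    suffix, replacement = max(matches, key=lambda rule: len(rule[0]))
    stem = word[: -len(suffix)]
    return stem + replacement if measure(stem) > 0 else word

for w in ["valenci", "digitizer", "vietnamization", "hopefulness", "station"]:
    print(w, "->", apply_rules(w))
```

The condition m > 0 is what keeps a short word such as "station" intact: its stem "st" contains no vowel-consonant sequence.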

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

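The Vauquois Triangle

[Diagram: on both the source side and the target side, the levels rise from Words through Syntactic Structure and Semantic Structure to the Interlingua at the apex. The links between the two sides are Direct (words), Syntactic Transfer, and Semantic Transfer.]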

102

Lemmatization

Full morphological analysis

computer
compute
computes
computed
computation
computing

103

Lemmatization or Stemming

Lemmatization does not help, and can even degrade performance, for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term        Doc. frequency   Postings
you         1                1
come        3                1 3 4
most        1                1
carefully   1                1
upon        1                1
your        1                1
thine       1                2
be          1                2
time        1                2
Laertes     1                2
thy         1                2
fair        1                2
take        1                2
my          1                3
hour        3                1 2 3
is          1                3
almost      1                3
possess     1                4
merely      1                4
that        1                4
it          1                4
should      1                4
to          1                4
this        1                4

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12
List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
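As a reminder of how this works, here is a minimal Python sketch of the linear merge over two sorted postings lists. The function name `intersect` is chosen for this illustration.

```python
def intersect(a, b):
    """Intersect two sorted postings lists with one pointer per list."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document ID in both lists: keep it and advance both pointers.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return result

list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each pointer only moves forward, so the number of steps is at most the sum of the two list lengths.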

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B: 1 3 10

Magical pointer: jump directly to 10 in List A

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips: many comparisons, waste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips: few comparisons, not many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

In practice: √p skip pointers, where p is the number of postings
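The skip-pointer idea can be sketched in Python as follows. This is a simplified illustration, not a production layout: skip pointers are not stored explicitly but assumed to sit at every position that is a multiple of the skip length, which is set to roughly √p. The names `skip_length`, `advance`, and `intersect_with_skips` are hypothetical.

```python
import math

def skip_length(postings):
    """Evenly spaced skips of length about sqrt(p), p = number of postings."""
    return max(1, math.isqrt(len(postings)))

def advance(postings, pos, skip, target):
    """Move past postings[pos]; follow a skip pointer if it does not overshoot target."""
    nxt = pos + skip
    if pos % skip == 0 and nxt < len(postings) and postings[nxt] <= target:
        return nxt  # everything in between is smaller than target, so nothing is missed
    return pos + 1

def intersect_with_skips(a, b):
    """Merge-style intersection that takes skips whenever they are safe."""
    skip_a, skip_b = skip_length(a), skip_length(b)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, skip_a, b[j])
        else:
            j = advance(b, j, skip_b, a[i])
    return result

list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

With these lists, the pointer in List A can jump from 4 to 7 and from 7 to 10 instead of visiting every posting in between.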

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137

With a posting list

ETH: 1 2 4 5 8 9 10
Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index:
Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

151

Bi-word indices

Query: "ETH Zurich"

Index:
Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

153

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Index:
Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Intersect: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

157

Inconvenient

Query: "Help ETH Zurich to flexibly react"

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
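The false positive can be reproduced with a small Python sketch of a bi-word index. The document IDs 1 and 2 and the function names are assumptions made for this illustration; tokenization is reduced to splitting on whitespace.

```python
from collections import defaultdict

def biwords(text):
    """All pairs of adjacent words, each joined into one bi-word term."""
    words = text.split()
    return [f"{w1} {w2}" for w1, w2 in zip(words, words[1:])]

def build_biword_index(docs):
    """Map each bi-word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index

def phrase_candidates(index, phrase):
    """Intersect the postings of all bi-words of the phrase."""
    postings = [index.get(bw, set()) for bw in biwords(phrase)]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
query = "Help ETH Zurich to flexibly react"
hits = phrase_candidates(index, query)
print(hits)
print({doc_id: query in docs[doc_id] for doc_id in hits})
```

Document 2 contains every bi-word of the query but not the phrase itself, so it is returned even though it should not be.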

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react
N    N   N      X  N         N          X  X  X  N        N

Pattern: N X* N

Extended bi-words:
Help ETH
ETH Zurich
Zurich introduce
introduce techniques
techniques flexibly
flexibly react

172

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Index:
Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Intersect: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index:
Help ETH Zurich
ETH Zurich to
Zurich to flexibly
to flexibly react
flexibly react to

Intersect: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index:
Help ETH Zurich to
ETH Zurich to flexibly
Zurich to flexibly react
to flexibly react to

Intersect: "Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Inconvenient

Size of the vocabulary increases exponentially: (number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Each positional posting holds the document ID (as before), the term frequency, and the positions:

Help: C, 1, [1]
ETH: C, 1, [2]
Zurich: C, 1, [3]
to: C, 3, [4, 7, 11]
flexibly: C, 1, [5]
react: C, 1, [6]

190

Positional index

Query: "ETH Zurich"

ETH: C, 1, [2]
Zurich: C, 1, [3]

Both terms occur in document C, and the position of Zurich (3) is the position of ETH (2) +1
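A positional index and the "+1" check can be sketched in Python as follows. The document ID "D" and the function names are hypothetical additions for this illustration; positions are counted from 1 as on the slides, and the term frequency is simply the length of the position list.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return documents where the k-th phrase term occurs at start position + k."""
    terms = phrase.split()
    if not terms or any(t not in index for t in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            if all(start + k in index[t].get(doc_id, [])
                   for k, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)
print(index["to"]["C"])
print(phrase_search(index, "ETH Zurich"))
print(phrase_search(index, "Help ETH Zurich to flexibly react"))
```

Unlike the bi-word index, the longer phrase query no longer matches the second document, because "flexibly" does not occur right after "to" at the expected position.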

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

30

Collecting documents: encoding

Possess it merely That it should come to this

P o s s e s s   i t   m e r e l y   T h a t   i t   s h o u

01010000 01101111 01110011 01110011 (the first four characters: P o s s)

Here: ASCII

31

ASCII Table

     0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0    Control characters
1    Control characters
2    SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3    0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4    @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5    P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6    `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7    p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL

P = 50 (hex) = 0101 0000

34

UTF-8

Character   Codepoint   Codepoint in binary   Bits needed       UTF-8 (variable length)
P           U+0050      101 0000              at most 7 bits    01010000
π           U+03C0      11 1100 0000          at most 11 bits   11001111 10000000
€           U+20AC      10 0000 1010 1100     at most 16 bits   11100010 10000010 10101100
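The three rows of this table can be checked with Python's built-in `str.encode`. The helper name `utf8_bits` is chosen for this sketch.

```python
def utf8_bits(ch):
    """Return the UTF-8 encoding of a character as space-separated bytes in binary."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))

for ch in ["P", "\u03c0", "\u20ac"]:  # P, π, €
    codepoint = ord(ch)
    print(f"U+{codepoint:04X}", f"{codepoint:b}", utf8_bits(ch))
```

The leading bits of the first byte (0, 110, 1110) tell a decoder how many bytes the character occupies, and every continuation byte starts with 10.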

44

Collecting documents: first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents: first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents: first challenge

Software-specific encoding

47

Collecting documents: first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents: first challenge

Binary encoding

49

Collecting documents: first challenge

Classification (e.g., ML)

Language

Encoding

Type

UTF-8

50

Collecting documents: second challenge

Fine-grained documents (example: e-mail archive)

Grouping into a single document (example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation
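For these four documents, building the inverted index can be sketched in Python as below. Lowercasing every token and splitting on whitespace are simplifying assumptions of this sketch, and the function name is hypothetical.

```python
from collections import defaultdict

DOCS = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # A document is treated as a set of words: duplicates are dropped.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)

index = build_inverted_index(DOCS)
for term in ("hour", "come", "it"):
    print(term, len(index[term]), index[term])
```

Processing the documents in increasing ID order keeps every postings list sorted, which the intersection algorithm relies on.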

58

Building an inverted index

This is not trivial

Throw away punctuation

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel-Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Query: ETH Zurich

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Drawback

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
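A small Python sketch can reproduce this false positive. It assumes whitespace tokenization and lowercasing; the function names and the document IDs 1 and 2 are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def biwords(text):
    """Return the list of adjacent word pairs of a text."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def build_biword_index(docs):
    """Map each biword to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """Intersect the postings of all biwords of the phrase (two words or more)."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


# Hypothetical document IDs; the texts are the two sentences from the slides.
DOCS = {
    1: "Help ETH Zurich to flexibly react to new challenges "
       "and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
INDEX = build_biword_index(DOCS)
```

Calling `phrase_candidates(INDEX, "Help ETH Zurich to flexibly react")` returns both documents, although document 2 does not contain the phrase: every biword of the query occurs somewhere in it, just not contiguously.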

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

NX*N

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react

172

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Drawback

Size of the vocabulary increases exponentially

(number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Positional posting = document ID (as before), term frequency, positions

Help: C, 1: 1

ETH: C, 1: 2

Zurich: C, 1: 3

to: C, 3: 4 7 11

flexibly: C, 1: 5

react: C, 1: 6

190

Positional index

Query: ETH Zurich

ETH: C, 1: 2

Zurich: C, 1: 3

Position of Zurich = position of ETH + 1
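The positional lookup can be sketched as follows. This is an illustrative sketch that assumes whitespace tokenization, lowercasing and 1-based positions; the function names and the second document "D" are assumptions added here, only document C comes from the slides.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {document ID: sorted list of 1-based positions}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index


def phrase_search(index, phrase):
    """Return the documents in which the query terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return set()
    candidates = set.intersection(*(set(index[term]) for term in terms))
    hits = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            # The k-th query term must sit exactly k positions after the first one.
            if all(start + k in index[term][doc_id] for k, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


DOCS = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",  # hypothetical
}
INDEX = build_positional_index(DOCS)
```

Here `INDEX["to"]["C"]` holds the positions 4, 7 and 11, and its length is the term frequency 3. Unlike the biword index, `phrase_search(INDEX, "Help ETH Zurich to flexibly react")` returns only document C.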

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

31

ASCII Table

     0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0    Control characters
1    Control characters
2    SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3    0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4    @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5    P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6    `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7    p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL

P = 50 (hex) = 0 1 0 1 0 0 0 0

34

UTF-8

Character   Codepoint   Codepoint in binary    UTF-8 (variable length)        Codepoint size
P           U+0050      101 0000               01010000                       at most 7 bits
π           U+03C0      11 1100 0000           11001111 10000000              at most 11 bits
€           U+20AC      10 0000 1010 1100      11100010 10000010 10101100     at most 16 bits
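The three rows of the table can be checked with Python's built-in string encoding. The helper names `codepoint` and `utf8_bits` are made up for this sketch.

```python
def codepoint(ch):
    """Return the Unicode codepoint of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


def utf8_bits(ch):
    """Return the UTF-8 encoding of a character as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))


for ch in ("P", "\u03c0", "\u20ac"):  # P, Greek small letter pi, euro sign
    print(ch, codepoint(ch), utf8_bits(ch))
```

The one-, two- and three-byte patterns `0xxxxxxx`, `110xxxxx 10xxxxxx` and `1110xxxx 10xxxxxx 10xxxxxx` are visible in the leading bits of each printed byte.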

44

Collecting documents: first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents: first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents: first challenge

Software-specific encoding

47

Collecting documents: first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents: first challenge

Binary encoding

49

Collecting documents: first challenge

Classification (e.g. ML)

Language

Encoding

Type

UTF-8

51

Collecting documents: second challenge

Fine-grained documents (Example: E-Mail archive)

54

Collecting documents: second challenge

Grouping to a single document (Example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

58

Building an inverted index

Throw away punctuation

This is not trivial

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt

59

Corner cases

Hewlett-Packard

State-of-the-art

co-education

the hold-him-back-and-drag-him-away maneuver

data base

San Francisco

Los Angeles-based company

cheap San Francisco-Los Angeles fares

York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases: numbers

3/20/91

20/3/91

Mar 20, 1991

B-52

100.2.86.144

(800) 234-2333

800.234.2333

Credits: H. Schütze, LMU
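Why throwing away punctuation is not trivial can be seen by comparing two tokenizers in Python. Both are illustrative assumptions rather than the lecture's method: `naive_tokenize` deletes all ASCII punctuation, `regex_tokenize` keeps a few characters when they sit between word characters.

```python
import re
import string


def naive_tokenize(text):
    """Delete every ASCII punctuation character, then split on whitespace."""
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    return stripped.split()


def regex_tokenize(text):
    """Keep runs of word characters joined by a single internal @ . ' or - ."""
    return re.findall(r"\w+(?:[@.'-]\w+)*", text)


sample = "name@example.com isn't Jake O'Neill"
print(naive_tokenize(sample))
print(regex_tokenize(sample))
```

The naive version fuses the e-mail address into one meaningless string and drops the apostrophes. The regular expression keeps them, but it is itself only one possible policy: it also keeps `Hewlett-Packard` and `100.2.86.144` whole, which may or may not be what a given collection needs.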

61

English

I would like a coffee

You would like a coffee

He would like a coffee

We would like a coffee

You would like a coffee

They would like a coffee

62

English

I want a coffee

You want a coffee

He wants a coffee

We want a coffee

You want a coffee

They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe

Du skulle vilja ha en kaffe

Han skulle vilja ha en kaffe

Vi skulle vilja ha en kaffe

Ni skulle vilja ha en kaffe

De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee

Du möchtest einen Kaffee

Er möchte einen Kaffee

Wir möchten einen Kaffee

Ihr möchtet einen Kaffee

Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Chinese example sentences; characters not recoverable]

68

Hindi

मुझे इन्फ़र्मेशन रिट्रीवोल बहुत पसंद है

मैं इस टर्म अच्छा पढ़ रहा हूँ

मैं इस टर्म अच्छा पढ़ रही हूँ

69

Hindi

मुझे: to me (oblique case)

है: 3rd person

हूँ: 1st person

रहा (raha): male speaker

रही (rahee): female speaker

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

eat

go to

want

Past

<Frustration>

inferential/evidential (turns out)

3rd 3rd

also

Also, it turns out she/he wanted to go eat it, but ...

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

81

Token

Token = grouped sequence of characters

82

Type

Type = equivalence class (same sequences)

84

Stop words
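The difference between tokens and types can be counted directly on the four example documents. This sketch assumes whitespace tokenization with no case folding; the names `DOCS`, `tokens` and `types` are chosen here for illustration.

```python
DOCS = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def tokens(docs):
    """Every occurrence of a character sequence is one token."""
    return [word for text in docs.values() for word in text.split()]


def types(docs):
    """A type is the class of all tokens made of the same character sequence."""
    return set(tokens(docs))


print(len(tokens(DOCS)), "tokens,", len(types(DOCS)), "types")
```

The collection has 29 tokens but only 24 types, because `hour` and `come` occur three times each and `it` occurs twice.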

85

Reuters's list

a an and are as at be by for from has he in

is it its of on that the to was were will with

86

Trend

From a large list (200-300) to no list at all
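Stop-word removal with the 25-word list above can be sketched in a few lines of Python. The names `STOP_WORDS` and `remove_stop_words` are made up here, and matching is assumed to be case-insensitive.

```python
STOP_WORDS = frozenset(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)


def remove_stop_words(tokens):
    """Drop every token whose lowercase form is in the stop list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


print(remove_stop_words("Possess it merely That it should come to this".split()))
```

On document 4 this removes `it`, `That`, `it` and `to` and keeps five tokens. Phrase queries made mostly of such words lose their content, which is one reason for the trend towards no list at all.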

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing

Index terms: Lift, Elevator

Query: Lift

Expansion: the postings of Lift and Elevator are merged in the index

[Figure: postings lists of Lift and Elevator before and after expansion]

92

When to expand

Upon querying

Index terms: Lift, Elevator (postings unchanged)

Query: Lift

Expansion: Lift OR Elevator
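The two expansion strategies can be compared in a short Python sketch. The postings below are hypothetical document IDs chosen only for illustration, not the values from the slide figure, and the function names are made up.

```python
# Hypothetical postings and synonym table, for illustration only.
INDEX = {"lift": [1, 5], "elevator": [4, 6]}
SYNONYMS = {"lift": {"elevator"}, "elevator": {"lift"}}


def expand_index(index, synonyms):
    """Expansion upon indexing: merge the postings of synonyms into each term."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for synonym in synonyms.get(term, ()):
            merged |= set(index.get(synonym, ()))
        expanded[term] = sorted(merged)
    return expanded


def search_with_query_expansion(index, term, synonyms):
    """Expansion upon querying: evaluate term OR synonym_1 OR ... on the plain index."""
    result = set()
    for query_term in {term} | synonyms.get(term, set()):
        result |= set(index.get(query_term, ()))
    return sorted(result)
```

Both routes return the same documents for the query `lift`. Expanding at indexing time costs index space, while expanding at query time costs extra postings-list merges per query.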

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside
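Accent removal and lowercasing can be sketched with Python's standard `unicodedata` module. The function names and the umlaut table are assumptions for this sketch. Generic diacritic stripping turns Zürich into Zurich, so the Zuerich form from the slide needs a German-specific mapping.

```python
import unicodedata


def strip_diacritics(term):
    """Decompose characters and drop the combining marks (accents, umlaut dots)."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# German-specific transliteration; assumed here, not taken from the slides.
GERMAN_UMLAUTS = str.maketrans({
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
})


def transliterate_umlauts(term):
    """Replace precomposed German umlauts by their two-letter spelling."""
    return unicodedata.normalize("NFC", term).translate(GERMAN_UMLAUTS)


def lowercase(term):
    """Fold a term to lowercase."""
    return term.lower()
```

Truecasing is not covered by this sketch, since deciding between Apple the company and apples the fruit needs a trained model rather than a character rule.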

97

Stemming

analysis -> analysi

feature -> featur

are -> ar

easily -> easili

visible -> visibl

variations -> variat

individual -> individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE          valenci -> valence
(m>0) ANCI -> ANCE          hesitanci -> hesitance
(m>0) IZER -> IZE           digitizer -> digitize
(m>0) ABLI -> ABLE          conformabli -> conformable
(m>0) ALLI -> AL            radicalli -> radical
(m>0) ENTLI -> ENT          differentli -> different
(m>0) ELI -> E              vileli -> vile
(m>0) OUSLI -> OUS          analogousli -> analogous
(m>0) IZATION -> IZE        vietnamization -> vietnamize
(m>0) ATION -> ATE          predication -> predicate
(m>0) ATOR -> ATE           operator -> operate
(m>0) ALISM -> AL           feudalism -> feudal
(m>0) IVENESS -> IVE        decisiveness -> decisive
(m>0) FULNESS -> FUL        hopefulness -> hopeful
(m>0) OUSNESS -> OUS        callousness -> callous

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret
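The rule table of the Porter stemmer slide can be sketched in Python. This covers only the fifteen rules listed there, not the full algorithm, and the helper names are made up. The condition m>0 uses Porter's measure: the number of vowel-consonant sequences in the stem that remains once the suffix is removed.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Porter's definition: y is a consonant only at the start or after a vowel."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m


# The fifteen (suffix, replacement) pairs from the slide, longest suffix first.
RULES = sorted(
    [("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
     ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
     ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
     ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous")],
    key=lambda rule: len(rule[0]),
    reverse=True,
)


def apply_rules(word):
    """Apply the longest matching rule if the remaining stem has measure > 0."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word
```

Sorting by suffix length makes `vietnamization` match IZATION before ATION. Once the longest suffix matches, no shorter rule is tried, even if the m>0 condition fails.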

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

[Figure: Vauquois triangle. On each side, from bottom to top: Words, Syntactic Structure, Semantic Structure, with Interlingua at the apex. Between the two sides: Direct, Syntactic Transfer, Semantic Transfer.]

102

Lemmatization

Full morphological analysis

computer

compute

computes

computed

computation

computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term         Document frequency   Postings
you          1                    1
come         3                    1 3 4
most         1                    1
carefully    1                    1
upon         1                    1
your         1                    1
hour         3                    1 2 3
take         1                    2
thy          1                    2
fair         1                    2
Laertes      1                    2
time         1                    2
be           1                    2
thine        1                    2
my           1                    3
is           1                    3
almost       1                    3
possess      1                    4
it           1                    4
merely       1                    4
that         1                    4
should       1                    4
to           1                    4
this         1                    4

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

End

Intersection of A and B: 1 4 8 12
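As a reminder in code, the linear merge can be written in Python as follows. This is a minimal sketch with a made-up function name, assuming both postings lists are sorted and free of duplicates.

```python
def intersect(a, b):
    """Intersect two sorted postings lists with one pointer per list."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot be in b any more
        else:
            j += 1  # b[j] cannot be in a any more
    return result


print(intersect([1, 2, 4, 5, 8, 9, 10, 12], [1, 3, 4, 6, 7, 8, 11, 12]))
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.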

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3 10 12


32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents: first challenge

ASCII, UTF-8, ISO-Latin-1

User-defined, annotated in document, machine learning

Software-specific encoding

Data-specific encoding: &amp; &lt;

Binary encoding

Classification (e.g., ML): language, encoding, type

UTF-8

50

Collecting documents: second challenge

Fine-grained documents (example: e-mail archive)

Grouping to a single document (example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

This is not trivial:

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt
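A minimal sketch of why throwing away punctuation is not trivial. The function `tokenize` and its `replacement` parameter are hypothetical names for this illustration; the two strategies shown (deleting punctuation, or replacing it with a space) are just two naive options, not a recommended tokenizer.

```python
import re

# Anything that is neither a word character nor whitespace counts as punctuation here.
PUNCTUATION = re.compile(r"[^\w\s]")


def tokenize(text: str, replacement: str = "") -> list[str]:
    """Remove punctuation (or replace it with `replacement`), then split on whitespace."""
    return PUNCTUATION.sub(replacement, text).split()


for example in ["name@example.com", "isn't", "Jake O'Neill", "learn-it-all-by-heart"]:
    # Deleting glues pieces together; replacing with a space tears them apart.
    print(example, tokenize(example), tokenize(example, " "))
```

Neither strategy gives a satisfying result on all four examples, which is the point of the slide.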

59

Corner cases

Hewlett-Packard

State-of-the-art

co-education

the hold-him-back-and-drag-him-away maneuver

data base

San Francisco

Los Angeles-based company

cheap San Francisco-Los Angeles fares

York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases: numbers

3/20/91

20/3/91

Mar 20, 1991

B-52

100.2.86.144

(800) 234-2333

800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Chinese example sentence; the characters were lost in extraction]

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morpheme by morpheme: eat, go to, want, past, <frustration>, inferential evidential (turns out), 3rd/3rd, also

Also, it turns out she/he wanted to go eat it, but ...

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

81

Token

Token = grouped sequence of characters

82

Type

Type = equivalence class (same sequences)
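The difference between tokens and types can be made concrete on the four example documents. This is a small sketch that splits on whitespace only and keeps case as it is; the variable names are chosen for this illustration.

```python
from collections import Counter

docs = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# A token is one occurrence of a character sequence in the text.
tokens = [token for doc in docs for token in doc.split()]

# A type is the class of all tokens that consist of the same character sequence.
type_counts = Counter(tokens)

print(len(tokens), "tokens,", len(type_counts), "types")
print("hour:", type_counts["hour"], "come:", type_counts["come"], "it:", type_counts["it"])
```

Every token belongs to exactly one type, so the number of types can never exceed the number of tokens.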

84

Stop words

Reuters' stop word list:

a an and are as at be by for from has he in

is it its of on that the to was were will with
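A minimal sketch of stop word removal with the 25-word list above. The function name `remove_stop_words` is hypothetical, and lowercasing before the lookup is an assumption of this example, not something the slide prescribes.

```python
REUTERS_STOP_WORDS = set(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only the tokens whose lowercased form is not in the stop list."""
    return [token for token in tokens if token.lower() not in REUTERS_STOP_WORDS]


print(remove_stop_words("Possess it merely That it should come to this".split()))
```

Note that "this" survives because it is not on the list, while "That" is dropped once it is lowercased.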

86

Trend: from a large list (200-300) to no list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings lists of the two terms are merged when the index is built.

Before expansion:

Lift -> 1 5

Elevator -> 1 4 6

After expansion:

Lift -> 1 4 5 6

Elevator -> 1 4 5 6

Query: Lift

Upon querying: the index stays as it is and the query is expanded.

Lift -> 1 5

Elevator -> 1 4 6

Query: Lift, expanded to Lift OR Elevator
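Both options can be sketched in a few lines. The postings used here (Lift in documents 1 and 5, Elevator in documents 1, 4 and 6) are an assumed reading of the slide, and the names `expand_index`, `query_with_expansion` and `synonyms` are made up for this illustration.

```python
index = {"lift": [1, 5], "elevator": [1, 4, 6]}  # assumed example postings
synonyms = {"lift": ["elevator"], "elevator": ["lift"]}


def expand_index(index: dict, synonyms: dict) -> dict:
    """Expansion upon indexing: merge the postings of related terms once."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for other in synonyms.get(term, []):
            merged |= set(index.get(other, []))
        expanded[term] = sorted(merged)
    return expanded


def query_with_expansion(index: dict, synonyms: dict, term: str) -> list[int]:
    """Expansion upon querying: evaluate `term OR synonym ...` on the plain index."""
    result = set(index.get(term, []))
    for other in synonyms.get(term, []):
        result |= set(index.get(other, []))
    return sorted(result)


print(expand_index(index, synonyms))
print(query_with_expansion(index, synonyms, "lift"))
```

Both variants return the same documents for the query Lift. Expanding the index costs space, while expanding the query costs time at every lookup.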

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz
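A possible sketch of these normalization steps in Python, using the standard `unicodedata` module. The German transliteration table and the function names are assumptions of this example; whether ü becomes "ue" or "u" is a design decision that depends on the language of the collection.

```python
import unicodedata

# Assumed transliteration table for German; other languages need other rules.
GERMAN = str.maketrans(
    {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}
)


def transliterate_german(term: str) -> str:
    """Zürich -> Zuerich (compose first so that each umlaut is a single character)."""
    return unicodedata.normalize("NFC", term).translate(GERMAN)


def strip_diacritics(term: str) -> str:
    """cliché -> cliche: decompose, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(term: str) -> str:
    """Apply transliteration, diacritics removal and lowercasing, in that order."""
    return strip_diacritics(transliterate_german(term)).lower()


for term in ["cliché", "Zürich", "CAT", "Schweiz"]:
    print(term, "->", normalize(term))
```

Documents and queries have to go through the same normalization, otherwise a normalized query term will not match the index.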

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi

feature -> featur

are -> ar

easily -> easili

visible -> visibl

variations -> variat

individual -> individu

Chop letters of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer

(m>0) ENCI -> ENCE         valenci -> valence
(m>0) ANCI -> ANCE         hesitanci -> hesitance
(m>0) IZER -> IZE          digitizer -> digitize
(m>0) ABLI -> ABLE         conformabli -> conformable
(m>0) ALLI -> AL           radicalli -> radical
(m>0) ENTLI -> ENT         differentli -> different
(m>0) ELI -> E             vileli -> vile
(m>0) OUSLI -> OUS         analogousli -> analogous
(m>0) IZATION -> IZE       vietnamization -> vietnamize
(m>0) ATION -> ATE         predication -> predicate
(m>0) ATOR -> ATE          operator -> operate
(m>0) ALISM -> AL          feudalism -> feudal
(m>0) IVENESS -> IVE       decisiveness -> decisive
(m>0) FULNESS -> FUL       hopefulness -> hopeful
(m>0) OUSNESS -> OUS       callousness -> callous
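The rules above can be tried out with a small sketch. It covers only the rules listed on this slide, not the full Porter algorithm; `measure` computes m, the number of vowel-consonant sequences in the stem, and the function names are chosen for this illustration.

```python
import re

RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def measure(stem: str) -> int:
    """Return m for a stem of the form [C](VC)^m[V]."""
    pattern = []
    for i, ch in enumerate(stem):
        # y counts as a vowel only when it follows a consonant.
        is_vowel = ch in "aeiou" or (ch == "y" and i > 0 and pattern[-1] == "c")
        pattern.append("v" if is_vowel else "c")
    return len(re.findall(r"v+c+", "".join(pattern)))


def apply_rules(word: str) -> str:
    """Apply the longest matching suffix rule, provided the remaining stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for word in ["valenci", "digitizer", "vietnamization", "operator", "hopefulness"]:
    print(word, "->", apply_rules(word))
```

Trying the longest suffix first matters for words such as vietnamization, which ends in both IZATION and ATION.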

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

[Figure: triangle with the levels Words, Syntactic Structure, Semantic Structure and Interlingua on both the source and the target side; the two sides are connected by Direct, Syntactic Transfer and Semantic Transfer]

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term         Frequency   Postings
you          1           1
come         3           1 3 4
most         1           1
carefully    1           1
upon         1           1
your         1           1
thine        1           2
be           1           2
time         1           2
Laertes      1           2
thy          1           2
fair         1           2
take         1           2
my           1           3
hour         3           1 2 3
is           1           3
almost       1           3
possess      1           4
merely       1           4
that         1           4
it           1           4
should       1           4
to           1           4
this         1           4

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
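As a reminder of how the intersection proceeds, here is a minimal sketch of the linear merge on two sorted postings lists. The function name `intersect` is chosen for this example.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted postings lists with one pointer per list."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])  # same document ID in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer that lags behind
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.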

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

How could we make this faster?

124

After 1 and 3 have been matched, list B moves on to 10, while list A still has to step through 4, 5, 6, 7, 8 and 9.

A "magical pointer" lets list A jump directly to 10.

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip targets: 4 7 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: about √p evenly spaced skip pointers, where p is the number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
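A sketch of intersection with skip pointers, under the assumption that skips are placed every ⌊√p⌋ postings. The helper names `skip_targets` and `intersect_with_skips` are made up for this example, and the skips are stored in a dictionary instead of inside the postings list to keep the code short.

```python
import math


def skip_targets(postings: list[int]) -> dict[int, int]:
    """Map a position to the position its skip pointer leads to."""
    step = math.isqrt(len(postings))
    if step < 2:
        return {}  # a skip of length 1 is just the normal next pointer
    return {i: i + step for i in range(0, len(postings) - step, step)}


def intersect_with_skips(a: list[int], b: list[int]) -> list[int]:
    """Linear merge that follows a skip pointer whenever it does not overshoot."""
    skips_a, skips_b = skip_targets(a), skip_targets(b)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            if i in skips_a and a[skips_a[i]] <= b[j]:
                while i in skips_a and a[skips_a[i]] <= b[j]:
                    i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                while j in skips_b and b[skips_b[j]] <= a[i]:
                    j = skips_b[j]
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

On these two lists, the pointer into list A jumps from 4 to 7 and then to 10 instead of visiting every posting in between.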

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich" (quotes = phrase search)

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index:

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

...

Query: ETH Zurich

Query: Help ETH Zurich to flexibly react

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

157

Inconvenient

Query: Help ETH Zurich to flexibly react

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive: the document contains every biword of the query, but not the phrase itself
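The false positive can be reproduced with a small biword index. This is a sketch under simple assumptions (whitespace tokenization, lowercasing, document IDs 1 and 2 made up for the example); the function names are hypothetical.

```python
from collections import defaultdict


def biwords(text: str) -> list[tuple[str, str]]:
    """Return all pairs of consecutive words."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


def build_biword_index(docs: dict[int, str]) -> dict:
    """Map each biword to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pair in biwords(text):
            index[pair].add(doc_id)
    return index


def phrase_candidates(index: dict, phrase: str) -> set[int]:
    """AND together the postings of all biwords of the phrase."""
    postings = [index.get(pair, set()) for pair in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
query = "Help ETH Zurich to flexibly react"

candidates = phrase_candidates(index, query)
true_matches = {doc_id for doc_id, text in docs.items() if query.lower() in text.lower()}
print(candidates, true_matches)
```

Document 2 is returned as a candidate although it does not contain the phrase, so the result of a biword lookup on a longer phrase still needs to be verified.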

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

N    N   N      X  N         N          X  X  X  N        N

Pattern: N X* N

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index:

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index:

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

"Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

True negative

176

Inconvenient

Size of the vocabulary increases exponentially with the phrase length n: up to (number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index:

Help -> C, 1: 1

ETH -> C, 1: 2

Zurich -> C, 1: 3

to -> C, 3: 4 7 11

flexibly -> C, 1: 5

react -> C, 1: 6

Positional posting: document ID (as before), term frequency, positions

Query: ETH Zurich

ETH -> C, 1: 2

Zurich -> C, 1: 3

Position of Zurich = position of ETH + 1
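A minimal sketch of a positional index and of a phrase query on top of it. Positions start at 1 as on the slides; the document ID D, the second document and the function names are assumptions made for this illustration, and terms are kept case-sensitive for simplicity.

```python
from collections import defaultdict


def build_positional_index(docs: dict[str, str]) -> dict:
    """Map term -> document ID -> sorted list of positions (1-based)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.split(), start=1):
            index[word][doc_id].append(position)
    return index


def phrase_query(index: dict, phrase: str) -> set[str]:
    """Return the documents in which the words occur at consecutive positions."""
    words = phrase.split()
    if not words or any(word not in index for word in words):
        return set()
    result = set()
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # The k-th word of the phrase must sit exactly k positions after the first.
            if all(
                doc_id in index[word] and start + k in index[word][doc_id]
                for k, word in enumerate(words)
            ):
                result.add(doc_id)
                break
    return result


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)

print(index["to"]["C"])  # positions of "to" in document C
print(phrase_query(index, "ETH Zurich"))
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

The term frequency of a posting is simply the length of its position list. Unlike the biword index, the positional index rejects document D for the long phrase, at the cost of storing every position.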

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

33

ASCII Table

     0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0    Control characters
1    Control characters
2    SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3    0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4    @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5    P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6    `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7    p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL

P = 50 (hex) = 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

How could we make this faster?

After 3 has been matched, List B moves on to 10. A "magical pointer" would let List A jump directly to 10 instead of stepping through 4, 5, 6, 7, 8 and 9.

In practice

1 2 3 4 5 6 7 8 9 10 12
Skip pointers lead to: 4 7 10

Short skips: many comparisons, waste of space
Long skips: few comparisons, but not many real opportunities to skip

In practice: about √p evenly spaced skip pointers, where p is the number of postings in the list.
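A possible sketch of intersection with skip pointers is given below. It assumes that skips are evenly spaced, every ⌊√p⌋ positions, and that they are computed on the fly from the list index rather than stored in the postings list; both choices, and the name `intersect_with_skips`, are assumptions made for this example.

```python
import math


def intersect_with_skips(a, b):
    """Intersect two sorted postings lists, following skip pointers when safe.

    Assumption: a skip pointer leaves every floor(sqrt(len(list)))-th position
    and points floor(sqrt(len(list))) positions ahead.
    """
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    result = []
    i, j = 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Follow the skip only if its target does not overshoot b[j].
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))  # [1, 3, 10, 12]
```

With 12 postings, List A gets skips of length 3, so its skip targets are 4, 7 and 10. After matching 3, the cursor on List A reaches 10 in two jumps (4 to 7, 7 to 10) instead of six single steps.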

Phrase search

Phrase queries

"ETH Zürich"
Quotes = phrase search

With a posting list

ETH: 1 2 4 5 8 9 10
Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms.

Phrase search approaches

Biword indices
Positional indices

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to, and so on

Query: ETH Zurich
The query is itself a biword, so it is looked up directly in the index.

Query: Help ETH Zurich to flexibly react
Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react
The postings lists of these biwords are intersected.

Drawback

Query: Help ETH Zurich to flexibly react
Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains every biword of the query, but not the phrase itself: false positive.
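The biword approach and its false positive can be reproduced with a small sketch. The document IDs 1 and 2, the lowercasing, and the function names are assumptions made for this illustration.

```python
from collections import defaultdict


def biwords(text):
    """Return the consecutive word pairs of a text (lowercased)."""
    words = text.lower().split()
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


def build_biword_index(docs):
    """Map each biword to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """Intersect the postings of all biwords of the phrase."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


# Hypothetical document IDs for the two example sentences.
docs = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
print(sorted(phrase_candidates(index, "Help ETH Zurich to flexibly react")))  # [1, 2]
```

Document 2 is returned although it does not contain the phrase: all five biwords of the query occur in it, just not next to each other.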

Part-of-speech tagging

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N (two N words separated by any number of X words)

Resulting biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
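As a rough sketch of the N X* N pattern, the example below replaces the part-of-speech tagger with a hard-coded set of X words. That set, and the function name `extended_biwords`, are assumptions that only cover this one sentence; a real system would use an actual tagger.

```python
# Hypothetical stand-in for a part-of-speech tagger: the X words of the example.
FUNCTION_WORDS = {"to", "so", "as"}


def extended_biwords(text):
    """Pair up consecutive N words, skipping any X words in between (N X* N)."""
    content_words = [word for word in text.split() if word.lower() not in FUNCTION_WORDS]
    return [f"{first} {second}" for first, second in zip(content_words, content_words[1:])]


sentence = "Help ETH Zurich to introduce techniques so as to flexibly react"
for biword in extended_biwords(sentence):
    print(biword)
```

Dropping the X words before pairing is equivalent to matching N X* N, because every pair of neighbouring N words is separated by zero or more X words.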

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react
The postings lists of these phrases are intersected.

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react
The postings lists of these phrases are intersected.

Improves on false positive issue

Query: Help ETH Zurich to flexibly react
Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative: the document does not contain "Zurich to flexibly", so it is no longer returned.

Drawback

The size of the vocabulary increases exponentially with the phrase length n: up to (number of terms)^n entries.

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

A positional posting stores the document ID (as before), the term frequency, and the positions of the term in the document.

Index (term: document ID, term frequency, positions)

Help: C, 1, positions 1
ETH: C, 1, positions 2
Zurich: C, 1, positions 3
to: C, 3, positions 4 7 11
flexibly: C, 1, positions 5
react: C, 1, positions 6

Positional index

Query: "ETH Zurich"

ETH: C, 1, positions 2
Zurich: C, 1, positions 3

2 + 1 = 3: in document C, Zurich occurs exactly one position after ETH, so the phrase matches.
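A minimal sketch of a positional index and of the "+1" check follows. The dictionary layout (term to document ID to list of positions), the lowercasing, and the function names are assumptions for this illustration; the term frequency is simply the length of the position list.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Return the IDs of documents where the phrase words are adjacent and in order."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return set()
    hits = set()
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # The k-th following word must sit exactly k positions after the first.
            if all(start + offset in index[word].get(doc_id, [])
                   for offset, word in enumerate(words[1:], start=1)):
                hits.add(doc_id)
                break
    return hits


docs = {"C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future"}
index = build_positional_index(docs)
print(index["to"]["C"])                    # [4, 7, 11]
print(phrase_query(index, "ETH Zurich"))   # {'C'}
print(phrase_query(index, "Zurich ETH"))   # set()
```

Unlike the biword index, the same structure answers phrases of any length without adding entries to the vocabulary.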

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

UTF-8

Character   Codepoint   Codepoint in binary   UTF-8 (variable length)
P           U+0050      101 0000              01010000
π           U+03C0      11 1100 0000          11001111 10000000
€           U+20AC      10 0000 1010 1100     11100010 10000010 10101100

At most 7 bits: 1 byte
At most 11 bits: 2 bytes
At most 16 bits: 3 bytes
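The three rows of the table can be checked with Python's built-in `str.encode`. The helper name `utf8_bits` is an assumption made for this example.

```python
def utf8_bits(char):
    """Return the UTF-8 encoding of a character as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in char.encode("utf-8"))


for char in ["P", "π", "€"]:
    codepoint = ord(char)
    # Character, codepoint, codepoint in binary, UTF-8 bytes in binary.
    print(f"{char}  U+{codepoint:04X}  {codepoint:b}  {utf8_bits(char)}")
```

The leading bits of the first byte (0, 110, 1110) announce how many bytes the character occupies, and every continuation byte starts with 10.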

Collecting documents: first challenge

Character encodings: ASCII, UTF-8, ISO-Latin-1
Finding out the encoding: user-defined, annotated in the document, machine learning
Software-specific encoding
Data-specific encoding: &amp; &lt;
Binary encoding
Classification (e.g. ML): language, encoding, type → UTF-8

Collecting documents: second challenge

Fine-grained documents (example: e-mail archive)
Grouping into a single document (example: LaTeX source)

Term Vocabulary

Building an inverted index

Document 1: You come most carefully upon your hour
Document 2: Take thy fair hour Laertes time be thine
Document 3: My hour is almost come
Document 4: Possess it merely That it should come to this

Throw away punctuation

This is not trivial:

name@example.com
isn't
Jake O'Neill
the learn-it-all-by-heart methodology
website vs. web site
Kanton Basel Stadt

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

Swiss German

I ha tänkt du heggish en kaffee welle trinke

Chinese

[Example sentence in Chinese characters; the text is not recoverable from the slide extraction.]

Hindi

मुझे इन्फ़र्मेशन रिट्रीवोल बहुत पसंद है
मुझे: "to me" (oblique case); है: 3rd person

मैं इस टर्म अच्छा पढ़ रहा हूँ
Male speaker: raha; हूँ: 1st person

मैं इस टर्म अच्छा पढ़ रही हूँ
Female speaker: rahee

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Components, in order: eat, go to, want, past, <frustration>, inferential/evidential ("turns out"), 3rd/3rd person, also

"Also, it turns out she/he wanted to go eat it, but..."

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

Tokenize

You come most carefully upon your hour
Take thy fair hour Laertes time be thine
My hour is almost come
Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class of tokens (same sequences)
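The difference between tokens and types can be made concrete on the four example documents. The sketch below assumes the simplest possible tokenizer, splitting on whitespace, with no normalization; the variable names are chosen for this example only.

```python
docs = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# A token is one occurrence of a character sequence in the text.
tokens = [word for doc in docs for word in doc.split()]

# A type is the class of all tokens made of the same character sequence.
types = set(tokens)

print(len(tokens))  # 29
print(len(types))   # 24
```

The 29 tokens collapse into 24 types because "hour" and "come" each occur three times and "it" occurs twice.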

Stop words

Reuters's list

a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with

Trend: from a large list (200-300) to no list at all

Equivalence classes of terms (types)

U.S.A., USA
to-day, today
window, windows, Windows
Zurich, Zürich, Zuerich, Züri

Normalization

U.S.A. → USA
to-day → today

Rules that remove characters are the easy part.

Expansion (rather than equivalence classes)

Windows → Windows
windows → windows, Windows, window
window → window, windows

Introduces asymmetry

When to expand

Upon indexing: the postings lists of Lift and Elevator are merged when the index is built, so that both terms point to the same expanded list (1 4 5 6). The query Lift is then answered with a single lookup.

Upon querying: the index keeps a separate postings list for each term, and the query Lift is expanded to Lift OR Elevator.
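The two options can be compared with a small sketch. The individual postings lists below are hypothetical (only their union, 1 4 5 6, is taken from the slide), and the synonym table and function names are assumptions for this illustration.

```python
# Hypothetical postings lists and synonym table, for illustration only.
index = {"lift": [1, 5], "elevator": [4, 6]}
synonyms = {"lift": ["elevator"], "elevator": ["lift"]}


def union(a, b):
    """Union of two postings lists, returned sorted."""
    return sorted(set(a) | set(b))


def expand_at_index_time(index, synonyms):
    """Build a new index where each term also carries its synonyms' postings."""
    expanded = {}
    for term, postings in index.items():
        merged = list(postings)
        for synonym in synonyms.get(term, []):
            merged = union(merged, index.get(synonym, []))
        expanded[term] = merged
    return expanded


def query_with_expansion(index, synonyms, term):
    """Leave the index untouched and evaluate: term OR synonym OR ..."""
    result = list(index.get(term, []))
    for synonym in synonyms.get(term, []):
        result = union(result, index.get(synonym, []))
    return result


print(expand_at_index_time(index, synonyms)["lift"])   # [1, 4, 5, 6]
print(query_with_expansion(index, synonyms, "lift"))   # [1, 4, 5, 6]
```

Both options return the same documents. Expanding at index time costs space, since the merged list is stored once per term, while expanding at query time costs an extra union for every query.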

Normalization: accents and diacritics

cliché → cliche
Zürich → Zuerich

Normalization: lowercasing

CAT → cat
Schweiz → schweiz

Normalization: truecasing

The company Apple has launched a new iProduct → Apple
Apples fall from trees → apples

Machine learning inside
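Accent removal and lowercasing can be sketched with Python's standard `unicodedata` module. The function names and the optional umlaut table are assumptions for this example; truecasing is left out, since it needs a trained model rather than a rule.

```python
import unicodedata

# Hypothetical language-specific rule: German umlauts are spelled out.
GERMAN_UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(token, german_umlauts=False):
    """Remove accents and lowercase; optionally spell out German umlauts first."""
    token = unicodedata.normalize("NFC", token)
    if german_umlauts:
        for umlaut, replacement in GERMAN_UMLAUTS.items():
            token = token.replace(umlaut, replacement)
    return strip_diacritics(token).lower()


print(normalize("cliché"))                       # cliche
print(normalize("Zürich", german_umlauts=True))  # zuerich
print(normalize("Zürich"))                       # zurich
print(normalize("CAT"))                          # cat
```

Generic accent stripping turns Zürich into zurich; obtaining zuerich requires the language-specific rule, which is one reason why normalization depends on knowing the language of the document.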

Stemming

Chop letters off the word

analysis → analysi
feature → featur
are → ar
easily → easili
visible → visibl
variations → variat
individual → individu

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE          valenci -> valence
(m>0) ANCI -> ANCE          hesitanci -> hesitance
(m>0) IZER -> IZE           digitizer -> digitize
(m>0) ABLI -> ABLE          conformabli -> conformable
(m>0) ALLI -> AL            radicalli -> radical
(m>0) ENTLI -> ENT          differentli -> different
(m>0) ELI -> E              vileli -> vile
(m>0) OUSLI -> OUS          analogousli -> analogous
(m>0) IZATION -> IZE        vietnamization -> vietnamize
(m>0) ATION -> ATE          predication -> predicate
(m>0) ATOR -> ATE           operator -> operate
(m>0) ALISM -> AL           feudalism -> feudal
(m>0) IVENESS -> IVE        decisiveness -> decisive
(m>0) FULNESS -> FUL        hopefulness -> hopeful
(m>0) OUSNESS -> OUS        callousness -> callous
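The rules listed above can be applied with a short sketch. It covers only these fifteen rules, not the full Porter algorithm, and the function names are assumptions for this example. The condition (m>0) refers to Porter's measure m of the remaining stem, where the stem is read as [C](VC)^m[V] with C a run of consonants and V a run of vowels.

```python
VOWELS = "aeiou"

# (suffix, replacement) pairs copied from the rule list above.
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    """Consonant: not a, e, i, o, u, and not a y that follows a consonant."""
    if word[i] in VOWELS:
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m


def apply_rules(word):
    """Apply the rule with the longest matching suffix, if its stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


print(apply_rules("vietnamization"))  # vietnamize
print(apply_rules("radicalli"))       # radical
print(apply_rules("nation"))          # nation
```

The last call shows the role of the condition: "nation" ends in ATION, but the remaining stem "n" has m = 0, so the word is left unchanged.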

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

The Vauquois Triangle

[Figure: the Vauquois triangle. Levels on each side, from bottom to top: words, syntactic structure, semantic structure, with the interlingua at the top. Arrows between the two sides: direct transfer, syntactic transfer, semantic transfer.]

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

Lemmatization or Stemming

Lemmatization does not help, and can even degrade performance, for English documents.

Lemmatization can help with languages that have more morphology.

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

35-43

UTF-8

Character   Codepoint   Codepoint in binary   UTF-8 (variable length)
P           U+0050      101 0000              01010000
π           U+03C0      11 1100 0000          11001111 10000000
€           U+20AC      10 0000 1010 1100     11100010 10000010 10101100

P fits in 7 bits: one byte
π fits in 11 bits: two bytes
€ fits in 16 bits: three bytes
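The three rows of the UTF-8 table can be checked with a few lines of Python. This is only a small sketch; the helper name `utf8_bits` is made up for the illustration.

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 bytes of one character as space-separated bit strings."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))


for ch in ["P", "π", "€"]:
    # Print the code point (as U+XXXX) next to its UTF-8 byte sequence.
    print(ch, f"U+{ord(ch):04X}", utf8_bits(ch))
```

The number of bytes grows with the number of bits needed for the code point, which is what "variable length" refers to.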

44

Collecting documents: first challenge

ASCII
UTF-8
ISO-Latin-1

45

Collecting documents: first challenge

User-defined
Annotated in document
Machine learning

46

Collecting documents: first challenge

Software-specific encoding

47

Collecting documents: first challenge

Data-specific encoding

&amp;
&lt;
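As a small illustration of data-specific encoding, HTML and XML text encodes some characters as entities, which have to be decoded before tokenizing. The sketch below uses Python's standard `html.unescape`; the wrapper name `decode_entities` is an assumption made for the example.

```python
import html


def decode_entities(text: str) -> str:
    """Turn HTML character entities such as &amp; and &lt; back into characters."""
    return html.unescape(text)


print(decode_entities("R&amp;D"))
print(decode_entities("a &lt; b"))
```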

48

Collecting documents: first challenge

Binary encoding

49

Collecting documents: first challenge

Classification (e.g. ML)

Language
Encoding
Type

UTF-8

50-51

Collecting documents: second challenge

Fine-grained documents
(Example: e-mail archive)

52-54

Collecting documents: second challenge

Grouping into a single document
(Example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour
Document 2: Take thy fair hour Laertes time be thine
Document 3: My hour is almost come
Document 4: Possess it merely That it should come to this

57

Building an inverted index

Throw away punctuation

58

Building an inverted index

Throw away punctuation: this is not trivial

name@example.com
isn't
Jake O'Neill
the learn-it-all-by-heart methodology
website vs. web site
Kanton Basel Stadt
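Why throwing away punctuation is not trivial can be seen with two naive tokenizers. Both functions below are hypothetical sketches written for this illustration, not the method of any particular system: one deletes punctuation, the other splits on it, and each mangles some of the examples above.

```python
import re


def strip_punctuation(text: str) -> list[str]:
    """Naive tokenizer 1: delete every character that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text).split()


def split_on_punctuation(text: str) -> list[str]:
    """Naive tokenizer 2: treat every non-word character as a separator."""
    return re.findall(r"\w+", text)


for sample in ["name@example.com", "isn't", "Jake O'Neill", "learn-it-all-by-heart"]:
    print(sample, strip_punctuation(sample), split_on_punctuation(sample))
```

Deleting punctuation fuses an e-mail address into one opaque string, while splitting on it breaks a surname such as O'Neill into two tokens; neither rule is right for all cases.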

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Chinese example sentence: characters lost in text extraction]

68

Hindi

मुझे इन्फ़ॉर्मेशन रिट्रीवल बहुत पसंद है

मैं इस टर्म अच्छा पढ़ रहा हूँ
मैं इस टर्म अच्छा पढ़ रही हूँ

69

Hindi

मुझे इन्फ़ॉर्मेशन रिट्रीवल बहुत पसंद है
मुझे: "to me" (oblique case); है: 3rd person

मैं इस टर्म अच्छा पढ़ रहा हूँ
रहा (raha): male speaker; हूँ: 1st person

मैं इस टर्म अच्छा पढ़ रही हूँ
रही (rahee): female speaker

70-79

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Glosses, in order:
eat
go to
want
Past
<Frustration>
inferential/evidential (turns out)
3rd/3rd
also

"Also, it turns out she/he wanted to go eat it, but..."

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour
Take thy fair hour Laertes time be thine
My hour is almost come
Possess it merely That it should come to this

81

Token

Token = grouped sequence of characters

82-83

Type

Type = equivalence class of tokens (same sequences)
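The difference between tokens and types can be counted directly on the four example documents. The snippet is a minimal sketch that assumes whitespace tokenization and case-sensitive comparison; the names `DOCUMENTS`, `tokens` and `type_counts` are chosen for the illustration.

```python
from collections import Counter

DOCUMENTS = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# Tokens: every occurrence counts. Types: identical sequences are grouped.
tokens = [token for document in DOCUMENTS for token in document.split()]
type_counts = Counter(tokens)

print(len(tokens), "tokens,", len(type_counts), "types")
print(type_counts.most_common(3))
```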

84

Stop words

You come most carefully upon your hour
Take thy fair hour Laertes time be thine
My hour is almost come
Possess it merely That it should come to this

85

Reuters' list

a an and are as at be by for from has he in is it its of on that the to was were will with
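Stop word removal with this list amounts to a set lookup. The sketch below assumes tokens are compared in lowercase; the names `REUTERS_STOP_WORDS` and `remove_stop_words` are made up for the example.

```python
REUTERS_STOP_WORDS = set(
    "a an and are as at be by for from has he in is it its "
    "of on that the to was were will with".split()
)


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only the tokens whose lowercase form is not in the stop list."""
    return [token for token in tokens if token.lower() not in REUTERS_STOP_WORDS]


print(remove_stop_words("Possess it merely That it should come to this".split()))
```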

86

Trend

Large list (200-300) -> No list at all

87

Equivalence classes of terms (types)

to-day, today
U.S.A., USA
window, windows, Windows
Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA
to-day -> today

Rules that remove characters are the easy part
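A character-removal rule of this kind can be sketched in a few lines. The functions `normalize` and `equivalence_classes` are hypothetical names, and the rule (drop periods and hyphens) is only the simple case shown on the slide.

```python
def normalize(token: str) -> str:
    """Map a token to its class representative by removing periods and hyphens."""
    return token.replace(".", "").replace("-", "")


def equivalence_classes(tokens: list[str]) -> dict[str, set[str]]:
    """Group tokens that normalize to the same string."""
    classes: dict[str, set[str]] = {}
    for token in tokens:
        classes.setdefault(normalize(token), set()).add(token)
    return classes


print(equivalence_classes(["U.S.A.", "USA", "to-day", "today"]))
```

Rules like this are implicit: the classes are never listed, they simply result from applying the same function at indexing time and at query time.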

89

Expansion (rather than equivalence classes)

Windows -> Windows
windows -> windows, Windows, window
window -> window, windows

Introduces asymmetry

90-93

When to expand

Upon indexing
Query: Lift (unchanged)
Index: the postings of Lift and Elevator are both expanded to their union (1 4 5 6)

Upon querying
Index: unchanged (separate postings for Lift and Elevator)
Query: Lift is expanded to Lift OR Elevator
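The two options can be compared in a short sketch. The postings used here (`lift`: 1 5 6, `elevator`: 1 4) are hypothetical values chosen so that the union is 1 4 5 6; the function names and the `SYNONYMS` table are likewise assumptions for the illustration.

```python
INDEX = {"lift": [1, 5, 6], "elevator": [1, 4]}  # hypothetical postings
SYNONYMS = {"lift": ["elevator"], "elevator": ["lift"]}


def expand_index(index: dict, synonyms: dict) -> dict:
    """Expansion upon indexing: each term also receives the postings of its synonyms."""
    expanded = {}
    for term, postings in index.items():
        documents = set(postings)
        for synonym in synonyms.get(term, []):
            documents |= set(index.get(synonym, []))
        expanded[term] = sorted(documents)
    return expanded


def expand_query(term: str, index: dict, synonyms: dict) -> list[int]:
    """Expansion upon querying: evaluate term OR synonym_1 OR ... on the unchanged index."""
    documents = set(index.get(term, []))
    for synonym in synonyms.get(term, []):
        documents |= set(index.get(synonym, []))
    return sorted(documents)


print(expand_index(INDEX, SYNONYMS))
print(expand_query("lift", INDEX, SYNONYMS))
```

Both routes return the same documents; the trade-off is a larger index in the first case against more work per query in the second.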

94

Normalization: accents and diacritics

cliché -> cliche
Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat
Schweiz -> schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple
Apples fall from trees -> apples

Machine learning inside
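Diacritic removal and lowercasing can be sketched with Python's standard `unicodedata` module. Note that generic Unicode decomposition turns ü into u, so the German convention ü -> ue needs an explicit table; the table `GERMAN_UMLAUTS` and the function names are assumptions made for this example. Truecasing is not covered, since it needs a trained model.

```python
import unicodedata

GERMAN_UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}


def strip_diacritics(term: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_term(term: str, german: bool = False) -> str:
    """Optionally transliterate umlauts, then strip diacritics and lowercase."""
    term = unicodedata.normalize("NFC", term)  # make sure umlauts are single characters
    if german:
        term = "".join(GERMAN_UMLAUTS.get(ch, ch) for ch in term)
    return strip_diacritics(term).lower()


print(normalize_term("cliché"), normalize_term("Zürich", german=True), normalize_term("CAT"))
```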

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the end of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer

(m>0) ENCI    -> ENCE    valenci        -> valence
(m>0) ANCI    -> ANCE    hesitanci      -> hesitance
(m>0) IZER    -> IZE     digitizer      -> digitize
(m>0) ABLI    -> ABLE    conformabli    -> conformable
(m>0) ALLI    -> AL      radicalli      -> radical
(m>0) ENTLI   -> ENT     differentli    -> different
(m>0) ELI     -> E       vileli         -> vile
(m>0) OUSLI   -> OUS     analogousli    -> analogous
(m>0) IZATION -> IZE     vietnamization -> vietnamize
(m>0) ATION   -> ATE     predication    -> predicate
(m>0) ATOR    -> ATE     operator       -> operate
(m>0) ALISM   -> AL      feudalism      -> feudal
(m>0) IVENESS -> IVE     decisiveness   -> decisive
(m>0) FULNESS -> FUL     hopefulness    -> hopeful
(m>0) OUSNESS -> OUS     callousness    -> callous
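The rules listed above can be tried out with a small sketch. It implements only these fifteen suffix rules and the condition m > 0, where m counts the vowel-consonant sequences in the remaining stem; it is not the complete Porter algorithm, and the names `RULES`, `measure` and `apply_rules` are chosen for the illustration. Input is assumed to be lowercase.

```python
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def measure(stem: str) -> int:
    """Porter's m: number of vowel-to-consonant transitions in [C](VC)^m[V]."""
    kinds = []
    for ch in stem:
        if ch in "aeiou":
            kinds.append("V")
        elif ch == "y" and kinds and kinds[-1] == "C":
            kinds.append("V")  # y after a consonant counts as a vowel
        else:
            kinds.append("C")
    return sum(1 for a, b in zip(kinds, kinds[1:]) if a == "V" and b == "C")


def apply_rules(word: str) -> str:
    """Apply the longest matching suffix rule if the remaining stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for word in ["valenci", "vietnamization", "operator", "hopefulness", "ration"]:
    print(word, "->", apply_rules(word))
```

The last word shows the role of the condition: the stem left after removing ATION from "ration" has m = 0, so the rule does not fire.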

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

[Figure: the Vauquois triangle. Levels on each side, bottom to top: Words, Syntactic Structure, Semantic Structure; apex: Interlingua. Transfers between the two sides: Direct, Syntactic, Semantic.]

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term        Document frequency   Postings
you         1                    1
come        3                    1 3 4
most        1                    1
carefully   1                    1
upon        1                    1
your        1                    1
hour        3                    1 2 3
take        1                    2
thy         1                    2
fair        1                    2
Laertes     1                    2
time        1                    2
be          1                    2
thine       1                    2
my          1                    3
is          1                    3
almost      1                    3
possess     1                    4
it          1                    4
merely      1                    4
that        1                    4
should      1                    4
to          1                    4
this        1                    4

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12
List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
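As a reminder of how the intersection proceeds, here is a minimal sketch of the two-pointer merge over sorted postings lists, run on lists A and B from the slide. The function name `intersect` is chosen for the illustration.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted postings lists by advancing the pointer at the smaller value."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.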

108-122

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B, built up step by step: 1, then 1 3, then 1 3 10, then 1 3 10 12

123

Another example

How could we make this faster?

124-126

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: List B is now at 10, so in List A we jump directly to 10 instead of visiting 4 to 9 one by one

Intersection of A and B: 1 3 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12
Skip pointers: 4 7 10

128-129

In practice

1 2 3 4 5 6 7 8 9 10 12

Too short skips: many comparisons, waste of space

130-131

In practice

1 2 3 4 5 6 7 8 9 10 12

Too long skips: few comparisons, not many real opportunities to skip

132

In practice

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

In practice: √p skip pointers, where p = number of postings
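Intersection with skip pointers can be sketched on plain Python lists. The assumption made here is that a skip pointer sits at every position that is a multiple of ⌊√p⌋ and points ⌊√p⌋ postings ahead; the function name `intersect_with_skips` is made up for the example.

```python
import math


def intersect_with_skips(a: list[int], b: list[int]) -> list[int]:
    """Two-pointer intersection that follows a skip pointer whenever it is safe to do so."""
    skip_a = max(1, math.isqrt(len(a)))  # spacing of skip pointers in a
    skip_b = max(1, math.isqrt(len(b)))  # spacing of skip pointers in b
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Follow the skip only if its target does not overshoot b[j].
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

On these two lists the pointer in A jumps from 4 to 7 and from 7 to 10, which is the effect of the "magical pointer" in the earlier example.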

133

Phrase search

134-135

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137-138

With a posting list

ETH: 1 2 4 5 8 9 10
Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139-141

Phrase search approaches

Biword indices
Positional indices

142-150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index:
Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

151-152

Bi-word indices

Query: ETH Zurich

Index:
Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

153-156

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157-165

Drawback

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains all five biwords but not the phrase itself: false positive

166-171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NX*N

Extended biwords:
Help ETH
ETH Zurich
Zurich introduce
introduce techniques
techniques flexibly
flexibly react
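Once the words are tagged, extracting the biwords that match the pattern NX*N is simple: drop the X words and pair up neighbouring N words. The sketch below assumes the tags are already given (N for 7 of the words, X for the 4 function words); the tagging itself is not implemented, and the name `extended_biwords` is chosen for the example.

```python
def extended_biwords(tagged: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Pair each N word with the next N word, ignoring any X words in between."""
    content_words = [word for word, tag in tagged if tag == "N"]
    return list(zip(content_words, content_words[1:]))


TAGGED = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
print(extended_biwords(TAGGED))
```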

172

Bi-word indices

Query: Help ETH Zurich to flexibly react

Index:
Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index:
Help ETH Zurich
ETH Zurich to
Zurich to flexibly
to flexibly react
flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index:
Help ETH Zurich to
ETH Zurich to flexibly
Zurich to flexibly react
to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative
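The false positive with biwords, and its disappearance with 3-word phrases, can be reproduced with a small sketch. It assumes whitespace tokenization and keeps postings as Python sets; the names `ngrams`, `build_index`, `phrase_query` and the document IDs 1 and 2 are made up for the illustration.

```python
def ngrams(text: str, n: int = 2) -> list[str]:
    """All sequences of n consecutive words in the text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(documents: dict[int, str], n: int = 2) -> dict[str, set[int]]:
    """Map every n-word sequence to the set of documents containing it."""
    index: dict[str, set[int]] = {}
    for doc_id, text in documents.items():
        for gram in ngrams(text, n):
            index.setdefault(gram, set()).add(doc_id)
    return index


def phrase_query(index: dict[str, set[int]], phrase: str, n: int = 2) -> set[int]:
    """AND together the postings of all n-word sequences of the phrase."""
    grams = ngrams(phrase, n)
    if not grams:
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result


DOCUMENTS = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
QUERY = "Help ETH Zurich to flexibly react"
print(phrase_query(build_index(DOCUMENTS, 2), QUERY, 2))  # document 2 is a false positive
print(phrase_query(build_index(DOCUMENTS, 3), QUERY, 3))
```

With n = 3 the sequence "Zurich to flexibly" is missing from document 2, so it is correctly rejected, at the price of a much larger vocabulary.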

176-178

Drawback

Size of the vocabulary increases exponentially: (number of terms)^n

179-189

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting: document ID (as before), term frequency, positions

Index:

Term       Document ID   Term frequency   Positions
Help       C             1                1
ETH        C             1                2
Zurich     C             1                3
to         C             3                4 7 11
flexibly   C             1                5
react      C             1                6

190-192

Positional index

Query: ETH Zurich

ETH        C             1                2
Zurich     C             1                3

Position of ETH + 1 = position of Zurich: match in document C
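A positional index and the "+1" check can be sketched as follows. Positions start at 1 as on the slides; whitespace tokenization, the function names, and the second document D are assumptions added for the illustration.

```python
def build_positional_index(documents: dict[str, str]) -> dict[str, dict[str, list[int]]]:
    """Map each term to {document ID: sorted list of positions}; term frequency is the list length."""
    index: dict[str, dict[str, list[int]]] = {}
    for doc_id, text in documents.items():
        for position, term in enumerate(text.split(), start=1):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index


def phrase_search(index: dict[str, dict[str, list[int]]], phrase: str) -> set[str]:
    """Documents in which the k-th phrase term occurs at position start + k for some start."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return set()
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    matches = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            if all(start + k in index[term][doc_id] for k, term in enumerate(terms)):
                matches.add(doc_id)
                break
    return matches


DOCUMENTS = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",  # hypothetical second document
}
INDEX = build_positional_index(DOCUMENTS)
print(INDEX["to"]["C"])
print(phrase_search(INDEX, "ETH Zurich"))
print(phrase_search(INDEX, "Help ETH Zurich to flexibly react"))
```

Unlike the biword index, the positional index rejects document D for the long query, because "flexibly" does not directly follow "to" at position 4 there.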

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B, built up step by step: 1, then 1 3, then 1 3 10, and finally 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: jump ahead in List A straight to 10

Intersection of A and B: 1 3 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip pointers to: 4 7 10

Two short skips: many comparisons, waste of space

Two long skips: few comparisons, not many real opportunities to skip

In practice: √p skip pointers, where p is the number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
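A possible sketch of intersection with skip pointers, assuming roughly √p evenly spaced skips per list. The helper names `build_skips` and `intersect_with_skips`, and the choice to store skips as an index-to-index dictionary, are assumptions made for this illustration.

```python
import math


def build_skips(postings):
    """Map a position to the position its skip pointer targets (about sqrt(p) skips)."""
    p = len(postings)
    step = max(1, math.isqrt(p))
    return {i: i + step for i in range(0, p - step, step)}


def intersect_with_skips(a, b):
    """Intersect two sorted postings lists, following skip pointers when safe."""
    skips_a, skips_b = build_skips(a), build_skips(b)
    result = []
    i, j = 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Follow the skip only if its target does not overshoot b[j].
            if i in skips_a and a[skips_a[i]] <= b[j]:
                i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                j = skips_b[j]
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))  # [1, 3, 10, 12]
```

With 12 postings the sketch places skips every 3 positions, so after matching 3 the pointer in List A jumps from 4 to 7 and then to 10 instead of visiting every posting.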

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index:

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

151

Bi-word indices

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Query: "ETH Zurich"

A two-word query is looked up directly as one biword.

Query: "Help ETH Zurich to flexibly react"

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

157

Inconvenient

Query: "Help ETH Zurich to flexibly react"

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
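The false positive can be reproduced with a small biword index. This is a minimal sketch under simplifying assumptions: whitespace tokenization, no normalization, and hypothetical document IDs 1 and 2; the function names are illustrative.

```python
from collections import defaultdict


def biwords(text):
    """Return all pairs of adjacent tokens as strings."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index


def phrase_query(index, phrase):
    """Intersect the postings of all biwords of the phrase."""
    postings = [index.get(bw, set()) for bw in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "Help ETH Zurich to flexibly react to new challenges",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
matches = phrase_query(index, "Help ETH Zurich to flexibly react")
print(sorted(matches))  # [1, 2]
```

Document 2 is returned although it does not contain the phrase: every biword of the query occurs somewhere in it, just not contiguously.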

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
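A short sketch of how the extended biwords could be derived from tagged tokens. The tags are supplied by hand here (no real tagger is called), and the function name `extended_biwords` is an assumption for illustration.

```python
def extended_biwords(tagged):
    """Pair each N token with the next N token, dropping the X tokens in between."""
    content = [token for token, tag in tagged if tag == "N"]
    return [f"{left} {right}" for left, right in zip(content, content[1:])]


# Hand-assigned tags following the slide: N for content words, X for the rest.
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
for bw in extended_biwords(tagged):
    print(bw)
```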

172

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

"Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Inconvenient

Size of the vocabulary increases exponentially with the phrase length n: up to (number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting: document ID (as before), term frequency, positions

Index:

Help: C, 1: 1
ETH: C, 1: 2
Zurich: C, 1: 3
to: C, 3: 4 7 11
flexibly: C, 1: 5
react: C, 1: 6

Query: "ETH Zurich"

ETH: C, 1: 2
Zurich: C, 1: 3

The position of Zurich is the position of ETH +1, so document C matches.
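The positional lookup can be sketched as follows. This is an illustration under assumptions: whitespace tokenization, positions counted from 1, and a second hypothetical document "D" added to show a non-match; the function names are not from the lecture.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, with positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Return documents where the terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(t not in index for t in terms):
        return set()
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    matches = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            # The k-th term must sit exactly k positions after the first one.
            if all(start + k in index[t][doc_id] for k, t in enumerate(terms)):
                matches.add(doc_id)
                break
    return matches


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)
print(index["to"]["C"])  # [4, 7, 11]
print(phrase_query(index, "Help ETH Zurich to flexibly react"))  # {'C'}
```

Unlike the biword approach, document D is correctly rejected, because "flexibly" does not directly follow "to" at the required position.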

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

37

UTF-8

Character   Codepoint   Codepoint in binary    UTF-8 (variable length)
P           U+0050      101 0000               01010000
π           U+03C0      11 1100 0000           11001111 10000000
€           U+20AC      10 0000 1010 1100      11100010 10000010 10101100

At most 7 bits: 1 byte

At most 11 bits: 2 bytes

At most 16 bits: 3 bytes
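The byte sequences for these three characters can be checked with Python's built-in `str.encode`. The helper name `utf8_bits` is an assumption for this sketch.

```python
def utf8_bits(ch):
    """Return the UTF-8 encoding of a character as space-separated bytes in binary."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))


for ch in ["P", "π", "€"]:
    # Columns: character, codepoint, codepoint in binary, UTF-8 bytes.
    print(ch, f"U+{ord(ch):04X}", f"{ord(ch):b}", utf8_bits(ch))
```

The leading bits of the first byte (0, 110, 1110) announce how many bytes the character occupies, and every continuation byte starts with 10.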

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents: first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification (e.g., ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation
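A minimal sketch of building the inverted index for these four documents. It assumes whitespace tokenization and lowercasing, which the slides do not prescribe at this point; `build_inverted_index` is an illustrative name.

```python
from collections import defaultdict

docs = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # A set removes duplicates within a document (e.g. "it" in document 4).
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(docs)
print(index["come"])  # [1, 3, 4]
print(index["hour"])  # [1, 2, 3]
```

Visiting documents in ascending ID order keeps every postings list sorted, which the merge-based intersection relies on.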

58

Building an inverted index

This is not trivial

Throw away punctuation

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt
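To see why throwing away punctuation is not trivial, here is a deliberately naive sketch that deletes every character that is neither a word character nor whitespace. The function name `strip_punctuation` is an assumption, and this is meant as a counterexample rather than a recommended tokenizer.

```python
import re


def strip_punctuation(text):
    """Naively delete everything that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text)


examples = [
    "name@example.com",
    "isn't",
    "Jake O'Neill",
    "the learn-it-all-by-heart methodology",
]
for example in examples:
    print(repr(example), "->", repr(strip_punctuation(example)))
```

The e-mail address collapses into a single meaningless token, the contraction and the surname lose their apostrophes, and the hyphenated compound fuses into one long word.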

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

(Chinese example sentence; characters not recoverable)

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morpheme by morpheme: eat, go to, want, Past, <Frustration>, inferential/evidential (turns out), 3rd/3rd, also

Also, it turns out she/he wanted to go eat it, but...

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token

Token = grouped sequence of characters

Type

Type = equivalence class (same sequences)

Stop words

85

Reuters' list

a an and are as at be by for from has he in is it its of on that the to was were will with
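The difference between tokens, types, and stop words can be illustrated on the four example documents. This sketch assumes lowercasing and whitespace tokenization; the variable names are illustrative.

```python
STOP_WORDS = set(
    "a an and are as at be by for from has he in is it its "
    "of on that the to was were will with".split()
)

docs = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# Tokens: every occurrence counts. Types: distinct character sequences.
tokens = [token for text in docs for token in text.lower().split()]
types = set(tokens)
content_types = types - STOP_WORDS

print(len(tokens), len(types), len(content_types))
print(sorted(types & STOP_WORDS))
```

Only a handful of the types in this tiny collection are on the stop list, which hints at why the list saves relatively little vocabulary while it can hurt phrase queries.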

86

Trend

Large list (200-300) → No list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. → USA

to-day → today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows → Windows

windows → windows, Windows, window

window → window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings lists of Lift and Elevator are expanded with each other's documents, so the query Lift is answered from a single list

Upon querying: the index keeps separate postings lists for Lift and Elevator, and the query Lift is expanded to Lift OR Elevator
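The two options can be contrasted in a small sketch. The postings (Lift in documents 1 and 5, Elevator in 4 and 6), the synonym table, and both function names are hypothetical values chosen for illustration.

```python
# Hypothetical postings and synonym table, for illustration only.
index = {"lift": [1, 5], "elevator": [4, 6]}
synonyms = {"lift": ["elevator"], "elevator": ["lift"]}


def expand_index(index, synonyms):
    """Expansion upon indexing: merge each term's postings with its synonyms'."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for syn in synonyms.get(term, []):
            merged |= set(index.get(syn, []))
        expanded[term] = sorted(merged)
    return expanded


def query_with_expansion(index, synonyms, term):
    """Expansion upon querying: evaluate term OR synonym on the original index."""
    result = set(index.get(term, []))
    for syn in synonyms.get(term, []):
        result |= set(index.get(syn, []))
    return sorted(result)


print(expand_index(index, synonyms)["lift"])          # [1, 4, 5, 6]
print(query_with_expansion(index, synonyms, "lift"))  # [1, 4, 5, 6]
```

Both routes return the same documents. The trade-off is where the cost lands: larger postings lists at indexing time versus an extra union at query time.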

94

Normalization: accents and diacritics

cliché → cliche

Zürich → Zuerich

95

Normalization: lowercasing

CAT → cat

Schweiz → schweiz
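A sketch of these normalization steps using Python's standard `unicodedata` module. The German transliteration table and the function names are assumptions: they are added so that Zürich becomes Zuerich, as on the slide, rather than Zurich.

```python
import unicodedata

# Assumed German-specific transliteration, applied before generic accent removal.
GERMAN = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss",
})


def strip_diacritics(text):
    """Transliterate German umlauts, then drop the remaining combining marks."""
    text = unicodedata.normalize("NFC", text).translate(GERMAN)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def normalize(term):
    """Remove diacritics and lowercase."""
    return strip_diacritics(term).lower()


print(strip_diacritics("cliché"))  # cliche
print(strip_diacritics("Zürich"))  # Zuerich
print(normalize("CAT"), normalize("Schweiz"))  # cat schweiz
```

Without the German table, generic accent stripping would map Zürich to Zurich, which illustrates that such rules are language dependent.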

96

Normalization: truecasing

The company Apple has launched a new iProduct → Apple

Apples fall from trees → apples

Machine learning inside

97

Stemming

analysis → analysi
feature → featur
are → ar
easily → easili
visible → visibl
variations → variat
individual → individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE          valenci -> valence
(m>0) ANCI -> ANCE          hesitanci -> hesitance
(m>0) IZER -> IZE           digitizer -> digitize
(m>0) ABLI -> ABLE          conformabli -> conformable
(m>0) ALLI -> AL            radicalli -> radical
(m>0) ENTLI -> ENT          differentli -> different
(m>0) ELI -> E              vileli -> vile
(m>0) OUSLI -> OUS          analogousli -> analogous
(m>0) IZATION -> IZE        vietnamization -> vietnamize
(m>0) ATION -> ATE          predication -> predicate
(m>0) ATOR -> ATE           operator -> operate
(m>0) ALISM -> AL           feudalism -> feudal
(m>0) IVENESS -> IVE        decisiveness -> decisive
(m>0) FULNESS -> FUL        hopefulness -> hopeful
(m>0) OUSNESS -> OUS        callousness -> callous
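The rules listed above can be tried out with a small sketch. It covers only these fifteen rules, not the full Porter algorithm, and the names `RULES`, `is_consonant`, `measure`, and `apply_rules` are illustrative assumptions. The measure m counts the vowel-consonant sequences in the stem that remains once the suffix is removed.

```python
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u, or y after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC sequences, i.e. m in the form [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_vowel:
            m += 1
        prev_vowel = not consonant
    return m


def apply_rules(word):
    """Apply the longest matching suffix rule if the remaining stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for word in ["valenci", "digitizer", "vietnamization", "hopefulness"]:
    print(word, "->", apply_rules(word))
```

Trying the longest suffix first matters for a word such as vietnamization, which ends in both IZATION and ATION.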

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois Triangle

Levels, from base to apex: Words, Syntactic Structure, Semantic Structure, Interlingua

Paths between the two sides: Direct, Syntactic Transfer, Semantic Transfer

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing


38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents: first challenge

Encoding: ASCII, UTF-8, ISO-Latin-1

Determined by: user-defined, annotated in document, machine learning

Software-specific encoding

Data-specific encoding: &amp; &lt;

Binary encoding

Classification (e.g., ML): language, encoding, type -> UTF-8
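As a small illustration of the first challenge, the sketch below decodes raw bytes by trying candidate encodings in order, then undoes a data-specific encoding (HTML entities). The helper name `decode` and the choice of candidate encodings are assumptions made for this example; a production system would rather classify the encoding, as the slide suggests.

```python
import html

# The same word stored in two different encodings.
raw_latin1 = "Zürich".encode("iso-8859-1")
raw_utf8 = "Zürich".encode("utf-8")


def decode(raw: bytes, encodings=("utf-8", "iso-8859-1")) -> str:
    """Return the text for the first candidate encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fits")


# Both byte strings end up as the same Unicode text.
assert decode(raw_latin1) == decode(raw_utf8) == "Zürich"

# Data-specific encoding: HTML entities must be resolved before tokenizing.
print(html.unescape("AT&amp;T &lt;b&gt;"))
```

Trying UTF-8 first works here because the ISO-Latin-1 byte for "ü" is not a valid UTF-8 sequence; the fallback order matters, since ISO-Latin-1 accepts any byte string.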

Collecting documents: second challenge

Fine-grained documents (example: e-mail archive)

Grouping into a single document (example: LaTeX source)

Term Vocabulary

Building an inverted index

Document 1: You come most carefully upon your hour
Document 2: Take thy fair hour Laertes time be thine
Document 3: My hour is almost come
Document 4: Possess it merely That it should come to this

Throw away punctuation

This is not trivial:

name@example.com
isn't
Jake O'Neill
the learn-it-all-by-heart methodology
website vs. web site
Kanton Basel-Stadt
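Why throwing away punctuation is not trivial can be seen by trying the two most naive strategies on such examples. This is a sketch only; the function names `tokenize_drop` and `tokenize_split` are made up for illustration and neither is proposed as a good tokenizer.

```python
import string

# Strategy 1: delete every punctuation character.
_DROP = str.maketrans("", "", string.punctuation)
# Strategy 2: turn every punctuation character into a space.
_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def tokenize_drop(text: str) -> list[str]:
    return text.translate(_DROP).split()


def tokenize_split(text: str) -> list[str]:
    return text.translate(_TO_SPACE).split()


for example in ["name@example.com", "isn't", "Jake O'Neill", "Kanton Basel-Stadt"]:
    print(example, tokenize_drop(example), tokenize_split(example))
```

Deleting punctuation glues parts together ("nameexamplecom", "ONeill"), while splitting on it tears units apart ("isn", "t"), so neither rule fits all cases.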

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

Swiss German

I ha tänkt du heggish en kaffee welle trinke

Chinese

(Chinese example sentence; characters lost in extraction)

Hindi

मुझे इन्फ़र्मेशन रिट्रीवोल बहुत पसंद है

मुझे: "to me" (oblique case); है: 3rd person

मैं इस टर्म अच्छा पढ़ रहा हूँ (male speaker: raha)
मैं इस टर्म अच्छा पढ़ रही हूँ (female speaker: rahee)

हूँ: 1st person

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Parts of the word, in order:

eat
go to
want
past
<frustration>
inferential evidential ("turns out")
3rd/3rd person
also

"Also, it turns out she/he wanted to go eat it, but..."

Source: W. J. De Reuse, Polysynthetic Language: Central Siberian Yupik

Tokenize

You come most carefully upon your hour
Take thy fair hour Laertes time be thine
My hour is almost come
Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class of tokens (same sequence of characters)
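The difference between tokens and types can be made concrete on the four example documents. The sketch below assumes the simplest possible tokenizer (splitting on whitespace, case-sensitive); the variable names are illustrative.

```python
from collections import Counter

documents = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# Tokens: every occurrence counts, duplicates included.
tokens = [token for doc in documents for token in doc.split()]

# Types: one entry per distinct character sequence, with its number of tokens.
types = Counter(tokens)

print(len(tokens), "tokens,", len(types), "types")
print(types.most_common(3))
```

Each type groups all tokens made of the same character sequence, which is why "hour" is a single type with three tokens.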

Stop words

Reuters's list:

a, an, and, are, as, at, be, by, for, from, has, he, in,
is, it, its, of, on, that, the, to, was, were, will, with

Trend: large list (200-300) -> no list at all
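Applying such a stop list is a simple filter. The sketch below uses the 25 words of the Reuters list; lowercasing before the lookup is an assumption of this example, not something the slide specifies.

```python
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
    "has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
    "to", "was", "were", "will", "with",
}


def remove_stop_words(text: str) -> list[str]:
    """Lowercase, split on whitespace, and drop tokens found in the stop list."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


print(remove_stop_words("Possess it merely That it should come to this"))
print(remove_stop_words("My hour is almost come"))
```

Note that "this" and "my" survive because they are not on this particular list, which hints at why the choice of list is a judgment call.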

Equivalence classes of terms (types)

to-day, today
U.S.A., USA
window, windows, Windows
Zurich, Zürich, Zuerich, Züri

Normalization

U.S.A. -> USA
to-day -> today

Rules that remove characters are the easy part

Expansion (rather than equivalence classes)

Windows -> Windows
windows -> windows, Windows, window
window -> window, windows

Introduces asymmetry

When to expand

Upon indexing: the postings lists are expanded, so a document containing "Lift" is also added to the list for "Elevator". The query "Lift" is looked up unchanged.

Upon querying: the index keeps separate lists for "Lift" and "Elevator", and the query "Lift" is expanded to "Lift OR Elevator".
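The two options for when to expand can be compared side by side. This is a minimal sketch under stated assumptions: the synonym table, the document IDs and the one-word documents are all hypothetical and do not reproduce the numbers on the slide.

```python
# Hypothetical synonym table and one-word documents, for illustration only.
SYNONYMS = {"lift": {"elevator"}, "elevator": {"lift"}}
docs = {1: "lift", 4: "elevator", 5: "lift", 6: "lift"}


def expand(term: str) -> set[str]:
    return {term} | SYNONYMS.get(term, set())


def build_index(docs: dict[int, str], expand_terms: bool = False) -> dict[str, set[int]]:
    """Build postings; with expand_terms, file each document under its synonyms too."""
    index: dict[str, set[int]] = {}
    for doc_id, text in docs.items():
        for term in text.split():
            for target in (expand(term) if expand_terms else {term}):
                index.setdefault(target, set()).add(doc_id)
    return index


def query(index: dict[str, set[int]], term: str, expand_terms: bool = False) -> list[int]:
    """Look up a term; with expand_terms, rewrite it to an OR over its synonyms."""
    result: set[int] = set()
    for target in (expand(term) if expand_terms else {term}):
        result |= index.get(target, set())
    return sorted(result)


at_indexing = query(build_index(docs, expand_terms=True), "lift")
at_querying = query(build_index(docs), "lift", expand_terms=True)
print(at_indexing, at_querying)
```

Both variants return the same documents; expanding upon indexing costs space in the postings lists, while expanding upon querying costs an extra lookup per synonym.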

Normalization: accents and diacritics

cliché -> cliche
Zürich -> Zuerich

Normalization: lowercasing

CAT -> cat
Schweiz -> schweiz

Normalization: truecasing

"The company Apple has launched a new iProduct" -> Apple
"Apples fall from trees" -> apples

Machine learning inside
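Lowercasing and diacritic removal can be sketched with the standard library. The function names and the small German transliteration table below are assumptions of this example; the point is that generic Unicode decomposition turns "ü" into "u", so a mapping such as Zürich -> Zuerich needs a language-specific rule.

```python
import unicodedata

# Language-specific rule (assumed for this sketch): German umlauts and sharp s.
GERMAN = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def strip_diacritics(term: str) -> str:
    """Decompose characters and drop the combining marks (accents, diaereses)."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(term: str, german: bool = False) -> str:
    # Compose first so that precomposed letters such as "ü" match the table.
    term = unicodedata.normalize("NFC", term).lower()
    if german:
        term = term.translate(GERMAN)
    return strip_diacritics(term)


print(normalize("cliché"), normalize("Zürich"), normalize("Zürich", german=True))
```

Truecasing is deliberately left out: deciding between "Apple" and "apples" needs context, which is why the slide points to machine learning.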

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the word

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

Condition   Rule                 Example
(m>0)       ENCI    -> ENCE      valenci        -> valence
(m>0)       ANCI    -> ANCE      hesitanci      -> hesitance
(m>0)       IZER    -> IZE       digitizer      -> digitize
(m>0)       ABLI    -> ABLE      conformabli    -> conformable
(m>0)       ALLI    -> AL        radicalli      -> radical
(m>0)       ENTLI   -> ENT       differentli    -> different
(m>0)       ELI     -> E         vileli         -> vile
(m>0)       OUSLI   -> OUS       analogousli    -> analogous
(m>0)       IZATION -> IZE       vietnamization -> vietnamize
(m>0)       ATION   -> ATE       predication    -> predicate
(m>0)       ATOR    -> ATE       operator       -> operate
(m>0)       ALISM   -> AL        feudalism      -> feudal
(m>0)       IVENESS -> IVE       decisiveness   -> decisive
(m>0)       FULNESS -> FUL       hopefulness    -> hopeful
(m>0)       OUSNESS -> OUS       callousness    -> callous
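The rules listed on the Porter Stemmer slide can be tried out with a short sketch. It covers only these fifteen rules, not the full Porter algorithm; the names `RULES`, `measure` and `apply_rules` are made up for this example. The measure m counts the vowel-consonant sequences in the stem that remains once the suffix is removed, following Porter's definition.

```python
import re

# Only the rules shown on the slide (a subset of one step of the algorithm).
RULES = {
    "enci": "ence", "anci": "ance", "izer": "ize", "abli": "able",
    "alli": "al", "entli": "ent", "eli": "e", "ousli": "ous",
    "ization": "ize", "ation": "ate", "ator": "ate", "alism": "al",
    "iveness": "ive", "fulness": "ful", "ousness": "ous",
}


def is_consonant(word: str, i: int) -> bool:
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # "y" is a vowel when it follows a consonant, a consonant otherwise.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Number m of vowel-consonant sequences in a stem of the form [C](VC)^m[V]."""
    pattern = "".join("c" if is_consonant(stem, i) else "v" for i in range(len(stem)))
    return len(re.findall(r"v+c+", pattern))


def apply_rules(word: str) -> str:
    """Apply the longest matching suffix rule if the remaining stem has m > 0."""
    matches = [suffix for suffix in RULES if word.endswith(suffix)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    stem = word[: -len(suffix)]
    return stem + RULES[suffix] if measure(stem) > 0 else word


for word in ["valenci", "vietnamization", "hopefulness", "ration"]:
    print(word, "->", apply_rules(word))
```

The last word shows the role of the condition: "ration" ends in ATION, but the remaining stem "r" has m = 0, so the word is left unchanged.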

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

The Vauquois Triangle

(Figure: words, syntactic structure, semantic structure and interlingua on each side of the triangle; paths: direct, syntactic transfer, semantic transfer)

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

Lemmatization can help with languages that have more morphology

Skip lists

Reminder: Postings (Standard Inverted Index)

Term        Doc. frequency   Postings
you         1                1
come        3                1 3 4
most        1                1
carefully   1                1
upon        1                1
your        1                1
thine       1                2
be          1                2
time        1                2
Laertes     1                2
thy         1                2
fair        1                2
take        1                2
my          1                3
hour        3                1 2 3
is          1                3
almost      1                3
possess     1                4
merely      1                4
that        1                4
it          1                4
should      1                4
to          1                4
this        1                4
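The standard inverted index for the four example documents can be rebuilt in a few lines. This sketch assumes whitespace tokenization and lowercasing as the only preprocessing; the function name `build_inverted_index` is illustrative.

```python
documents = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to the sorted list of IDs of the documents containing it."""
    index: dict[str, set[int]] = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


index = build_inverted_index(documents)
for term in ["come", "hour", "it"]:
    # Document frequency is the length of the postings list.
    print(term, len(index[term]), index[term])
```

A term that occurs twice in one document, such as "it" in document 4, still contributes a single posting.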

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12
List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
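The intersection of two sorted postings lists is a linear merge with one pointer per list. A minimal sketch, with an illustrative function name:

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted postings lists by advancing two pointers."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot occur in b any more
        else:
            j += 1  # b[j] cannot occur in a any more
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.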

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

Another example

How could we make this faster?

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: the next posting in List B is 10, so List A jumps directly to 10 instead of stepping through 4 to 9.

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip pointers: 4 7 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skip pointers, where p is the number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
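The "magical pointer" idea can be sketched as an intersection that follows a skip pointer whenever the skip target does not overshoot the other list's current posting. The placement of the pointers (every floor(√p) positions) and the names `skip_targets` and `intersect_with_skips` are assumptions of this example, not the lecture's exact layout.

```python
import math


def skip_targets(postings: list[int]) -> dict[int, int]:
    """Place evenly spaced skip pointers: position -> position it skips to."""
    step = math.isqrt(len(postings))
    if step < 2:
        return {}
    return {i: i + step for i in range(0, len(postings) - step, step)}


def intersect_with_skips(a: list[int], b: list[int]) -> list[int]:
    skips_a, skips_b = skip_targets(a), skip_targets(b)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Follow the skip only if it does not jump past b[j].
            if i in skips_a and a[skips_a[i]] <= b[j]:
                i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                j = skips_b[j]
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

With these lists, once 1 and 3 are matched, List A moves from 4 to 7 to 10 in two skips instead of six single steps. The result is the same as without skips; only the number of steps changes.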

Phrase search

Phrase queries

With a posting list

"ETH Zürich"

Quotes = phrase search

ETH: 1 2 4 5 8 9 10
Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

Phrase search approaches

Biword indices
Positional indices

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (first biwords):

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Query: "ETH Zurich" -> look up the biword ETH Zurich

Query: "Help ETH Zurich to flexibly react" ->

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

Inconvenient

Query: "Help ETH Zurich to flexibly react"

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains every biword of the query although the phrase itself does not occur in it.

False positive
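The false positive can be reproduced with a small phrase index over word n-grams. This is a sketch under stated assumptions: the document IDs 1 and 2 and the function names are made up, and n = 2 gives a biword index while n = 3 indexes three-word phrases.

```python
from collections import defaultdict

docs = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}


def ngrams(words: list[str], n: int) -> list[tuple[str, ...]]:
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_phrase_index(docs: dict[int, str], n: int) -> dict[tuple[str, ...], set[int]]:
    """Index every sequence of n consecutive words as one vocabulary entry."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text.split(), n):
            index[gram].add(doc_id)
    return index


def phrase_query(index, phrase: str, n: int) -> set[int]:
    """AND together the postings of all n-grams of the phrase."""
    postings = [index.get(gram, set()) for gram in ngrams(phrase.split(), n)]
    return set.intersection(*postings) if postings else set()


phrase = "Help ETH Zurich to flexibly react"
print(phrase_query(build_phrase_index(docs, 2), phrase, 2))
print(phrase_query(build_phrase_index(docs, 3), phrase, 3))
```

With biwords, document 2 is returned although it does not contain the phrase; with three-word entries, the missing entry "Zurich to flexibly" excludes it.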

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help (N), ETH (N), Zurich (N), to (X), introduce (N), techniques (N), so (X), as (X), to (X), flexibly (N), react (N)

Pattern: NX*N

Resulting biwords:

Help ETH
ETH Zurich
Zurich introduce
introduce techniques
techniques flexibly
flexibly react
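Given the tags, the extended biwords follow mechanically: every N word is paired with the next N word, skipping any X words in between. In the sketch below the tag assignment is an assumption read off the resulting biwords on the slide, and no real part-of-speech tagger is involved.

```python
# (word, tag) pairs; the tags are assumed for illustration, not produced by a tagger.
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]


def extended_biwords(tagged: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Pair each N word with the following N word, ignoring X words (pattern NX*N)."""
    content = [word for word, tag in tagged if tag == "N"]
    return list(zip(content, content[1:]))


for first, second in extended_biwords(tagged):
    print(first, second)
```

Dropping the X words before pairing is equivalent to matching N X* N, because any run of X words between two N words is skipped.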


Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index:

Help ETH Zurich
ETH Zurich to
Zurich to flexibly
to flexibly react
flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index:

Help ETH Zurich to
ETH Zurich to flexibly
Zurich to flexibly react
to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

Inconvenient

Size of the vocabulary increases exponentially with the phrase length n: (#Terms)^n

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future (document C)

Index:

Term       Document ID   Term frequency   Positions
Help       C             1                1
ETH        C             1                2
Zurich     C             1                3
to         C             3                4 7 11
flexibly   C             1                5
react      C             1                6

Positional posting: document ID (as before), term frequency, positions

Query: "ETH Zurich"

ETH: document C, position 2
Zurich: document C, position 3

The position of Zurich is the position of ETH + 1, so the phrase occurs in document C.
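A positional index and the "+1" check can be sketched as follows. The document ID "D" for the second sentence and the function names are assumptions of this example; positions start at 1 as on the slides, and the term frequency is simply the length of the position list.

```python
from collections import defaultdict

docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}


def build_positional_index(docs: dict[str, str]) -> dict[str, dict[str, list[int]]]:
    """Map term -> document ID -> list of positions (1-based) of the term."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_search(index, phrase: str) -> set[str]:
    """Return documents where the k-th phrase term occurs at start position + k."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            if all(start + k in index[term].get(doc_id, [])
                   for k, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


index = build_positional_index(docs)
print(index["to"]["C"])
print(phrase_search(index, "ETH Zurich"))
print(phrase_search(index, "Help ETH Zurich to flexibly react"))
```

Unlike the biword index, the positional index rejects document D for the long phrase, and its vocabulary stays the same size as that of the standard index; the cost moves into longer postings.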

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124-126

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: once 3 has been matched, a pointer lets the scan of List A jump straight to 10, the next candidate from List B, and the intersection grows to 1 3 10

127-132

In practice

Postings list: 1 2 3 4 5 6 7 8 9 10 12, with skip pointers to 4, 7 and 10

Two short skips: many comparisons, waste of space

Two long skips: few comparisons, not many real opportunities to skip

In practice: √p skip pointers, where p is the number of postings

Postings list: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
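The trade-off between short and long skips can be tried out with a small sketch that places roughly √p evenly spaced skip pointers on each list. The helper names `build_skips` and `intersect_with_skips`, and the choice to store skips as a dictionary from position to target position, are assumptions made for this illustration.

```python
import math


def build_skips(postings):
    """Return {position: skip target position} with a stride of about sqrt(p)."""
    p = len(postings)
    step = int(math.sqrt(p))
    if step < 2:
        return {}  # list too short for skips to be useful
    return {i: i + step for i in range(0, p - step, step)}


def intersect_with_skips(a, b):
    """Linear merge that follows a skip pointer whenever it is safe to do so."""
    skips_a, skips_b = build_skips(a), build_skips(b)
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Safe to skip only if the target does not overshoot b[j].
            if i in skips_a and a[skips_a[i]] <= b[j]:
                i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                j = skips_b[j]
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))  # [1, 3, 10, 12]
```

With 12 postings the stride is 3, so List A gets pointers from 1 to 4, from 4 to 7 and from 7 to 10, which is what lets the scan move from 4 to 10 in two jumps.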

133

Phrase search

134-135

Phrase queries

136-138

With a posting list

Query: "ETH Zürich" (quotes = phrase search)

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139-141

Phrase search approaches

Biword indices

Positional indices

142-150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (one entry per pair of consecutive words):

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151-156

Bi-word indices

Query: ETH Zurich

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Query: Help ETH Zurich to flexibly react

Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect the postings lists of these bi-words

157-165

Inconvenient

Query: Help ETH Zurich to flexibly react

Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive: the document contains every bi-word of the query, but not the phrase itself
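The bi-word approach and its false positive can be reproduced with a short sketch. The document IDs 1 and 2, the whitespace tokenization with lowercasing, and the function names are assumptions made for this example.

```python
from collections import defaultdict


def biwords(text):
    """Return the pairs of consecutive words of a text, lowercased."""
    words = text.lower().split()
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


def build_biword_index(docs):
    """Map every bi-word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index


def phrase_query(index, phrase):
    """Rewrite the phrase as an AND of its bi-words and intersect the postings."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
print(sorted(phrase_query(index, "ETH Zurich")))
print(sorted(phrase_query(index, "Help ETH Zurich to flexibly react")))
```

The second query returns both documents: document 2 never contains the phrase, but it does contain each of the five bi-words, which is the false positive described above.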

166-171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Resulting bi-words: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react

172

Bi-word indices

Query: Help ETH Zurich to flexibly react

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react, then intersect

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

Rewritten as: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react, then intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

Rewritten as: Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react, then intersect

175

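Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Rewritten as: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative: the document does not contain "Zurich to flexibly", so it is correctly rejected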

176-178

Inconvenient

Size of the vocabulary increases exponentially: (number of terms)^n for phrases of n words

179-189

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Each positional posting holds the document ID (as before), the term frequency, and the positions:

Help: C, 1: 1

ETH: C, 1: 2

Zurich: C, 1: 3

to: C, 3: 4 7 11

flexibly: C, 1: 5

react: C, 1: 6

190-192

Positional index

Query: ETH Zurich

ETH: C, 1: 2

Zurich: C, 1: 3

The position of Zurich (3) is the position of ETH (2) +1, so document C contains the phrase
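The "+1" check on positions can be sketched as follows. The second document "D", the whitespace tokenization with lowercasing, and the function names are assumptions made for this example; positions start at 1 as on the slides.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {document ID: list of positions}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Return the documents in which the words occur at consecutive positions."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return set()
    # Candidate documents must contain every word of the phrase.
    candidates = set.intersection(*(set(index[word]) for word in words))
    hits = set()
    for doc_id in candidates:
        for start in index[words[0]][doc_id]:
            # The k-th word of the phrase must sit at position start + k.
            if all(start + k in index[word][doc_id] for k, word in enumerate(words)):
                hits.add(doc_id)
                break
    return hits


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)
print(index["to"]["C"])  # [4, 7, 11]
print(sorted(phrase_query(index, "ETH Zurich")))
print(sorted(phrase_query(index, "Help ETH Zurich to flexibly react")))
```

Unlike the bi-word index, the last query returns only document C, because in document D "flexibly" does not directly follow "to" at the required position.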

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

40-43

UTF-8

Character   Codepoint   Codepoint in binary   UTF-8 (variable length)
P           U+0050      101 0000              01010000                     (up to 7 bits)
π           U+03C0      11 1100 0000          11001111 10000000            (up to 11 bits)
€           U+20AC      10 0000 1010 1100     11100010 10000010 10101100   (up to 16 bits)
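The three rows of the UTF-8 table can be checked with Python's built-in string encoding. The helper name `describe` is an assumption made for this example.

```python
def describe(ch):
    """Return codepoint, codepoint in binary, and UTF-8 bytes in binary."""
    codepoint = ord(ch)
    utf8_bits = " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))
    return f"U+{codepoint:04X}", f"{codepoint:b}", utf8_bits


for ch in ["P", "π", "€"]:
    print(ch, *describe(ch))
```

The leading bits of the first byte (0, 110, 1110) announce how many bytes the character occupies, and every continuation byte starts with 10.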

44-49

Collecting documents: first challenge

Character encodings: ASCII, UTF-8, ISO-Latin-1

How the encoding is known: user-defined, annotated in the document, or machine learning

Software-specific encoding

Data-specific encoding: &amp; and &lt;

Binary encoding

Classification (e.g. ML): language, encoding, type

Target: UTF-8

50-54

Collecting documents: second challenge

Fine-grained documents (example: e-mail archive)

Grouping to a single document (example: LaTeX source)

55

Term Vocabulary

56-58

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

This is not trivial:

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-Packard

State-of-the-art

co-education

the hold-him-back-and-drag-him-away maneuver

data base

San Francisco

Los Angeles-based company

cheap San Francisco-Los Angeles fares

York University vs New York University

Credits: H. Schütze, LMU

60

Corner cases: numbers

3/20/91

20/3/91

Mar 20, 1991

B-52

100.2.86.144

(800) 234-2333

800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Chinese example sentence; the characters are not recoverable from the extracted text]

68-69

Hindi

मुझे इन्फ़ॉर्मेशन रिट्रीवल बहुत पसंद है

मुझे: "to me" (oblique case); है: 3rd person

मैं इस टर्म अच्छा पढ़ रहा हूँ (male speaker: raha)

मैं इस टर्म अच्छा पढ़ रही हूँ (female speaker: rahee)

मैं: "I"; हूँ: 1st person

70-79

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Successive parts of the word: eat, go to, want, past, <frustration>, inferential/evidential (turns out), 3rd/3rd, also

Translation: "Also, it turns out she/he wanted to go eat it, but..."

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80-83

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class (same sequences)
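The difference between tokens and types can be made concrete on the four example documents. The regular expression used for tokenizing is a deliberately naive assumption (runs of ASCII letters) and would mishandle the corner cases listed earlier.

```python
import re

docs = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def tokenize(text):
    """Naive tokenizer: every maximal run of letters is one token."""
    return re.findall(r"[A-Za-z]+", text)


# Tokens: every occurrence counts. Types: identical sequences count once.
tokens = [token for text in docs.values() for token in tokenize(text)]
types = set(tokens)

print(len(tokens))  # 29
print(len(types))   # 24
```

"hour", "come" and "it" occur more than once, which is why there are fewer types than tokens.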

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters' list

a an and are as at be by for from has he in is it its of on that the to was were will with

86

Trend: from a large list (200-300) to no list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

90-93

When to expand

Upon indexing: the postings lists of Lift and Elevator are expanded when the index is built, so the query Lift is answered from a single expanded list

Upon querying: the index keeps separate postings lists for Lift and Elevator, and the query Lift is expanded to Lift OR Elevator

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside
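Accent removal and lowercasing, the two rule-based normalizations above, can be sketched with Python's standard `unicodedata` module. The function names are assumptions made for this example, and truecasing is left out because it needs a trained model.

```python
import unicodedata


def strip_accents(term):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize(term):
    """Strip accents and diacritics, then lowercase."""
    return strip_accents(term).lower()


for term in ["cliché", "Zürich", "CAT", "Schweiz"]:
    print(term, "->", normalize(term))
```

Note that this generic rule maps Zürich to zurich; producing Zuerich requires a language-specific rule that rewrites ü as ue.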

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE: valenci -> valence
(m>0) ANCI -> ANCE: hesitanci -> hesitance
(m>0) IZER -> IZE: digitizer -> digitize
(m>0) ABLI -> ABLE: conformabli -> conformable
(m>0) ALLI -> AL: radicalli -> radical
(m>0) ENTLI -> ENT: differentli -> different
(m>0) ELI -> E: vileli -> vile
(m>0) OUSLI -> OUS: analogousli -> analogous
(m>0) IZATION -> IZE: vietnamization -> vietnamize
(m>0) ATION -> ATE: predication -> predicate
(m>0) ATOR -> ATE: operator -> operate
(m>0) ALISM -> AL: feudalism -> feudal
(m>0) IVENESS -> IVE: decisiveness -> decisive
(m>0) FULNESS -> FUL: hopefulness -> hopeful
(m>0) OUSNESS -> OUS: callousness -> callous
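The rules listed above can be applied with a small sketch. It covers only these fifteen suffix rules, not the full Porter algorithm, and the function names are assumptions made for this example. The condition m>0 means that the stem left after removing the suffix contains at least one vowel-consonant sequence.

```python
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    """A letter other than a, e, i, o, u, and other than y preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Porter's m: the number of vowel-to-consonant transitions in the stem."""
    m = 0
    previous_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m


def apply_rules(word):
    """Apply the first rule whose suffix matches, provided the stem has m > 0."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for word in ["valenci", "digitizer", "vietnamization", "operator", "hopefulness"]:
    print(word, "->", apply_rules(word))
```

IZATION is listed before ATION so that "vietnamization" is matched by the longer suffix first.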

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

[Figure: the Vauquois triangle. On each side, from bottom to top: words, syntactic structure, semantic structure, with the interlingua at the apex. Paths between the two sides: direct, syntactic transfer, semantic transfer.]

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103-104

Lemmatization or Stemming

Lemmatization does not help, and can even degrade performance, for English documents

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term        Document frequency   Postings
you         1                    1
come        3                    1 3 4
most        1                    1
carefully   1                    1
upon        1                    1
your        1                    1
hour        3                    1 2 3
take        1                    2
thy         1                    2
fair        1                    2
Laertes     1                    2
time        1                    2
be          1                    2
thine       1                    2
my          1                    3
is          1                    3
almost      1                    3
possess     1                    4
it          1                    4
merely      1                    4
that        1                    4
should      1                    4
to          1                    4
this        1                    4
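The postings of the four example documents can be rebuilt with a few lines of code. Splitting on whitespace and lowercasing every term are simplifying assumptions for this sketch, as is the function name `build_inverted_index`.

```python
from collections import defaultdict

docs = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() makes a term count once per document, even if it is repeated.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return index


index = build_inverted_index(docs)
for term in ["come", "hour", "it"]:
    print(term, len(index[term]), index[term])
```

Because documents are processed in increasing ID order, every postings list comes out sorted, which is what the intersection algorithm relies on.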

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Both lists are scanned until the end

Intersection of A and B: 1 4 8 12

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Chinese example sentence; the characters are not recoverable from the extracted text]

68

Hindi

मुझे इन्फ़र्मेशन रिट्रीवल बहुत पसंद है (मुझे = "to me", oblique case; 3rd person verb)

मैं इस टर्म अच्छा पढ़ रहा हूँ (male speaker: raha; 1st person verb)

मैं इस टर्म अच्छा पढ़ रही हूँ (female speaker: rahee; 1st person verb)

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morphemes, in order: eat, go to, want, past, <frustration>, inferential/evidential ("turns out"), 3rd/3rd person, also

"Also, it turns out she/he wanted to go eat it, but..."

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class (same sequences)
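The token/type distinction can be checked on the four example documents. The sketch below assumes whitespace tokenization and case-sensitive comparison, which is a simplification chosen for this example only.

```python
from collections import Counter

text = (
    "You come most carefully upon your hour "
    "Take thy fair hour Laertes time be thine "
    "My hour is almost come "
    "Possess it merely That it should come to this"
)

tokens = text.split()      # every occurrence counts as a token
types = Counter(tokens)    # identical character sequences form one type

print(len(tokens), "tokens")
print(len(types), "types")
print(types["hour"], "tokens of the type 'hour'")
```

With these assumptions the text has 29 tokens but only 24 types, because "hour" and "come" each occur three times and "it" occurs twice.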

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters' list

a an and are as at be by for from has he in is it its of on that the to was were will with
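A minimal sketch of stop-word filtering with the list above. The function name `remove_stop_words` is an assumption made for illustration, and matching is done on the lowercased token.

```python
STOP_WORDS = set(
    "a an and are as at be by for from has he in is it its of on "
    "that the to was were will with".split()
)


def remove_stop_words(tokens):
    # Keep only the tokens whose lowercased form is not in the stop list.
    return [t for t in tokens if t.lower() not in STOP_WORDS]


filtered = remove_stop_words("Possess it merely That it should come to this".split())
print(filtered)
```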

86

Trend: from a large list (200-300) to no list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. → USA

to-day → today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows → Windows

windows → windows, Windows, window

window → window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings lists of Lift and Elevator are expanded in the index; the query Lift is looked up as is

Upon querying: the index keeps the original postings lists of Lift and Elevator; the query Lift is expanded to Lift OR Elevator

[Figure: postings lists of Lift and Elevator before and after expansion]
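The two options can be sketched as follows. The postings lists, the expansion table and all function names are hypothetical values chosen for illustration; they are not the document IDs shown on the slide.

```python
# Hypothetical postings lists (sorted document IDs).
INDEX = {"lift": [1, 5], "elevator": [4, 6]}

# Hypothetical, deliberately asymmetric expansion table.
EXPANSIONS = {"lift": ["elevator"]}


def union(a, b):
    """Merge two sorted postings lists (this evaluates OR)."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    return result + a[i:] + b[j:]


def search_with_query_expansion(term):
    # Expansion upon querying: "lift" is evaluated as "lift OR elevator".
    result = INDEX.get(term, [])
    for other in EXPANSIONS.get(term, []):
        result = union(result, INDEX.get(other, []))
    return result


def expand_index():
    # Expansion upon indexing: store the already merged postings lists.
    return {term: search_with_query_expansion(term) for term in INDEX}


print(search_with_query_expansion("lift"))
print(search_with_query_expansion("elevator"))
```

Expanding upon indexing makes queries cheaper at the cost of longer postings lists; expanding upon querying keeps the index small but evaluates an extra OR per query.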

94

Normalization: accents and diacritics

cliché → cliche

Zürich → Zuerich

95

Normalization: lowercasing

CAT → cat

Schweiz → schweiz
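A rough sketch of accent removal and lowercasing using Python's standard `unicodedata` module. The German transliteration table (ü → ue and so on) is an assumption added for this example: plain diacritic stripping would give "zurich" rather than "zuerich".

```python
import unicodedata

# Assumed transliteration table for German; not taken from the slides.
GERMAN_TRANSLITERATION = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def strip_diacritics(token):
    # Decompose characters, then drop the combining marks (accents).
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def normalize(token, german=False):
    # Compose first so that the transliteration table matches, then lowercase.
    token = unicodedata.normalize("NFC", token).lower()
    if german:
        for source, target in GERMAN_TRANSLITERATION.items():
            token = token.replace(source, target)
    return strip_diacritics(token)


print(normalize("cliché"))
print(normalize("Zürich", german=True))
print(normalize("CAT"))
```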

96

Normalization: truecasing

The company Apple has launched a new iProduct → Apple

Apples fall from trees → apples

Machine learning inside

97

Stemming

analysis → analysi
feature → featur
are → ar
easily → easili
visible → visibl
variations → variat
individual → individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE        valenci -> valence
(m>0) ANCI -> ANCE        hesitanci -> hesitance
(m>0) IZER -> IZE         digitizer -> digitize
(m>0) ABLI -> ABLE        conformabli -> conformable
(m>0) ALLI -> AL          radicalli -> radical
(m>0) ENTLI -> ENT        differentli -> different
(m>0) ELI -> E            vileli -> vile
(m>0) OUSLI -> OUS        analogousli -> analogous
(m>0) IZATION -> IZE      vietnamization -> vietnamize
(m>0) ATION -> ATE        predication -> predicate
(m>0) ATOR -> ATE         operator -> operate
(m>0) ALISM -> AL         feudalism -> feudal
(m>0) IVENESS -> IVE      decisiveness -> decisive
(m>0) FULNESS -> FUL      hopefulness -> hopeful
(m>0) OUSNESS -> OUS      callousness -> callous
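The rules above can be sketched in a few lines. This covers only the listed rules, not the full Porter stemmer; it assumes lowercase input, and the condition `m>0` is read as "the remaining stem contains at least one vowel-consonant sequence". The function names are hypothetical.

```python
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        # "y" is treated as a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number m of vowel-consonant sequences in a stem of the form [C](VC)^m[V]."""
    m, previous_was_vowel = 0, False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m


def apply_rules(word):
    # Use the longest matching suffix; rewrite only if the stem has m > 0.
    matches = [rule for rule in RULES if word.endswith(rule[0])]
    if not matches:
        return word
    suffix, replacement = max(matches, key=lambda rule: len(rule[0]))
    stem = word[: -len(suffix)]
    return stem + replacement if measure(stem) > 0 else word


for example in ["valenci", "vietnamization", "operator", "ration"]:
    print(example, "->", apply_rules(example))
```

The last example stays unchanged: removing "ation" from "ration" leaves the stem "r", whose measure is 0.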

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

[Figure: on each side of the triangle, from bottom to top: Words, Syntactic Structure, Semantic Structure; Interlingua at the apex. Levels connecting the two sides: Direct, Syntactic Transfer, Semantic Transfer]

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

term (document frequency): postings

you (1): 1
come (3): 1 3 4
most (1): 1
carefully (1): 1
upon (1): 1
your (1): 1
thine (1): 2
be (1): 2
time (1): 2
Laertes (1): 2
thy (1): 2
fair (1): 2
take (1): 2
my (1): 3
hour (3): 1 2 3
is (1): 3
almost (1): 3
possess (1): 4
merely (1): 4
that (1): 4
it (1): 4
should (1): 4
to (1): 4
this (1): 4

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12

End
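The merge-style intersection from the reminder can be written as a short sketch; it assumes both postings lists are sorted in ascending order, and the function name is chosen for this example.

```python
def intersect(a, b):
    """Intersect two sorted postings lists with one pointer per list."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])   # common document: keep it, advance both
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                # advance the pointer with the smaller ID
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

The loop stops as soon as either pointer reaches the end of its list, so the number of comparisons is at most the sum of the two list lengths.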

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3 10 12 (built up step by step as the two pointers advance: 1, then 3, then 10, then 12)

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B so far: 1 3

Magical pointer: jump in List A directly to 10

Intersection of A and B: 1 3 10

127

In practice

List: 1 2 3 4 5 6 7 8 9 10 12, with skip pointers to 4, 7 and 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: about √p skip pointers, where p is the number of postings

Example list extended to 16 postings: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
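A sketch of intersection with skip pointers, placing a skip every ⌊√p⌋ postings. Storing the skips in a dictionary from index to target index is an assumption made for readability; a real index would store them inside the postings list.

```python
import math


def skip_targets(postings):
    """Map a position to the position its skip pointer jumps to (step = floor(sqrt(p)))."""
    p = len(postings)
    step = max(1, math.isqrt(p))
    return {i: i + step for i in range(0, p - step, step)}


def intersect_with_skips(a, b):
    skips_a, skips_b = skip_targets(a), skip_targets(b)
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Follow the skip only if it does not jump past the other pointer's ID.
            if i in skips_a and a[skips_a[i]] <= b[j]:
                i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                j = skips_b[j]
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print([list_a[t] for t in skip_targets(list_a).values()])
print(intersect_with_skips(list_a, list_b))
```

For the 11 postings of List A this places skips that land on 4, 7 and 10, so after matching 3 the pointer can move from 4 to 7 to 10 without visiting 5, 6, 8 and 9.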

133

Phrase search

134

Phrase queries

With a posting list

"ETH Zürich" (quotes = phrase search)

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Document: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to, ...

Query: "ETH Zurich" → bi-word ETH Zurich

Query: "Help ETH Zurich to flexibly react"

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Inconvenient

Query: "Help ETH Zurich to flexibly react"

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
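The false positive can be reproduced with a small sketch. It models only a single document instead of a full index, assumes whitespace tokenization and lowercasing, and uses hypothetical function names; the parameter `k` generalizes bi-words to longer word sequences.

```python
def k_word_terms(text, k=2):
    """All sequences of k consecutive words (k=2 gives the bi-words)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def candidate_match(query, document, k=2):
    # The AND of all k-word terms of the query: each must occur in the document.
    return k_word_terms(query, k) <= k_word_terms(document, k)


query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(candidate_match(query, document, k=2))   # every bi-word of the query occurs
print(query.lower() in document.lower())       # but the phrase itself does not
print(candidate_match(query, document, k=3))   # three-word terms reject the document
```

All five bi-words of the query occur in the document even though the phrase does not, which is exactly the false positive; with three-word terms, "Zurich to flexibly" is missing and the document is correctly rejected.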

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NX*N

Extended bi-words: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react

172

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to, ...

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to, ...

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Inconvenient

Size of the vocabulary increases exponentially: (#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (term → document ID (as before), term frequency: positions)

Help → C, 1: 1
ETH → C, 1: 2
Zurich → C, 1: 3
to → C, 3: 4 7 11
flexibly → C, 1: 5
react → C, 1: 6

Each such entry is a positional posting

190

Positional index

Query: "ETH Zurich"

ETH → C, 1: 2

Zurich → C, 1: 3

Match: position of Zurich = position of ETH + 1
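A minimal sketch of a positional index and of the "+1" check for phrase queries. Positions are 1-based as on the slides; document D is a hypothetical second document added to show a non-match, and the function names are assumptions made for this example.

```python
from collections import defaultdict

documents = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    # Hypothetical second document, added for illustration only.
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}


def build_positional_index(docs):
    # term -> {document ID: [positions]}; the term frequency is len(positions).
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index


def phrase_search(index, phrase):
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return []
    hits = []
    for doc_id, starts in index[words[0]].items():
        # The k-th word of the phrase must appear at start + k in the same document.
        if any(
            all(start + k in index[word].get(doc_id, []) for k, word in enumerate(words))
            for start in starts
        ):
            hits.append(doc_id)
    return sorted(hits)


index = build_positional_index(documents)
print(index["to"]["C"])
print(phrase_search(index, "ETH Zurich"))
print(phrase_search(index, "Help ETH Zurich to flexibly react"))
```

Unlike the bi-word approach, the long phrase query matches only document C here, since the positions in D are not consecutive.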

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131


This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

43

UTF-8

Character   Codepoint   Codepoint in binary    UTF-8 (variable length)
P           U+0050      101 0000               01010000
π           U+03C0      11 1100 0000           11001111 10000000
€           U+20AC      10 0000 1010 1100      11100010 10000010 10101100

Up to 7 bits: 1 byte

Up to 11 bits: 2 bytes

Up to 16 bits: 3 bytes
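The table above can be reproduced with Python's built-in string encoding. This is a minimal sketch; the helper name `utf8_bits` is only illustrative.

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 encoding of one character as space-separated bytes in binary."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))


for ch in ["P", "π", "€"]:
    # Character, codepoint, codepoint in binary, UTF-8 bytes
    print(ch, f"U+{ord(ch):04X}", f"{ord(ch):b}", utf8_bits(ch))
```

The number of bytes grows with the number of significant bits of the codepoint, which is why a byte offset in a UTF-8 file is not the same thing as a character offset.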

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification (e.g. ML)

Language

Encoding

Type

UTF-8

50

Collecting documents: second challenge

Fine-grained documents (example: e-mail archive)

Grouping to a single document (example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour
Document 2: Take thy fair hour Laertes time be thine
Document 3: My hour is almost come
Document 4: Possess it merely That it should come to this

Throw away punctuation

This is not trivial:

name@example.com
isn't
Jake O'Neill
the learn-it-all-by-heart methodology
website vs. web site
Kanton Basel Stadt

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Chinese example sentence: not recoverable from the extracted text]

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morpheme by morpheme: eat, go to, want, past, <frustration>, inferential/evidential ("turns out"), 3rd 3rd, also

"Also, it turns out she/he wanted to go eat it, but"

Source: "Polysynthetic Language: Central Siberian Yupik", W. J. De Reuse

80

Tokenize

You come most carefully upon your hour
Take thy fair hour Laertes time be thine
My hour is almost come
Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class (same sequences)

Stop words

85

Reuters' list

a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
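The three notions above (token, type, stop word) can be sketched on the four example documents. This is only an illustration: the regular expression is a deliberately naive tokenizer that ignores all the corner cases listed earlier, and the names `tokenize`, `STOP_WORDS` and `DOCS` are assumptions made for the example.

```python
import re

# The 25-word stop list shown on the slide.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has",
    "he", "in", "is", "it", "its", "of", "on", "that", "the", "to", "was",
    "were", "will", "with",
}

DOCS = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def tokenize(text):
    """Naive tokenizer: every maximal run of ASCII letters is one token."""
    return re.findall(r"[A-Za-z]+", text)


# Tokens: every occurrence counts.
tokens = [tok for text in DOCS.values() for tok in tokenize(text)]
# Types: identical character sequences collapse into one class.
types = set(tokens)
# Types that survive stop-word removal (compared case-insensitively).
content_types = {t for t in types if t.lower() not in STOP_WORDS}

print(len(tokens), len(types), len(content_types))
```

Here "hour" and "come" each occur three times as tokens but contribute a single type each.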

86

Trend: from a large list (200-300) to no list at all

87

Equivalence classes of terms (types)

U.S.A., USA

to-day, today

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the expansion is applied when postings are written, so that documents mentioning Elevator also appear in the postings list of Lift. The query Lift is then looked up as is.

Upon querying: the index is left as is and the query is expanded instead. Lift becomes Lift OR Elevator.

[Figure: postings lists for Lift and Elevator before and after a new document 6 is added, for both strategies]
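The two strategies can be contrasted in a few lines. This is a sketch under stated assumptions: the postings lists and the synonym table below are hypothetical and are not the values from the slide figure.

```python
# Hypothetical synonym table and postings lists, for illustration only.
SYNONYMS = {"lift": ["elevator"], "elevator": ["lift"]}
INDEX = {"lift": [1, 5], "elevator": [4, 6]}


def union(a, b):
    """Union of two postings lists, kept sorted."""
    return sorted(set(a) | set(b))


def search_with_query_expansion(index, term):
    """Expansion upon querying: 'lift' is evaluated as 'lift OR elevator'."""
    result = list(index.get(term, []))
    for synonym in SYNONYMS.get(term, []):
        result = union(result, index.get(synonym, []))
    return result


def expand_index(index):
    """Expansion upon indexing: each postings list absorbs its synonyms' postings."""
    expanded = {}
    for term, postings in index.items():
        merged = list(postings)
        for synonym in SYNONYMS.get(term, []):
            merged = union(merged, index.get(synonym, []))
        expanded[term] = merged
    return expanded


print(search_with_query_expansion(INDEX, "lift"))  # expanded at query time
print(expand_index(INDEX)["lift"])                 # expanded at indexing time
```

Both strategies return the same documents; the trade-off is a larger index in one case and a more expensive query in the other.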

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz
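The character-level rules seen so far (removing periods and hyphens, dropping diacritics, lowercasing) can be sketched with the standard library. The function names are illustrative. Note one assumption that differs from the slide: stripping combining marks turns Zürich into Zurich, whereas the German transliteration Zuerich would need a language-specific mapping.

```python
import unicodedata


def strip_diacritics(token):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(token):
    """Remove periods and hyphens, strip diacritics, lowercase."""
    token = token.replace(".", "").replace("-", "")
    return strip_diacritics(token).lower()


for raw in ["U.S.A.", "to-day", "cliché", "Zürich", "CAT", "Schweiz"]:
    print(raw, "->", normalize(raw))
```

Tokens that normalize to the same string fall into the same equivalence class.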

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI    -> ENCE    valenci -> valence
(m>0) ANCI    -> ANCE    hesitanci -> hesitance
(m>0) IZER    -> IZE     digitizer -> digitize
(m>0) ABLI    -> ABLE    conformabli -> conformable
(m>0) ALLI    -> AL      radicalli -> radical
(m>0) ENTLI   -> ENT     differentli -> different
(m>0) ELI     -> E       vileli -> vile
(m>0) OUSLI   -> OUS     analogousli -> analogous
(m>0) IZATION -> IZE     vietnamization -> vietnamize
(m>0) ATION   -> ATE     predication -> predicate
(m>0) ATOR    -> ATE     operator -> operate
(m>0) ALISM   -> AL      feudalism -> feudal
(m>0) IVENESS -> IVE     decisiveness -> decisive
(m>0) FULNESS -> FUL     hopefulness -> hopeful
(m>0) OUSNESS -> OUS     callousness -> callous
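A rough sketch of how such a rule table can be applied is given below. It covers only the rules listed above, not the full Porter stemmer, and the function names are illustrative. The measure m counts the vowel-consonant sequences in the stem that remains once the suffix is removed, following Porter's definition.

```python
import re

# Only the rules shown on the slide.
RULES = {
    "enci": "ence", "anci": "ance", "izer": "ize", "abli": "able",
    "alli": "al", "entli": "ent", "eli": "e", "ousli": "ous",
    "ization": "ize", "ation": "ate", "ator": "ate", "alism": "al",
    "iveness": "ive", "fulness": "ful", "ousness": "ous",
}


def is_consonant(word, i):
    """Porter's definition: 'y' is a consonant unless it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number m of vowel-consonant sequences in [C](VC)^m[V]."""
    pattern = "".join("c" if is_consonant(stem, i) else "v" for i in range(len(stem)))
    collapsed = re.sub(r"c+", "c", re.sub(r"v+", "v", pattern))
    return collapsed.count("vc")


def apply_rules(word):
    """Apply the longest matching suffix rule if the remaining stem has m > 0."""
    for suffix in sorted(RULES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + RULES[suffix] if measure(stem) > 0 else word
    return word


for w in ["valenci", "vietnamization", "operator", "callousness"]:
    print(w, "->", apply_rules(w))
```

Trying the longest suffix first matters: vietnamization must match IZATION rather than ATION.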

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

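The Vauquois Triangle

[Figure: on the source and the target side, the levels rise from Words through Syntactic Structure and Semantic Structure to the Interlingua at the apex; the horizontal links between the two sides are Direct, Syntactic transfer and Semantic transfer]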

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help, and can even degrade performance, for English documents

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term        Doc. freq.   Postings
you         1            1
come        3            1 3 4
most        1            1
carefully   1            1
upon        1            1
your        1            1
take        1            2
thy         1            2
fair        1            2
hour        3            1 2 3
Laertes     1            2
time        1            2
be          1            2
thine       1            2
my          1            3
is          1            3
almost      1            3
possess     1            4
it          1            4
merely      1            4
that        1            4
should      1            4
to          1            4
this        1            4
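The postings above can be rebuilt from the four example documents in a few lines. This is a minimal sketch that assumes whitespace tokenization and lowercasing; `build_index` and `DOCS` are illustrative names.

```python
from collections import defaultdict

DOCS = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # A set, because a document is modelled as a set of words.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


index = build_index(DOCS)
for term in ["come", "hour", "it"]:
    # Term, document frequency, postings list
    print(term, len(index[term]), index[term])
```

Because documents are visited in increasing ID order, every postings list comes out sorted, which the intersection algorithm relies on.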

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
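As a reminder of the mechanics, here is a sketch of the linear merge on two sorted postings lists; the function name `intersect` is illustrative.

```python
def intersect(a, b):
    """Intersect two sorted postings lists with one pointer per list."""
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer that lags behind
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.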

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3 10 12

How could we make this faster?

After 1 and 3 have been matched, list B is at 10 while list A still has to step through 4 5 6 7 8 9. A magical pointer would let list A jump directly to 10.

127

In practice

1 2 3 4 5 6 7 8 9 10 12, with skip pointers at 4, 7, 10

Two short skips: many comparisons, waste of space

Two long skips: few comparisons, not many real opportunities to skip

In practice: √p evenly spaced skip pointers, where p is the number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
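The following sketch adds skip pointers to the merge. It rests on assumptions that are not in the slides: lists are plain Python lists, and a skip pointer is simulated by index arithmetic, with one pointer every ⌊√p⌋ positions. The names `advance` and `intersect_with_skips` are illustrative.

```python
from math import isqrt


def advance(postings, pos, stride, target):
    """Follow the skip pointer at pos if it does not overshoot target, else step by one."""
    skip_to = pos + stride
    if pos % stride == 0 and skip_to < len(postings) and postings[skip_to] <= target:
        return skip_to
    return pos + 1


def intersect_with_skips(a, b):
    """Intersect two sorted postings lists, using about sqrt(p) skip pointers per list."""
    stride_a = max(1, isqrt(len(a)))
    stride_b = max(1, isqrt(len(b)))
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, stride_a, b[j])
        else:
            j = advance(b, j, stride_b, a[i])
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print(intersect_with_skips(list_a, list_b))
```

With 11 postings the stride is 3, so list A carries skip pointers from 1 to 4, from 4 to 7 and from 7 to 10. A skip is only taken when its target does not exceed the current value in the other list, so no common element can be jumped over.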

133

Phrase search

Phrase queries

"ETH Zürich"

Quotes = phrase search

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index:
Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Query: ETH Zurich

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Drawback

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains every biword of the query but not the phrase itself: false positive
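The false positive can be reproduced with a small biword index. This is a sketch assuming whitespace tokenization and lowercasing; the document IDs 1 and 2 and all function names are made up for the example.

```python
from collections import defaultdict

DOCS = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}


def biwords(text):
    """All pairs of consecutive words, e.g. 'help eth', 'eth zurich', ..."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def build_biword_index(docs):
    """Map each biword to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index


def phrase_query(index, phrase):
    """AND together the postings of every biword of the phrase."""
    result = None
    for biword in biwords(phrase):
        postings = index.get(biword, set())
        result = postings if result is None else result & postings
    return sorted(result or [])


index = build_biword_index(DOCS)
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

Document 2 is returned even though it does not contain the phrase, because each of the five biwords occurs somewhere in it.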

166

Part-of-speech tagging

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NXN (an N, any X in between, then the next N)

Help ETH
ETH Zurich
Zurich introduce
introduce techniques
techniques flexibly
flexibly react
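Given the tags, the extended biwords follow mechanically. In this sketch the N/X tags are typed in by hand exactly as on the slide; a real system would obtain them from a part-of-speech tagger, which is not shown here.

```python
# Tags copied from the slide: N = kept word, X = function word to be bridged.
TAGGED = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]


def extended_biwords(tagged):
    """Pair each N word with the next N word, skipping the X words in between."""
    kept = [word for word, tag in tagged if tag == "N"]
    return [f"{left} {right}" for left, right in zip(kept, kept[1:])]


print(extended_biwords(TAGGED))
```

The same transformation is applied to the query, so that "techniques so as to flexibly" and "techniques flexibly" end up as the same index entry.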

172


Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index:
Help ETH Zurich
ETH Zurich to
Zurich to flexibly
to flexibly react
flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index:
Help ETH Zurich to
ETH Zurich to flexibly
Zurich to flexibly react
to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Drawback

Size of the vocabulary increases exponentially: (number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting = document ID (as before), term frequency, positions

Term       Document ID   Term frequency   Positions
Help       C             1                1
ETH        C             1                2
Zurich     C             1                3
to         C             3                4 7 11
flexibly   C             1                5
react      C             1                6

Query: ETH Zurich

ETH: C, position 2

Zurich: C, position 3 = 2 + 1, so the phrase matches in C
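A positional index and the +1 check can be sketched as follows. Assumptions: whitespace tokenization, lowercasing, positions counted from 1 as on the slide, and a second document D added only to show that the earlier false positive disappears. All names are illustrative.

```python
from collections import defaultdict

DOCS = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}


def build_positional_index(docs):
    """term -> {document ID -> positions}; the term frequency is the list length."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Documents in which the k-th query word occurs at position start + k."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return []
    candidates = set(index[words[0]])
    for word in words[1:]:
        candidates &= set(index[word])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[words[0]][doc_id]:
            if all(start + k in index[word][doc_id] for k, word in enumerate(words)):
                hits.append(doc_id)
                break
    return hits


index = build_positional_index(DOCS)
print(index["to"]["C"])
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

Unlike the biword index, the vocabulary does not grow; the cost moves into the postings, which now store one position per occurrence.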

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings lists of Lift and Elevator are expanded, so that each term also points to the documents of the other.

Upon querying: the index is left as is, and the query Lift is expanded to Lift OR Elevator.
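The two expansion times can be sketched as follows. The postings are hypothetical values chosen only for illustration, and `SYNONYMS`, `expand_index` and `expand_query` are assumed names, not part of the lecture.

```python
# Hypothetical synonym table and postings lists, for illustration only.
SYNONYMS = {"lift": {"elevator"}, "elevator": {"lift"}}
INDEX = {"lift": [1, 5], "elevator": [4, 6]}


def expand_index(index, synonyms):
    """Index-time expansion: each term also receives its synonyms' postings."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for other in synonyms.get(term, ()):
            merged.update(index.get(other, ()))
        expanded[term] = sorted(merged)
    return expanded


def expand_query(term, index, synonyms):
    """Query-time expansion: evaluate `term OR synonym ...` on the plain index."""
    result = set(index.get(term, ()))
    for other in synonyms.get(term, ()):
        result.update(index.get(other, ()))
    return sorted(result)


print(expand_index(INDEX, SYNONYMS))
print(expand_query("lift", INDEX, SYNONYMS))
```

Both routes return the same documents in this sketch. Index-time expansion pays with larger postings lists, query-time expansion with more work per query.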

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz
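A rough sketch of these normalization rules, using only the Python standard library. The helper names are hypothetical. One assumption differs from the slide: stripping combining marks turns "Zürich" into "Zurich", whereas the "Zuerich" mapping would need a language-specific rule that is not implemented here.

```python
import unicodedata


def strip_diacritics(token):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(token):
    """Remove periods and hyphens, strip diacritics, then lowercase."""
    token = token.replace(".", "").replace("-", "")
    return strip_diacritics(token).lower()


for raw in ["U.S.A.", "to-day", "cliché", "Zürich", "CAT", "Schweiz"]:
    print(raw, "->", normalize(raw))
```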

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE        valenci -> valence
(m>0) ANCI -> ANCE        hesitanci -> hesitance
(m>0) IZER -> IZE         digitizer -> digitize
(m>0) ABLI -> ABLE        conformabli -> conformable
(m>0) ALLI -> AL          radicalli -> radical
(m>0) ENTLI -> ENT        differentli -> different
(m>0) ELI -> E            vileli -> vile
(m>0) OUSLI -> OUS        analogousli -> analogous
(m>0) IZATION -> IZE      vietnamization -> vietnamize
(m>0) ATION -> ATE        predication -> predicate
(m>0) ATOR -> ATE         operator -> operate
(m>0) ALISM -> AL         feudalism -> feudal
(m>0) IVENESS -> IVE      decisiveness -> decisive
(m>0) FULNESS -> FUL      hopefulness -> hopeful
(m>0) OUSNESS -> OUS      callousness -> callous
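The rules above can be tried out with a small sketch. It covers only these fifteen rules, not the full Porter stemmer. It assumes that m counts the vowel-consonant sequences of the stem left after removing the suffix, and that the longest matching suffix is the one that applies. The function names are hypothetical.

```python
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]
# Try longer suffixes first so that "ization" wins over "ation".
RULES.sort(key=lambda rule: len(rule[0]), reverse=True)


def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u, or a y after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the vowel-to-consonant transitions, i.e. the VC sequences."""
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m


def apply_rules(word):
    """Rewrite the longest matching suffix if the remaining stem has m > 0."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


print(apply_rules("vietnamization"), apply_rules("callousness"))
```

The condition (m>0) is what keeps a short word such as "ration" unchanged: its stem "r" contains no vowel-consonant sequence.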

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois Triangle

[Figure: triangle with the levels Words, Syntactic Structure and Semantic Structure on each side and Interlingua at the top; the paths between the two sides are labelled Direct, Syntactic Transfer and Semantic Transfer.]

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages

that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

term (document frequency): postings

you (1): 1
come (3): 1 3 4
most (1): 1
carefully (1): 1
upon (1): 1
your (1): 1
hour (3): 1 2 3
thine (1): 2
be (1): 2
time (1): 2
Laertes (1): 2
thy (1): 2
fair (1): 2
take (1): 2
my (1): 3
is (1): 3
almost (1): 3
possess (1): 4
it (1): 4
merely (1): 4
that (1): 4
should (1): 4
to (1): 4
this (1): 4

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12

End
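As a reminder, the linear merge can be sketched as below, assuming both postings lists are sorted and free of duplicates. The function name `intersect` is an illustrative choice.

```python
def intersect(a, b):
    """Intersect two sorted postings lists with one pointer per list."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document ID in both lists: keep it and advance both pointers.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot occur in b any more
        else:
            j += 1  # b[j] cannot occur in a any more
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each pointer only moves forward, so the number of comparisons is at most the sum of the two list lengths.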

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B, step by step:

1

1 3

1 3 10

1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B so far: 1 3

Magical pointer: skip ahead in List A directly to 10

Intersection of A and B: 1 3 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip pointers to 4, 7, 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p evenly spaced skips, where p is the number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
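A possible sketch of intersection with skip pointers. It assumes sorted lists stored as Python lists, with a skip pointer at every index that is a multiple of floor(√p). The placement rule and the name `intersect_with_skips` are assumptions made for illustration.

```python
import math


def intersect_with_skips(a, b):
    """Merge-intersect two sorted lists, following skip pointers when safe."""
    skip_a = max(1, math.isqrt(len(a)))  # skip length for list a
    skip_b = max(1, math.isqrt(len(b)))  # skip length for list b
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Follow the skip only if its target does not overshoot b[j].
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print(intersect_with_skips(list_a, list_b))
```

A skip is taken only when its target is not larger than the current element of the other list, so no common document ID can be jumped over.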

133

Phrase search

134

Phrase queries

With a posting list

"ETH Zürich"

Quotes = phrase search

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Query: ETH Zurich

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Inconvenient

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
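The false positive can be reproduced with a small sketch of a biword index. The document IDs 1 and 2 and the function names are hypothetical, and document 1 is a shortened version of the example sentence.

```python
from collections import defaultdict


def biwords(text):
    """Return every pair of consecutive words as one 'first second' string."""
    words = text.split()
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


def build_biword_index(documents):
    """Map each biword to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index


def phrase_query(index, phrase):
    """AND together the postings of all consecutive biwords of the phrase."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


DOCUMENTS = {
    1: "Help ETH Zurich to flexibly react to new challenges",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(DOCUMENTS)
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

Document 2 is returned although it does not contain the phrase: every biword of the query occurs somewhere in it, just not contiguously.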

166

Part-of-speech tagging

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

NX*N

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react
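A short sketch of how the NX*N pattern turns a tagged sentence into extended biwords. The tags are the ones shown on the slide and are hard-coded here; no actual part-of-speech tagger is involved, and `extended_biwords` is a hypothetical name.

```python
SENTENCE = "Help ETH Zurich to introduce techniques so as to flexibly react"
TAGS = "N N N X N N X X X N N"
TAGGED = list(zip(SENTENCE.split(), TAGS.split()))


def extended_biwords(tagged):
    """Pair each N word with the next N word, skipping any X words in between."""
    kept = [word for word, tag in tagged if tag == "N"]
    return [f"{first} {second}" for first, second in zip(kept, kept[1:])]


for biword in extended_biwords(TAGGED):
    print(biword)
```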

172

Bi-word indices

Query: Help ETH Zurich to flexibly react

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Inconvenient

Size of the vocabulary increases exponentially: (#Terms)^n for phrases of n words

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Positional posting: document ID (as before), term frequency, positions

Help -> C, 1: 1

ETH -> C, 1: 2

Zurich -> C, 1: 3

to -> C, 3: 4 7 11

flexibly -> C, 1: 5

react -> C, 1: 6

Query: ETH Zurich

ETH -> C, 1: 2

Zurich -> C, 1: 3

Position of Zurich = position of ETH + 1
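The "+1" check can be sketched as follows: a phrase matches when its k-th following term occurs at the start position plus k. Document D is a hypothetical second document added for contrast, positions start at 1 as on the slide, and the function names are illustrative.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map term -> {doc_id: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Return the documents where the phrase terms sit at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # The k-th following term must occur at position start + k.
            if all(start + k in index[term].get(doc_id, ())
                   for k, term in enumerate(terms[1:], start=1)):
                hits.add(doc_id)
                break
    return hits


DOCUMENTS = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(DOCUMENTS)
print(index["to"]["C"])
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

The term frequency is simply the length of each position list. Unlike the biword index, document D is not returned for the long phrase.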

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

45

Collecting documents: first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents: first challenge

Software-specific encoding

47

Collecting documents: first challenge

Data-specific encoding

&amp;

&lt;

48

Collecting documents: first challenge

Binary encoding

49

Collecting documents: first challenge

Classification (e.g., ML)

Language

Encoding

Type

UTF-8

50

Collecting documents: second challenge

Fine-grained documents (Example: e-mail archive)

Grouping to a single document (Example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation
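A minimal sketch of this step on the four documents. The punctuation in the strings has been added back for illustration (an assumption; the slide text as extracted has none), lowercasing is an extra assumption, and the function names are hypothetical.

```python
import string
from collections import defaultdict

# Punctuation re-added for illustration only.
DOCUMENTS = {
    1: "You come most carefully upon your hour.",
    2: "Take thy fair hour, Laertes; time be thine,",
    3: "My hour is almost come,",
    4: "Possess it merely. That it should come to this!",
}


def tokenize(text):
    """Throw away ASCII punctuation, lowercase, and split on whitespace."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return cleaned.lower().split()


def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


index = build_inverted_index(DOCUMENTS)
print(index["hour"], index["come"])
```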

58

Building an inverted index

Throw away punctuation

This is not trivial

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese


68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Source: Polysynthetic Language Central Siberian Yupik, W. J. De Reuse

eat | go to | want | Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

Size of the vocabulary increases exponentially with the phrase length n: up to (#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (a positional posting holds the document ID, as before, plus the term frequency and the positions):

Term      Document ID  Term frequency  Positions
Help      C            1               1
ETH       C            1               2
Zurich    C            1               3
to        C            3               4 7 11
flexibly  C            1               5
react     C            1               6

190

Positional index

Query: "ETH Zurich"

ETH: document C, term frequency 1, position 2

Zurich: document C, term frequency 1, position 3

Match: in document C, Zurich occurs at the position of ETH +1.
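A positional index and the "+1" check can be sketched as follows. This is a minimal illustration, not an optimized implementation: the function names are hypothetical, the index is a plain dictionary, and positions start at 1 as on the slides.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map term -> {doc_id: [positions]}; the term frequency is the list length."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Return the documents where the phrase terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # The i-th following term must sit exactly i positions after the first one.
            if all(start + offset in index[term].get(doc_id, [])
                   for offset, term in enumerate(terms[1:], start=1)):
                hits.add(doc_id)
                break
    return hits


documents = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_positional_index(documents)
print(index["to"]["C"])                  # positions of "to" in document C
print(phrase_query(index, "ETH Zurich"))
```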

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

46

Collecting documents: first challenge

Software-specific encoding

Data-specific encoding (e.g. &amp; and &lt;)

Binary encoding

Classification (e.g. ML) of language, encoding (e.g. UTF-8) and type
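Two of these layers can be peeled off with the Python standard library. The sketch below assumes the bytes are known to be UTF-8 and that the data-specific encoding consists of HTML/XML character entities; the sample string is made up for illustration.

```python
import html

# Hypothetical raw input: UTF-8 bytes that still contain character entities.
raw_bytes = b"Z\xc3\xbcrich &amp; Basel &lt; Bern"

text = raw_bytes.decode("utf-8")      # binary encoding -> characters
unescaped = html.unescape(text)       # data-specific encoding -> plain characters
print(unescaped)

# Decoding with the wrong encoding does not fail, it silently garbles the text.
wrong = raw_bytes.decode("latin-1")
print(wrong)
```

This is why the encoding has to be detected (or classified) before tokenization: the wrong guess yields different tokens rather than an error.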

50

Collecting documents: second challenge

Fine-grained documents (example: e-mail archive)

Grouping into a single document (example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

58

Building an inverted index

Throw away punctuation: this is not trivial

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt

59

Corner cases

Hewlett-Packard

State-of-the-art

co-education

the hold-him-back-and-drag-him-away maneuver

data base

San Francisco

Los Angeles-based company

cheap San Francisco-Los Angeles fares

York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases: numbers

3/20/91

20/3/91

Mar 20, 1991

B-52

100.2.86.144

(800) 234-2333

800.234.2333

Credits: H. Schütze, LMU
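To see why throwing away punctuation is not trivial, a deliberately naive tokenizer can be run on some of the corner cases. `naive_tokenize` is a hypothetical helper written for this illustration, not a recommended tokenizer.

```python
import re


def naive_tokenize(text):
    """Split on anything that is not an ASCII letter or digit, discarding punctuation."""
    return [token for token in re.split(r"[^A-Za-z0-9]+", text) if token]


examples = [
    "name@example.com", "isn't", "Jake O'Neill",
    "Mar 20, 1991", "(800) 234-2333", "B-52",
]
for example in examples:
    print(example, "->", naive_tokenize(example))
```

Each of these inputs is split into fragments that no longer identify the e-mail address, the name, the phone number or the aircraft type.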

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese


68

Hindi

मुझे इन्फ़र्मेशन रिट्रीवल बहुत पसंद है (मुझे = "to me", oblique case; है = 3rd person)

मैं इस टर्म अच्छा पढ़ रहा हूँ (male speaker: raha; हूँ = 1st person)

मैं इस टर्म अच्छा पढ़ रही हूँ (female speaker: rahee)

70

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morpheme by morpheme: eat, go to, want, past, <frustration>, inferential/evidential ("turns out"), 3rd/3rd, also

"Also, it turns out she/he wanted to go eat it, but ..."

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

81

Token

Token = grouped sequence of characters

82

Type

Type = equivalence class of tokens (same sequence of characters)

84

Stop words

Reuters' list

a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
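The difference between tokens and types, and the effect of the stop list above, can be counted on the four example documents. A minimal sketch: whitespace splitting stands in for real tokenization, and lowercasing before the stop-word lookup is an assumption of this example.

```python
documents = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# The 25 stop words of the list above.
STOP_WORDS = set(
    "a an and are as at be by for from has he in is it its of on that "
    "the to was were will with".split()
)

tokens = [token for text in documents for token in text.split()]
types = set(tokens)  # identical character sequences collapse into one type
content_tokens = [token for token in tokens if token.lower() not in STOP_WORDS]

print(len(tokens), len(types), len(content_tokens))
```

"hour" and "come" each contribute three tokens but a single type, and "it" occurs twice in Document 4.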

86

Trend: from a large list (200-300) to no list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. → USA

to-day → today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows → Windows

windows → windows, Windows, window

window → window, windows

Introduces asymmetry

90

When to expand?

Upon indexing: the postings lists of Lift and Elevator are expanded so that each term also points to the documents of the other (both lists become 1 4 5 6). The query Lift is then answered from a single list.

Upon querying: the index is left as it is, and the query Lift is expanded to Lift OR Elevator.
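Both options can be compared side by side in a short sketch. The postings and the synonym table below are hypothetical values chosen for illustration, and the function names are not taken from the lecture.

```python
# Hypothetical postings and expansion table.
index = {"lift": {1, 5}, "elevator": {4, 6}}
SYNONYMS = {"lift": ["elevator"], "elevator": ["lift"]}


def expand_index(index, synonyms):
    """Index-time expansion: each term also receives the postings of its synonyms."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for synonym in synonyms.get(term, []):
            merged |= index.get(synonym, set())
        expanded[term] = merged
    return expanded


def query_with_expansion(index, synonyms, term):
    """Query-time expansion: the term becomes (term OR synonym OR ...)."""
    result = set(index.get(term, set()))
    for synonym in synonyms.get(term, []):
        result |= index.get(synonym, set())
    return result


at_index_time = expand_index(index, SYNONYMS)["lift"]
at_query_time = query_with_expansion(index, SYNONYMS, "lift")
print(sorted(at_index_time), sorted(at_query_time))
```

The results are the same; the difference is where the cost goes. Expanding upon indexing stores more postings, while expanding upon querying unions several lists for every query.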

94

Normalization: accents and diacritics

cliché → cliche

Zürich → Zuerich

95

Normalization: lowercasing

CAT → cat

Schweiz → schweiz
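The character-level normalization rules seen so far (removing periods and hyphens, handling diacritics, lowercasing) fit in one small function. This is a sketch under assumptions: the `german` switch and the umlaut table are choices of this example, made because Zürich → Zuerich is a language-specific transliteration rather than plain accent stripping.

```python
import unicodedata

# Assumed transliteration table for German umlauts.
GERMAN_UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}


def normalize(token, german=False):
    """Transliterate umlauts (optionally), strip accents, drop . and -, lowercase."""
    if german:
        token = "".join(GERMAN_UMLAUTS.get(ch, ch) for ch in token)
    decomposed = unicodedata.normalize("NFD", token)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.replace(".", "").replace("-", "").lower()


for token in ["U.S.A.", "to-day", "cliché", "CAT", "Schweiz"]:
    print(token, "->", normalize(token))
print("Zürich", "->", normalize("Zürich", german=True))
```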

96

Normalization: truecasing

The company Apple has launched a new iProduct → Apple

Apples fall from trees → apples

Machine learning inside

97

Stemming

analysis → analysi

feature → featur

are → ar

easily → easili

visible → visibl

variations → variat

individual → individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE        valenci -> valence
(m>0) ANCI -> ANCE        hesitanci -> hesitance
(m>0) IZER -> IZE         digitizer -> digitize
(m>0) ABLI -> ABLE        conformabli -> conformable
(m>0) ALLI -> AL          radicalli -> radical
(m>0) ENTLI -> ENT        differentli -> different
(m>0) ELI -> E            vileli -> vile
(m>0) OUSLI -> OUS        analogousli -> analogous
(m>0) IZATION -> IZE      vietnamization -> vietnamize
(m>0) ATION -> ATE        predication -> predicate
(m>0) ATOR -> ATE         operator -> operate
(m>0) ALISM -> AL         feudalism -> feudal
(m>0) IVENESS -> IVE      decisiveness -> decisive
(m>0) FULNESS -> FUL      hopefulness -> hopeful
(m>0) OUSNESS -> OUS      callousness -> callous
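The rules above can be applied with a short sketch. It covers only this one group of rules, not the full Porter stemmer, and the function names are hypothetical. The condition (m>0) refers to the measure m of the stem, i.e. the number of vowel-consonant sequences in it.

```python
import re

RULES = [  # (suffix, replacement), copied from the list above
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u, or y after a consonant."""
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the vowel-consonant sequences: m in the form [C](VC)^m[V]."""
    forms = "".join("c" if is_consonant(stem, i) else "v" for i in range(len(stem)))
    return len(re.findall(r"v+c+", forms))


def apply_rules(word):
    """Rewrite the longest matching suffix, provided the remaining stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for word in ["valenci", "radicalli", "vietnamization", "hopefulness"]:
    print(word, "->", apply_rules(word))
```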

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

[Figure: on each side of the triangle, the levels Words, Syntactic Structure and Semantic Structure, with Interlingua at the apex. The levels are connected by Direct, Syntactic Transfer and Semantic Transfer.]

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help, and can even degrade performance, for English documents

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

Term        Document frequency  Postings
you         1                   1
come        3                   1 3 4
most        1                   1
carefully   1                   1
upon        1                   1
your        1                   1
hour        3                   1 2 3
thine       1                   2
be          1                   2
time        1                   2
Laertes     1                   2
thy         1                   2
fair        1                   2
take        1                   2
my          1                   3
is          1                   3
almost      1                   3
possess     1                   4
it          1                   4
merely      1                   4
that        1                   4
should      1                   4
to          1                   4
this        1                   4
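The postings above can be rebuilt from the four example documents with a few lines. A minimal sketch: whitespace splitting and lowercasing stand in for the full preprocessing pipeline, and `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict

documents = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_inverted_index(documents):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        # set() keeps one posting per document even if the term occurs twice.
        for term in set(documents[doc_id].lower().split()):
            index[term].append(doc_id)
    return index


index = build_inverted_index(documents)
for term in ("come", "hour", "it"):
    print(term, len(index[term]), index[term])
```

Visiting the documents in increasing ID order is what keeps every postings list sorted, which the intersection algorithm relies on.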

107

Reminder Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12

End
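As a reminder, the intersection walks both sorted lists with one pointer each and always advances the pointer on the smaller value. A minimal sketch with a hypothetical function name:

```python
def intersect(list_a, list_b):
    """Intersect two sorted postings lists in a single linear pass."""
    result = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```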

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: after the match on 3, the next posting in List B is 10, so the pointer in List A jumps straight to 10 instead of stepping through 4 to 9.

127

In practice

1 2 3 4 5 6 7 8 9 10 12, with skip pointers to 4, 7 and 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, but not many real opportunities to skip

In practice: √p evenly spaced skip pointers, where p is the number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
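The "magical pointer" can be approximated with skip pointers placed every √p postings. The sketch below is one possible way to do it, not the lecture's reference implementation: skips are kept in a dictionary from list index to target index, and the function names are hypothetical.

```python
import math


def skip_targets(postings):
    """Place about sqrt(p) evenly spaced skip pointers: index -> index skipped to."""
    p = len(postings)
    step = int(math.sqrt(p))
    if step < 2:
        return {}
    return {i: i + step for i in range(0, p - step, step)}


def intersect_with_skips(list_a, list_b):
    """Intersect two sorted lists, following a skip whenever it cannot miss a match."""
    skips_a, skips_b = skip_targets(list_a), skip_targets(list_b)
    result = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            # Safe to skip if the target is still <= the value we are looking for.
            if i in skips_a and list_a[skips_a[i]] <= list_b[j]:
                i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and list_b[skips_b[j]] <= list_a[i]:
                j = skips_b[j]
            else:
                j += 1
    return result


long_list = list(range(1, 17))      # p = 16, so a skip pointer every 4 postings
short_list = [1, 3, 10, 11, 12]
print(skip_targets(long_list))
print(intersect_with_skips(long_list, short_list))
```

A skip is only followed when its target does not exceed the current value of the other list, so every posting that is jumped over is strictly smaller than any possible match.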

133

Phrase search

134

Phrase queries

"ETH Zürich" (quotes = phrase search)

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Query: "ETH Zurich"

Index: "Help ETH", "ETH Zurich", "Zurich to", "to flexibly", "flexibly react", "react to"

Query: "Help ETH Zurich to flexibly react"

Intersect: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

157

Inconvenient

Query: "Help ETH Zurich to flexibly react"

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer to 10: after 3, the cursor in List A jumps directly to 10, which is then added to the intersection

127

In practice

Postings list: 1 2 3 4 5 6 7 8 9 10 12

Skip pointers: 4 7 10

Short skips: many comparisons, waste of space

Long skips: few comparisons, not many real opportunities to skip

In practice: √p evenly spaced skip pointers, where p is the number of postings

Example with p = 16: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
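The √p rule of thumb can be sketched as follows. This is only an illustration under stated assumptions: skip pointers are stored in a separate dictionary from list position to target position, the spacing is the integer part of √p, and the names `skip_targets` and `intersect_with_skips` are hypothetical.

```python
import math


def skip_targets(postings):
    """Place skip pointers every int(sqrt(p)) positions (position -> target position)."""
    p = len(postings)
    step = int(math.sqrt(p))
    if step < 2:
        return {}
    return {i: i + step for i in range(0, p - step, step)}


def intersect_with_skips(a, b):
    """Intersect two sorted lists, following a skip pointer whenever it does not overshoot."""
    skips_a, skips_b = skip_targets(a), skip_targets(b)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Everything before the skip target is smaller than the target itself,
            # so skipping is safe as long as the target is still <= b[j].
            if i in skips_a and a[skips_a[i]] <= b[j]:
                i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                j = skips_b[j]
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

On List A, after matching 3 the cursor sits on 4, follows a skip to 7 and then to 10, instead of stepping through 5, 6, 8 and 9 one by one.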

133

Phrase search

134

Phrase queries


136

With a posting list

"ETH Zürich" (quotes = phrase search)

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Document: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to, ...

Query: "ETH Zurich"

Lookup: ETH Zurich

Query: "Help ETH Zurich to flexibly react"

Lookup: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react (intersect the postings lists)

157

Drawback

Query: "Help ETH Zurich to flexibly react"

Lookup: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive: every bi-word of the query occurs in the document, but the phrase itself does not
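The false positive can be reproduced with a small bi-word index. This is a sketch under assumptions: the document IDs "A" and "B", the lowercasing, and the function names `biwords`, `build_biword_index` and `phrase_query` are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def biwords(text):
    """Return the consecutive word pairs of a text, lowercased."""
    tokens = text.lower().split()
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


def build_biword_index(documents):
    """Map each bi-word to the set of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index


def phrase_query(index, phrase):
    """Answer a phrase query as the AND of all its bi-words."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


documents = {
    "A": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "B": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(documents)
# Document B does not contain the phrase, yet it contains every bi-word of it.
print(sorted(phrase_query(index, "Help ETH Zurich to flexibly react")))
```

Document B is returned because "zurich to" and "to flexibly" both occur in it, although at different places.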

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags (N or X): Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NX*N

Resulting bi-words: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react

172

Bi-word indices (2)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Lookup: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react (intersect)

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

Lookup: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react (intersect)

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

Lookup: Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react (intersect)

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

Lookup: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Drawback

The size of the vocabulary increases exponentially with the phrase length n: (number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting: document ID (as before), term frequency, positions

Index

Term       Document ID   Term frequency   Positions
Help       C             1                1
ETH        C             1                2
Zurich     C             1                3
to         C             3                4 7 11
flexibly   C             1                5
react      C             1                6

Query: "ETH Zurich"

ETH: C, 1, position 2

Zurich: C, 1, position 3

Match: position of Zurich = position of ETH + 1
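A positional index and the "+1" check can be sketched as follows. The dictionary layout (term to document ID to list of 1-based positions), the second document "D" and the function names are assumptions for illustration; the term frequency is simply the length of the position list.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map term -> {document ID: [1-based positions]}, lowercasing all tokens."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, token in enumerate(text.lower().split(), start=1):
            index[token].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Return documents where the terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return set()
    matches = set()
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            # The k-th following term must sit exactly k positions after the first.
            if all(
                start + k in index.get(term, {}).get(doc_id, [])
                for k, term in enumerate(terms[1:], start=1)
            ):
                matches.add(doc_id)
                break
    return matches


documents = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(documents)
print(index["to"]["C"])                       # positions of "to" in document C
print(sorted(phrase_query(index, "ETH Zurich")))
print(sorted(phrase_query(index, "Help ETH Zurich to flexibly react")))
```

Unlike the bi-word approach, the longer query no longer matches document D, because "flexibly" is not at the position directly after "to" that follows "Zurich".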

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

48

Collecting documents: first challenge

Binary encoding

Classification (e.g., ML): language, encoding (e.g., UTF-8), type

50

Collecting documents: second challenge

Fine-grained documents (example: e-mail archive)

Grouping into a single document (example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

58

Building an inverted index

Throw away punctuation: this is not trivial

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Chinese example sentence; the characters did not survive extraction]

68

Hindi

मुझे इन्फ़ॉर्मेशन रिट्रीवल बहुत पसंद है

मैं इस टर्म अच्छा पढ़ रहा हूँ

मैं इस टर्म अच्छा पढ़ रही हूँ

Annotations: To me; Male speaker (raha); Female speaker (rahee); 3rd person; 1st person; Oblique case

70

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morphemes, in order: eat; go to; want; past; <frustration>; inferential/evidential (turns out); 3rd/3rd; also

Translation: Also, it turns out she/he wanted to go eat it, but ...

Source: "Polysynthetic Language: Central Siberian Yupik", W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class (same sequences)
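The difference between tokens and types can be made concrete on the four example documents. This is a minimal sketch that assumes whitespace tokenization and treats two tokens as the same type only if their character sequences are identical (so case matters); the variable names are chosen for illustration.

```python
DOCUMENTS = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# Tokens: every grouped sequence of characters, duplicates included.
tokens = [token for text in DOCUMENTS for token in text.split()]

# Types: equivalence classes of tokens with the same character sequence.
types = set(tokens)

print(len(tokens), "tokens")
print(len(types), "types")
```

The words "come", "hour" and "it" each occur more than once, which is why there are fewer types than tokens.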

84

Stop words

Reuters's list

a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
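Stop word removal with this list can be sketched as a simple filter. The function name `remove_stop_words` and the lowercasing before the lookup are assumptions for illustration.

```python
# The 25 stop words of the list above.
STOP_WORDS = set(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)


def remove_stop_words(text):
    """Lowercase and split a text, then drop every token found in the stop list."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


print(remove_stop_words("Possess it merely That it should come to this"))
print(remove_stop_words("Take thy fair hour Laertes time be thine"))
```

Note that "this" survives, because the list contains "that" but not "this".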

86

Trend

From a large list (200-300) to no list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

90

When to expand?

Upon indexing: the postings lists of Lift and Elevator are expanded in the index itself, so the query Lift is answered directly from the expanded list

Upon querying: the index is left unchanged, and the query Lift is expanded to Lift OR Elevator

[Figure: postings lists of Lift and Elevator before and after expansion]
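The two options can be compared on a toy index. Everything below is a sketch under assumptions: the postings lists, the `EQUIVALENTS` table and both function names are hypothetical and only illustrate the idea of expanding either the index or the query.

```python
# Hypothetical postings lists, for illustration only.
INDEX = {"lift": [1, 5], "elevator": [4, 6]}

# Hypothetical expansion table: which other terms each term should also match.
EQUIVALENTS = {"lift": ["elevator"], "elevator": ["lift"]}


def expand_at_index_time(index, equivalents):
    """Build a larger index in which each term already holds the expanded postings."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for other in equivalents.get(term, []):
            merged.update(index.get(other, []))
        expanded[term] = sorted(merged)
    return expanded


def query_with_expansion(index, equivalents, term):
    """Leave the index untouched and evaluate `term OR equivalent OR ...` instead."""
    result = set(index.get(term, []))
    for other in equivalents.get(term, []):
        result.update(index.get(other, []))
    return sorted(result)


print(expand_at_index_time(INDEX, EQUIVALENTS)["lift"])
print(query_with_expansion(INDEX, EQUIVALENTS, "lift"))
```

Both routes return the same documents; the first pays with a larger index, the second with more work per query.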

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz
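Accent removal and lowercasing can be combined in one normalization function. This is a sketch under assumptions: the function name `normalize`, the explicit German umlaut table (needed to obtain "Zuerich" rather than "Zurich"), and the order of the steps are choices made for illustration. The Unicode handling uses the standard `unicodedata` module.

```python
import unicodedata

# German convention: write umlauts as vowel + "e" before stripping other accents.
GERMAN_UMLAUTS = str.maketrans(
    {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}
)


def normalize(token):
    """Rewrite umlauts, strip remaining diacritics, then lowercase."""
    # Compose first so that "u" + combining diaeresis is seen as "ü".
    token = unicodedata.normalize("NFC", token).translate(GERMAN_UMLAUTS)
    # Decompose and drop the combining marks (accents) that are left.
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


for word in ("cliché", "Zürich", "CAT", "Schweiz"):
    print(word, "->", normalize(word))
```

Because lowercasing is applied last, the output for "Zürich" is "zuerich"; skipping that step would give the "Zuerich" form.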

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI    -> ENCE    valenci        -> valence
(m>0) ANCI    -> ANCE    hesitanci      -> hesitance
(m>0) IZER    -> IZE     digitizer      -> digitize
(m>0) ABLI    -> ABLE    conformabli    -> conformable
(m>0) ALLI    -> AL      radicalli      -> radical
(m>0) ENTLI   -> ENT     differentli    -> different
(m>0) ELI     -> E       vileli         -> vile
(m>0) OUSLI   -> OUS     analogousli    -> analogous
(m>0) IZATION -> IZE     vietnamization -> vietnamize
(m>0) ATION   -> ATE     predication    -> predicate
(m>0) ATOR    -> ATE     operator       -> operate
(m>0) ALISM   -> AL      feudalism      -> feudal
(m>0) IVENESS -> IVE     decisiveness   -> decisive
(m>0) FULNESS -> FUL     hopefulness    -> hopeful
(m>0) OUSNESS -> OUS     callousness    -> callous
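The rules above can be tried out with a small sketch. It covers only this one group of rules, not the whole Porter stemmer, and the function names `is_consonant`, `measure` and `apply_rules` are chosen for illustration. The measure m counts the vowel-consonant sequences of the stem written as [C](VC)^m[V], and the longest matching suffix is tried first, which is how the sketch resolves overlaps such as IZATION and ATION.

```python
VOWELS = "aeiou"

RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u, or a y after a consonant."""
    if word[i] in VOWELS:
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m


def apply_rules(word):
    """Apply the longest matching suffix rule if the remaining stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: len(rule[0]), reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for word in ("valenci", "vietnamization", "operator", "hopefulness"):
    print(word, "->", apply_rules(word))
```

The condition (m>0) prevents very short stems from being rewritten: a word consisting only of a suffix, such as "ation", is left unchanged.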

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

[Figure: the Vauquois triangle. Levels on each side, from bottom to top: Words, Syntactic Structure, Semantic Structure, with Interlingua at the apex. Labels between the two sides: Direct, Transfer (Syntactic, Semantic).]


49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

Morphemes, in order:
eat
go to
want
Past
<Frustration>
inferential evidential ("turns out")
3rd person acting on 3rd person
also

Translation: "Also, it turns out she/he wanted to go eat it, but..."

80

Tokenize

Document 1: You come most carefully upon your hour
Document 2: Take thy fair hour Laertes time be thine
Document 3: My hour is almost come
Document 4: Possess it merely That it should come to this

81

Token

Token = grouped sequence of characters

82

Type

Type = equivalence class of tokens (same sequence of characters)

84

Stop words

85

Reuters's list

a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
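A small sketch tying tokens, types, and stop words together on the four example documents. Whitespace splitting and lowercasing are simplifying assumptions made here for illustration; the stop list is the one shown on the slide.

```python
# The four example documents used throughout the lecture.
DOCUMENTS = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}

STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has",
    "he", "in", "is", "it", "its", "of", "on", "that", "the", "to", "was",
    "were", "will", "with",
}

def tokenize(text):
    """Split on whitespace and lowercase (a deliberately naive tokenizer)."""
    return text.lower().split()

# Tokens: every occurrence counts.
tokens = [tok for text in DOCUMENTS.values() for tok in tokenize(text)]
# Types: equivalence classes of identical character sequences.
types = set(tokens)
# Types that survive stop-word removal.
content_types = types - STOP_WORDS

print(len(tokens), len(types), len(content_types))
```

Under these assumptions the collection has more tokens than types, because words such as "hour" and "come" occur in several documents.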

86

Trend: from a large list (200-300 terms) to no list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. → USA

to-day → today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Query term → terms searched

Windows → Windows
windows → windows, Windows, window
window → window, windows

Introduces asymmetry

90

When to expand

Upon indexing

Original postings:
Lift → 1 5
Elevator → 1 4 6

Expanded postings:
Lift → 1 4 5 6
Elevator → 1 4 5 6

Query: Lift (left unchanged)

Upon querying

Postings (left unchanged):
Lift → 1 5
Elevator → 1 4 6

Query: Lift, expanded to Lift OR Elevator
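The two expansion strategies can be sketched as follows. The postings (Lift in documents 1 and 5, Elevator in 1, 4 and 6) are one reading of the slide's figure, and the names `SYNONYMS`, `expand_index` and `query_expanded` are hypothetical helpers introduced for illustration.

```python
# Assumed postings, read off the slide's figure.
POSTINGS = {"lift": [1, 5], "elevator": [1, 4, 6]}
# Hypothetical synonym table driving the expansion.
SYNONYMS = {"lift": {"elevator"}, "elevator": {"lift"}}

def expand_index(postings, synonyms):
    """Expansion upon indexing: merge the postings of synonyms into each list."""
    expanded = {}
    for term, docs in postings.items():
        merged = set(docs)
        for other in synonyms.get(term, ()):
            merged.update(postings.get(other, []))
        expanded[term] = sorted(merged)
    return expanded

def query_expanded(term, postings, synonyms):
    """Expansion upon querying: evaluate `term OR synonym ...` on the plain index."""
    result = set(postings.get(term, []))
    for other in synonyms.get(term, ()):
        result.update(postings.get(other, []))
    return sorted(result)

index_time = expand_index(POSTINGS, SYNONYMS)
print(index_time["lift"])                          # lookup in the larger index
print(query_expanded("lift", POSTINGS, SYNONYMS))  # union computed at query time
```

Both strategies return the same documents here; the trade-off is a larger index in the first case versus more work per query in the second.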

94

Normalization accents and diacritics

cliché → cliche

Zürich → Zuerich

95

Normalization lowercasing

CAT → cat

Schweiz → schweiz
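A hedged sketch of the three normalization rules seen so far (character removal, diacritic folding, lowercasing), using only the Python standard library. The function names are illustrative. Note that Unicode decomposition folds "Zürich" to "Zurich"; the German transliteration "Zuerich" shown on the slide would need an extra language-specific rule that is not implemented here.

```python
import unicodedata

def remove_characters(term: str) -> str:
    """Drop periods and hyphens, e.g. U.S.A. -> USA, to-day -> today."""
    return term.replace(".", "").replace("-", "")

def fold_diacritics(term: str) -> str:
    """Strip combining marks after canonical decomposition, e.g. cliché -> cliche."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(term: str) -> str:
    """Apply all three rules; the order chosen here is an assumption."""
    return fold_diacritics(remove_characters(term)).lower()

for raw in ["U.S.A.", "to-day", "cliché", "CAT", "Schweiz"]:
    print(raw, "->", normalize(raw))
```

Applying the same function to both documents and queries puts all variants of a term into one equivalence class.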

96

Normalization truecasing

The company Apple has launched a new iProduct → Apple

Apples fall from trees → apples

Machine learning inside

97

Stemming

analysis → analysi
feature → featur
are → ar
easily → easili
visible → visibl
variations → variat
individual → individu

Chop letters off the end of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE        valenci -> valence
(m>0) ANCI -> ANCE        hesitanci -> hesitance
(m>0) IZER -> IZE         digitizer -> digitize
(m>0) ABLI -> ABLE        conformabli -> conformable
(m>0) ALLI -> AL          radicalli -> radical
(m>0) ENTLI -> ENT        differentli -> different
(m>0) ELI -> E            vileli -> vile
(m>0) OUSLI -> OUS        analogousli -> analogous
(m>0) IZATION -> IZE      vietnamization -> vietnamize
(m>0) ATION -> ATE        predication -> predicate
(m>0) ATOR -> ATE         operator -> operate
(m>0) ALISM -> AL         feudalism -> feudal
(m>0) IVENESS -> IVE      decisiveness -> decisive
(m>0) FULNESS -> FUL      hopefulness -> hopeful
(m>0) OUSNESS -> OUS      callousness -> callous
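A sketch of how the rules listed above can be applied. It covers only this one group of rules, not the full Porter algorithm, and the names `measure` and `apply_rules` are illustrative. The measure m counts vowel-consonant sequences in the stem that remains once the suffix is removed, and a rule fires only when m > 0.

```python
VOWELS = "aeiou"

# (suffix, replacement) pairs copied from the rule list above.
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]

def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u, or a y preceded by a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions, i.e. the m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        vowel = not is_consonant(stem, i)
        if prev_vowel and not vowel:
            m += 1
        prev_vowel = vowel
    return m

def apply_rules(word):
    """Apply the longest matching suffix rule, provided the remaining stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

for w in ["valenci", "digitizer", "vietnamization", "hopefulness"]:
    print(w, "->", apply_rules(w))
```

The condition m > 0 protects very short words: a word whose stem has no vowel-consonant sequence is left unchanged.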

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois Triangle

Levels on each side (source and target language), from bottom to top: Words, Syntactic Structure, Semantic Structure
Apex: Interlingua
Paths between the two sides: Direct (words to words), Syntactic Transfer, Semantic Transfer

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have richer morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term        Doc. freq.   Postings
you         1            1
come        3            1 3 4
most        1            1
carefully   1            1
upon        1            1
your        1            1
thine       1            2
be          1            2
time        1            2
Laertes     1            2
thy         1            2
fair        1            2
take        1            2
my          1            3
hour        3            1 2 3
is          1            3
almost      1            3
possess     1            4
merely      1            4
that        1            4
it          1            4
should      1            4
to          1            4
this        1            4
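A minimal sketch of how such a standard inverted index can be built from the four example documents. Lowercasing and whitespace tokenization are simplifying assumptions, and `build_index` is an illustrative name.

```python
from collections import defaultdict

DOCUMENTS = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}

def build_index(documents):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        # A set removes duplicates: the Boolean model ignores repeated words.
        for term in set(documents[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)

index = build_index(DOCUMENTS)
for term in ("come", "hour", "it"):
    print(term, len(index[term]), index[term])
```

Because documents are visited in increasing ID order, every postings list comes out sorted, which is what the intersection algorithm relies on.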

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12
List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
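The intersection of two sorted postings lists can be sketched as a linear merge with one cursor per list. The function name `intersect` is illustrative.

```python
def intersect(a, b):
    """Intersect two sorted postings lists in O(len(a) + len(b)) comparisons."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])  # document appears in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1               # advance the cursor that is behind
        else:
            j += 1
    return result

list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

The loop stops as soon as either list is exhausted, since no further matches are possible.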

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B, built step by step: 1, then 1 3, then 1 3 10, then 1 3 10 12

Between the matches on 3 and 10, the cursor on List A has to step through 4, 5, 6, 7, 8 and 9 one posting at a time.

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: a pointer in List A that leads directly to 10 lets us skip the postings in between and add 10 to the result right away.

127

In practice

1 2 3 4 5 6 7 8 9 10 12
Skip pointers to: 4 7 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p evenly spaced skip pointers, where p is the number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
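A sketch of intersection with skip pointers. It assumes, for illustration, that every list carries a skip pointer at each position that is a multiple of ⌊√p⌋, where p is the list length; real indexes store the pointers explicitly, and the helper names are hypothetical.

```python
import math

def skip_step(postings):
    """Distance covered by one skip pointer: about sqrt(p) for p postings."""
    return max(1, int(math.sqrt(len(postings))))

def advance(postings, pos, step, target):
    """Move the cursor forward: follow skips that do not overshoot target, else step by one."""
    moved = False
    while (pos % step == 0 and pos + step < len(postings)
           and postings[pos + step] <= target):
        pos += step
        moved = True
    return pos if moved else pos + 1

def intersect_with_skips(a, b):
    """Intersect two sorted postings lists, using skip pointers where they help."""
    step_a, step_b = skip_step(a), skip_step(b)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, step_a, b[j])
        else:
            j = advance(b, j, step_b, a[i])
    return result

list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

A skip is only taken when the posting at its target is not larger than the value being searched for, so no match can be jumped over.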

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

"ETH Zürich"
Quotes = phrase search

ETH: 1 2 4 5 8 9 10
Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices
Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index:
Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to
...

Query: "ETH Zurich"
Look up the biword ETH Zurich directly in the index.

Query: "Help ETH Zurich to flexibly react"
Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react
Intersect the postings lists of these biwords.

157

Drawback

Query: "Help ETH Zurich to flexibly react"
Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains every biword of the query, but not the phrase itself.

False positive
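The false positive can be reproduced with a small biword index. The document IDs "C" and "D" and all function names are hypothetical labels chosen for this sketch.

```python
def biwords(text):
    """Return all pairs of consecutive words, joined by a space."""
    words = text.split()
    return [f"{first} {second}" for first, second in zip(words, words[1:])]

def build_biword_index(documents):
    """Map each biword to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for biword in biwords(text):
            index.setdefault(biword, set()).add(doc_id)
    return index

def phrase_candidates(query, index):
    """AND together the postings of every biword of the query."""
    postings = [index.get(biword, set()) for biword in biwords(query)]
    return set.intersection(*postings) if postings else set()

DOCS = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set "
         "new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
QUERY = "Help ETH Zurich to flexibly react"

biword_index = build_biword_index(DOCS)
candidates = phrase_candidates(QUERY, biword_index)
for doc_id in sorted(candidates):
    print(doc_id, "really contains the phrase:", QUERY in DOCS[doc_id])
```

Document D is returned even though the phrase never occurs in it, so a biword index needs a verification step on the candidates if exact phrase matches are required.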

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react
N    N   N      X  N         N          X  X  X  N        N

Pattern: N X* N

Extended biwords:
Help ETH
ETH Zurich
Zurich introduce
introduce techniques
techniques flexibly
flexibly react
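A sketch of how extended biwords follow from the tags. The tags are supplied by hand here (a real system would obtain them from a part-of-speech tagger), and `extended_biwords` is an illustrative name. Matching the pattern N X* N amounts to pairing each N word with the next N word, ignoring any X words in between.

```python
def extended_biwords(tagged):
    """Pair consecutive N-tagged words, skipping over any X-tagged words between them."""
    content_words = [word for word, tag in tagged if tag == "N"]
    return [f"{first} {second}"
            for first, second in zip(content_words, content_words[1:])]

# Hand-assigned tags, copied from the slide: N for content words, X for the rest.
TAGGED = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]

for biword in extended_biwords(TAGGED):
    print(biword)
```

The same transformation has to be applied to the query so that its extended biwords line up with the ones stored in the index.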

172

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Index:
Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react
Intersect

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index:
Help ETH Zurich
ETH Zurich to
Zurich to flexibly
to flexibly react
flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react
Intersect

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index:
Help ETH Zurich to
ETH Zurich to flexibly
Zurich to flexibly react
to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react
Intersect

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"
Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Drawback

The size of the vocabulary increases exponentially with the phrase length n: (#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Each positional posting holds the document ID (as before), the term frequency, and the positions.

Index:

Term       Document   Term frequency   Positions
Help       C          1                1
ETH        C          1                2
Zurich     C          1                3
to         C          3                4 7 11
flexibly   C          1                5
react      C          1                6

Query: "ETH Zurich"

ETH: C, position 2
Zurich: C, position 3

Match: the position of Zurich is the position of ETH + 1.
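A sketch of a positional index and a phrase query over it. Positions are counted from 1 as on the slides; the second document "D" and the function names are hypothetical additions for illustration, and terms are matched case-sensitively to keep the example short.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map term -> {doc_id: [positions]}; the term frequency is the list length."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index

def phrase_query(phrase, index):
    """Return documents in which the terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return set()
    # Only documents containing every term can match.
    docs = set(index[terms[0]])
    for term in terms[1:]:
        docs &= set(index[term])
    matches = set()
    for doc_id in docs:
        for start in index[terms[0]][doc_id]:
            # The k-th term of the phrase must sit at position start + k.
            if all(start + k in index[term][doc_id]
                   for k, term in enumerate(terms)):
                matches.add(doc_id)
                break
    return matches

DOCS = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set "
         "new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
positional_index = build_positional_index(DOCS)
print(positional_index["to"]["C"])
print(phrase_query("ETH Zurich", positional_index))
print(phrase_query("Help ETH Zurich to flexibly react", positional_index))
```

Unlike the biword index, the positional index rejects document D for the long query, because "flexibly" does not directly follow the first "to" there.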

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Drawback

Query: "Help ETH Zurich to flexibly react"

Rewritten query: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains every biword of the query, but not the phrase itself: false positive.
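The false positive can be reproduced with a few lines of Python. This is only a sketch: the function names and the first document are made up for illustration, and whitespace splitting stands in for real tokenization.

```python
from collections import defaultdict


def biwords(text):
    """Return the consecutive word pairs of a text, in order."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def build_biword_index(docs):
    """Map each biword to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index


def biword_phrase_query(index, phrase):
    """AND together the postings of all biwords of the phrase."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "Help ETH Zurich to flexibly react to new challenges",  # hypothetical
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
hits = biword_phrase_query(index, "Help ETH Zurich to flexibly react")
print(sorted(hits))  # [1, 2]
```

Document 2 is returned although it never contains the phrase, because each of the five biwords occurs somewhere in it.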

Part-of-speech tagging

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NXN (two N words, possibly separated by X words, form an extended biword)

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
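A minimal sketch of how the extended biwords follow from the tags. The tags are written by hand here, in the order they appear on the slide (N N N X N N X X X N N); a real system would obtain them from a part-of-speech tagger. The function name is illustrative.

```python
# Hand-assigned tags, as on the slide: N words are kept, X words are skipped.
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]


def extended_biwords(tagged_words):
    """Pair each N word with the next N word, ignoring X words in between."""
    kept = [word for word, tag in tagged_words if tag == "N"]
    return [f"{left} {right}" for left, right in zip(kept, kept[1:])]


for biword in extended_biwords(tagged):
    print(biword)
```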


Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index (3-word phrases): Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

Rewritten query: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect.

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index (4-word phrases): Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

Rewritten query: "Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect.

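Improves on the false positive issue

Query: "Help ETH Zurich to flexibly react"

Rewritten query (3-word phrases): "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

The document does not contain "Zurich to flexibly": true negative.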
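The effect of the phrase length can be checked directly. The sketch below compares the k-word phrases of the query with those of the document; `shingles` and `matches` are illustrative names, not part of the lecture.

```python
def shingles(text, k):
    """Return the set of all runs of k consecutive words in the text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def matches(document, phrase, k):
    """True if the document contains every k-word run of the phrase."""
    return shingles(phrase, k) <= shingles(document, k)


phrase = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(matches(document, phrase, 2))  # True: false positive with biwords
print(matches(document, phrase, 3))  # False: true negative with 3-word phrases
```

Note that a phrase shorter than k words has no k-word runs at all, so this sketch would accept every document for it; a real index needs a fallback for short queries.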

Drawback

The size of the vocabulary increases exponentially with the phrase length n: up to (number of terms)^n entries.

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (term: document ID, term frequency, positions)

Help: C, 1, <1>
ETH: C, 1, <2>
Zurich: C, 1, <3>
to: C, 3, <4, 7, 11>
flexibly: C, 1, <5>
react: C, 1, <6>

Each entry is a positional posting: the document ID (as before), the term frequency, and the positions of the term in the document.

Positional index

Query: "ETH Zurich"

ETH: C, 1, <2>
Zurich: C, 1, <3>

Zurich occurs at the position of ETH + 1, so document C matches the phrase.
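A small sketch of a positional index and of the "+1" check for phrase queries. Positions start at 1 as on the slides; the function names and the dictionary layout are assumptions made for this example, and a real implementation would merge sorted position lists instead of using membership tests.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Build term -> {doc_id: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Return the documents in which the phrase terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms:
        return set()
    result = set()
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            # The i-th following term must occur at position start + i.
            if all(start + offset in index.get(term, {}).get(doc_id, [])
                   for offset, term in enumerate(terms[1:], start=1)):
                result.add(doc_id)
                break
    return result


docs = {"C": "Help ETH Zurich to flexibly react to new challenges "
             "and to set new accents in the future"}
index = build_positional_index(docs)
print(index["to"]["C"])                   # [4, 7, 11]
print(phrase_query(index, "ETH Zurich"))  # {'C'}
```

The term frequency of a posting is simply the length of its position list, here 3 for "to".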

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

Collecting documents: second challenge

Fine-grained documents (example: e-mail archive)

Grouping into a single document (example: LaTeX source)

Term Vocabulary

Building an inverted index

Document 1: You come most carefully upon your hour
Document 2: Take thy fair hour Laertes time be thine
Document 3: My hour is almost come
Document 4: Possess it merely That it should come to this

Throw away punctuation

Building an inverted index

Throw away punctuation: this is not trivial

name@example.com
isn't
Jake O'Neill
the learn-it-all-by-heart methodology
website vs. web site
Kanton Basel Stadt

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

Swiss German

I ha tänkt du heggish en kaffee welle trinke

Chinese

[Example sentence in Chinese; the characters were lost in extraction]

Hindi

मुझे इन्फ़र्मेशन रिट्रीवल बहुत पसंद है
(मुझे: "to me", oblique case; है: 3rd person)

मैं इस टर्म अच्छा पढ़ रहा हूँ
(male speaker: raha; हूँ: 1st person)

मैं इस टर्म अच्छा पढ़ रही हूँ
(female speaker: rahee)

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morpheme by morpheme: eat, go to, want, past, <frustration>, inferential/evidential (turns out), 3rd/3rd, also

"Also, it turns out she/he wanted to go eat it, but"

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

Tokenize

You come most carefully upon your hour
Take thy fair hour Laertes time be thine
My hour is almost come
Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class of tokens (same sequence of characters)
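The difference between tokens and types can be counted on the four example documents. This sketch splits on whitespace only and applies no normalization, so "You" and "your" remain distinct types.

```python
docs = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# A token is one occurrence; a type is one distinct character sequence.
tokens = [token for doc in docs for token in doc.split()]
types = set(tokens)

print(len(tokens))  # 29
print(len(types))   # 24
```

"hour" and "come" occur three times each and "it" twice, which accounts for the five tokens that do not add a new type.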

Stop words

Reuters' list

a an and are as at be by for from has he in is it its of on that the to was were will with

Trend: from a large list (200-300) to no list at all
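As a sketch, stop-word removal with the 25-word list above might look as follows. Lowercasing before the lookup is an assumption made here so that "That" matches "that"; the function name is illustrative.

```python
STOP_WORDS = set(
    "a an and are as at be by for from has he in is it its of on "
    "that the to was were will with".split()
)


def remove_stop_words(text):
    """Lowercase the text, split on whitespace and drop the stop words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


print(remove_stop_words("Possess it merely That it should come to this"))
# ['possess', 'merely', 'should', 'come', 'this']
```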

Equivalence classes of terms (types)

to-day, today
U.S.A., USA
window, windows, Windows
Zurich, Zürich, Zuerich, Züri

Normalization

U.S.A. -> USA
to-day -> today

Rules that remove characters are the easy part

Expansion (rather than equivalence classes)

Windows -> Windows
windows -> windows, Windows, window
window -> window, windows

Introduces asymmetry

When to expand?

Upon indexing: the postings lists of Lift and Elevator are merged, so the query Lift is answered directly from the expanded list.

Upon querying: the index stays as it is, and the query Lift is expanded to Lift OR Elevator.
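Both options can be sketched on a toy index. The postings below are hypothetical document IDs chosen for illustration, and the synonym table and function names are assumptions of this example.

```python
# Hypothetical postings; the document IDs are made up for illustration.
index = {"lift": {1, 5}, "elevator": {4, 6}}
synonyms = {"lift": {"elevator"}, "elevator": {"lift"}}


def expand_index(index, synonyms):
    """Index-time expansion: each term also receives the postings of its synonyms."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for synonym in synonyms.get(term, ()):
            merged |= index.get(synonym, set())
        expanded[term] = merged
    return expanded


def query_with_expansion(index, synonyms, term):
    """Query-time expansion: evaluate term OR synonym ... on the unchanged index."""
    result = set(index.get(term, set()))
    for synonym in synonyms.get(term, ()):
        result |= index.get(synonym, set())
    return result


print(sorted(expand_index(index, synonyms)["lift"]))          # [1, 4, 5, 6]
print(sorted(query_with_expansion(index, synonyms, "lift")))  # [1, 4, 5, 6]
```

Both routes return the same documents. Index-time expansion pays with larger postings lists, query-time expansion with more work per query.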

Normalization: accents and diacritics

cliché -> cliche
Zürich -> Zuerich

Normalization: lowercasing

CAT -> cat
Schweiz -> schweiz

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple
Apples fall from trees -> apples

Machine learning inside
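Accent removal and lowercasing can be sketched with Python's standard `unicodedata` module. The German transliteration table (ü -> ue and so on) is an assumption added here to obtain "Zuerich"; generic accent stripping alone yields "Zurich". The table covers lowercase letters only.

```python
import unicodedata

# Assumed German-specific transliteration, applied before the generic rule.
GERMAN = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text, german=False):
    """Optionally transliterate German umlauts, then strip accents and lowercase."""
    if german:
        text = unicodedata.normalize("NFC", text).translate(GERMAN)
    return strip_diacritics(text).lower()


print(normalize("cliché"))               # cliche
print(normalize("Zürich", german=True))  # zuerich
print(normalize("Zürich"))               # zurich
print(normalize("CAT"))                  # cat
```

Truecasing is not covered by this sketch, since it needs a model of which capitalized words are proper nouns.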

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the word

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE        valenci -> valence
(m>0) ANCI -> ANCE        hesitanci -> hesitance
(m>0) IZER -> IZE         digitizer -> digitize
(m>0) ABLI -> ABLE        conformabli -> conformable
(m>0) ALLI -> AL          radicalli -> radical
(m>0) ENTLI -> ENT        differentli -> different
(m>0) ELI -> E            vileli -> vile
(m>0) OUSLI -> OUS        analogousli -> analogous
(m>0) IZATION -> IZE      vietnamization -> vietnamize
(m>0) ATION -> ATE        predication -> predicate
(m>0) ATOR -> ATE         operator -> operate
(m>0) ALISM -> AL         feudalism -> feudal
(m>0) IVENESS -> IVE      decisiveness -> decisive
(m>0) FULNESS -> FUL      hopefulness -> hopeful
(m>0) OUSNESS -> OUS      callousness -> callous
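The rules listed above can be turned into a short sketch. It implements only these fifteen rules, not the full Porter stemmer, and assumes lowercase input. The measure m counts the vowel-consonant sequences of the stem, following Porter's definition; the function names are illustrative.

```python
VOWELS = set("aeiou")

# The rules from the list above: (suffix, replacement), all with condition m > 0.
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    """'y' counts as a vowel only when it is preceded by a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of vowel-to-consonant transitions in the stem."""
    m = 0
    previous_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m


def apply_rules(word):
    """Apply the rule with the longest matching suffix if its stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


print(apply_rules("vietnamization"))  # vietnamize
print(apply_rules("operator"))        # operate
```

Trying the longest suffix first matters for words such as "vietnamization", which ends in both IZATION and ATION.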

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

The Vauquois Triangle

Levels, from bottom to top: Words, Syntactic Structure, Semantic Structure, Interlingua
Transfer between the two sides: Direct, Syntactic Transfer, Semantic Transfer

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

Lemmatization or Stemming

Lemmatization does not help, and can even degrade performance, for English documents

Lemmatization can help with languages that have more morphology

Skip lists

Reminder: Postings (Standard Inverted Index)

Term        Document frequency    Postings
you         1                     1
come        3                     1 3 4
most        1                     1
carefully   1                     1
upon        1                     1
your        1                     1
thine       1                     2
be          1                     2
time        1                     2
Laertes     1                     2
thy         1                     2
fair        1                     2
take        1                     2
my          1                     3
hour        3                     1 2 3
is          1                     3
almost      1                     3
possess     1                     4
merely      1                     4
that        1                     4
it          1                     4
should      1                     4
to          1                     4
this        1                     4
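A minimal sketch of how such a standard inverted index could be built from the four example documents. Lowercasing every token is an assumption of this example (so "Laertes" becomes "laertes"), and the function name is illustrative.

```python
from collections import defaultdict

docs = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of IDs of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() makes sure a document appears once per term ("it" occurs twice).
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(docs)
for term in ("come", "hour", "it"):
    # term, document frequency, postings
    print(term, len(index[term]), index[term])
```

Visiting the documents in increasing ID order keeps every postings list sorted, which the intersection algorithm relies on.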

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12
List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
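The intersection of two sorted postings lists can be sketched as the usual two-pointer merge. The function name is illustrative.

```python
def intersect(a, b):
    """Intersect two sorted postings lists in O(len(a) + len(b)) steps."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot be in b: everything left in b is larger
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))  # [1, 4, 8, 12]
```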

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

How could we make this faster?

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: list A jumps directly to 10 instead of stepping through 4 to 9.

In practice

List: 1 2 3 4 5 6 7 8 9 10 12
Skip pointers to: 4 7 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, but not many real opportunities to skip

In practice: about √p evenly spaced skip pointers, where p is the number of postings
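A sketch of intersection with skip pointers. Skip pointers are not stored explicitly here: every position that is a multiple of the step is assumed to carry a pointer to the position one step further, with a step of about √p. For the 11-posting list above the step is 3, which gives skip targets 4, 7 and 10. The function names are illustrative.

```python
import math


def advance(postings, pos, step, target):
    """Return the next position to inspect, given postings[pos] < target."""
    jump = pos + step
    if pos % step == 0 and jump < len(postings) and postings[jump] <= target:
        return jump   # follow the skip pointer: nothing skipped can equal target
    return pos + 1    # otherwise move one posting forward


def intersect_with_skips(a, b):
    """Intersect two sorted lists of distinct IDs, using about sqrt(p) skips each."""
    step_a = max(1, math.isqrt(len(a)))
    step_b = max(1, math.isqrt(len(b)))
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, step_a, b[j])
        else:
            j = advance(b, j, step_b, a[i])
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))  # [1, 3, 10, 12]
```

After matching 3, list A moves from 4 to 7 and then to 10 by following two skip pointers instead of inspecting every posting in between.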

Phrase search

Phrase queries

"ETH Zürich" (quotes = phrase search)

With a posting list

ETH: 1 2 4 5 8 9 10
Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

Phrase search approaches

Biword indices
Positional indices

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (first biwords): Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Query: "ETH Zurich"

The query is looked up directly as the biword ETH Zurich.

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12
List B: 1 3 10 11 12 14

Intersection of A and B: 1 3 10 12

123

How could we make this faster?

List A: 1 2 3 4 5 6 7 8 9 10 12
List B: 1 3 10 11 12 14

Intersection of A and B so far: 1 3

Magical pointer: jump straight to 10 in List A instead of visiting 4 to 9 one by one.

127

In practice

1 2 3 4 5 6 7 8 9 10 12, with skip pointers to 4, 7 and 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, but not many real opportunities to skip

In practice: about √p evenly spaced skip pointers, where p is the number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
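A skip-pointer intersection can be sketched as follows. This is an illustrative Python version under stated assumptions: the lists are plain arrays, skip pointers sit at every position that is a multiple of the skip length, and the skip length is about √p (a common heuristic). The function name `intersect_with_skips` is made up for the example.

```python
import math


def intersect_with_skips(a, b):
    """Intersect two sorted lists, following skip pointers whenever they do not overshoot."""
    skip_a = max(1, math.isqrt(len(a)))  # skip length of about sqrt(p)
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # A skip pointer exists only at multiples of skip_a; follow it only
            # if its target does not go past the value we are looking for.
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print(intersect_with_skips(list_a, list_b))
```

On these lists the pointer in List A jumps from 4 to 7 to 10 rather than stepping through every entry, which is the effect of the magical pointer.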

133

Phrase search

134

Phrase queries

With a posting list

"ETH Zürich" (quotes = phrase search)

ETH: 1 2 4 5 8 9 10
Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms.

139

Phrase search approaches

Biword indices
Positional indices

142

Bi-word indices

Document: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (one entry per pair of consecutive words):

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Query "ETH Zurich": look up the biword ETH Zurich directly.

Query "Help ETH Zurich to flexibly react": Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react, then intersect the postings lists.

157

Drawback

Query: Help ETH Zurich to flexibly react

Biword query: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains every biword of the query but not the phrase itself: false positive.
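The biword approach and its false positive can be reproduced with a small Python sketch. The document IDs 1 and 2 and the function names are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict

DOCUMENTS = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}


def biwords(text):
    """Return every pair of consecutive words as a single index term."""
    words = text.split()
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


def build_biword_index(documents):
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index


def phrase_query(index, phrase):
    """AND together the postings of all biwords of the phrase (two words or more)."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


index = build_biword_index(DOCUMENTS)
# Document 2 is returned although it does not contain the phrase: a false positive.
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```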

166

Part-of-speech tagging

Help (N) ETH (N) Zurich (N) to (X) introduce (N) techniques (N) so (X) as (X) to (X) flexibly (N) react (N)

Pattern: N X* N

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
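Given tokens that are already tagged N or X, the extended biwords follow from dropping the X tokens and pairing the remaining neighbours. A minimal Python sketch; the tags are hard-coded here as on the slide, and a real system would obtain them from a part-of-speech tagger.

```python
TAGGED = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]


def extended_biwords(tagged):
    """Pair each N token with the next N token, skipping any X tokens in between."""
    content_words = [word for word, tag in tagged if tag == "N"]
    return [f"{a} {b}" for a, b in zip(content_words, content_words[1:])]


print(extended_biwords(TAGGED))
```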

172

Bi-word indices

Query: Help ETH Zurich to flexibly react

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react, then intersect

Phrase indices (3)

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react, then intersect

Phrase indices (4)

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react, then intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Triword query: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

The document does not contain the triword Zurich to flexibly, so it is no longer returned: true negative.

Drawback

The size of the vocabulary increases exponentially with the phrase length n: up to (number of terms)^n entries.

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting = document ID (as before), term frequency, positions

Index:

Help: C, 1, positions 1
ETH: C, 1, positions 2
Zurich: C, 1, positions 3
to: C, 3, positions 4 7 11
flexibly: C, 1, positions 5
react: C, 1, positions 6

Query "ETH Zurich": ETH occurs at position 2 and Zurich at position 3 = 2 + 1, so document C matches.
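A positional index and the "+1" check for phrase queries can be sketched in Python as follows. Document C is the one from the slide; document D is a hypothetical second document added to show that the false positive of the biword index disappears. Function names are illustrative.

```python
from collections import defaultdict

DOCUMENTS = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",  # hypothetical
}


def build_positional_index(documents):
    """term -> {doc_id: [positions]}, with positions counted from 1 as on the slide."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Return the documents in which the words occur at consecutive positions."""
    words = phrase.split()
    if not words:
        return []
    matches = []
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # The k-th following word must sit exactly k positions after the first.
            if all(
                start + offset in index.get(word, {}).get(doc_id, [])
                for offset, word in enumerate(words[1:], start=1)
            ):
                matches.append(doc_id)
                break
    return sorted(matches)


index = build_positional_index(DOCUMENTS)
print(index["to"]["C"])  # the term frequency is the length of this list
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```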

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

53

Collecting documents: second challenge

Grouping to a single document (example: LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1: You come most carefully upon your hour
Document 2: Take thy fair hour Laertes time be thine
Document 3: My hour is almost come
Document 4: Possess it merely That it should come to this

Throw away punctuation

This is not trivial:

name@example.com
isn't
Jake O'Neill
the learn-it-all-by-heart methodology
website vs. web site
Kanton Basel-Stadt
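To see why throwing away punctuation is not trivial, consider a deliberately naive Python sketch that deletes every character other than word characters and whitespace. The function names are illustrative, and the inputs are typical cases of this kind.

```python
import re


def strip_punctuation(text):
    """Naive rule: delete everything that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text)


def tokenize(text):
    """Strip punctuation, then split on whitespace."""
    return strip_punctuation(text).split()


for example in ["name@example.com", "isn't", "Jake O'Neill", "learn-it-all-by-heart"]:
    print(repr(example), "->", tokenize(example))
```

The e-mail address collapses into one meaningless token, the apostrophes vanish, and the hyphenated expression becomes a single long token, none of which is necessarily what a user would search for.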

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Example sentence in Chinese characters; the characters did not survive extraction]

68

Hindi

मुझे इन्फ़र्मेशन रिट्रीवल बहुत पसंद है
(मुझे = "to me", oblique case; है = 3rd person)

मैं इस टर्म अच्छा पढ़ रहा हूँ
(male speaker: raha; हूँ = 1st person)

मैं इस टर्म अच्छा पढ़ रही हूँ
(female speaker: rahee)

70

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morpheme by morpheme: eat, go to, want, past, <frustration>, inferential/evidential ("turns out"), 3rd/3rd person, also

"Also, it turns out she/he wanted to go eat it, but..."

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You | come | most | carefully | upon | your | hour
Take | thy | fair | hour | Laertes | time | be | thine
My | hour | is | almost | come
Possess | it | merely | That | it | should | come | to | this

Token = grouped sequence of characters

Type = equivalence class of tokens (same sequence of characters)
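The difference between tokens and types can be made concrete with a few lines of Python over the four example documents. Splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

DOCUMENTS = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# Tokens: every occurrence counts. Types: identical character sequences count once.
tokens = [token for text in DOCUMENTS for token in text.split()]
types = set(tokens)
occurrences = Counter(tokens)

print(len(tokens), "tokens,", len(types), "types")
print(occurrences["hour"], occurrences["come"], occurrences["it"])
```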

84

Stop words

Reuters' list:

a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with

Trend: large list (200-300 words) -> no list at all
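Stop word removal is a simple filter over the token stream. A minimal Python sketch using the 25-word list above; the function name `remove_stop_words` is illustrative, and the comparison is done on the lowercased token.

```python
STOP_WORDS = frozenset(
    "a an and are as at be by for from has he in is it its of on "
    "that the to was were will with".split()
)


def remove_stop_words(tokens):
    """Keep only the tokens whose lowercased form is not in the stop list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


print(remove_stop_words("Possess it merely That it should come to this".split()))
```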

87

Equivalence classes of terms (types)

to-day, today
U.S.A., USA
window, windows, Windows
Zurich, Zürich, Zuerich, Züri

Normalization

U.S.A. -> USA
to-day -> today

Rules that remove characters are the easy part
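A character-removal rule of this kind fits in a few lines. The following Python sketch removes periods and hyphens and groups tokens by their normalized form; the function names are illustrative, and the choice of characters to remove is an assumption made for the example.

```python
from collections import defaultdict


def normalize(token):
    """Remove periods and hyphens from a token."""
    return token.replace(".", "").replace("-", "")


def equivalence_classes(tokens):
    """Group tokens that share the same normalized form."""
    classes = defaultdict(set)
    for token in tokens:
        classes[normalize(token)].add(token)
    return dict(classes)


print(equivalence_classes(["U.S.A.", "USA", "to-day", "today"]))
```

Variants such as window, windows and Windows are not merged by this rule; they require case folding or stemming.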

89

Expansion (rather than equivalence classes)

Windows -> Windows
windows -> windows, Windows, window
window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings lists of Lift and Elevator are each expanded to the union of both lists, and the query Lift is looked up as is.

Upon querying: the index is left unchanged, and the query Lift is expanded to Lift OR Elevator.
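Both options can be sketched in Python. The postings below are hypothetical values chosen for illustration, and the function names are made up for the example.

```python
# Hypothetical postings and synonym table, for illustration only.
POSTINGS = {"lift": [1, 5], "elevator": [4, 6]}
SYNONYMS = {"lift": {"lift", "elevator"}, "elevator": {"lift", "elevator"}}


def union_of_postings(postings, terms):
    ids = set()
    for term in terms:
        ids.update(postings.get(term, []))
    return sorted(ids)


def expand_index(postings, synonyms):
    """Expansion upon indexing: every term stores the union of its synonyms' postings."""
    return {
        term: union_of_postings(postings, synonyms.get(term, {term}))
        for term in postings
    }


def query_with_expansion(postings, synonyms, term):
    """Expansion upon querying: evaluate 'term OR synonym OR ...' on the unchanged index."""
    return union_of_postings(postings, synonyms.get(term, {term}))


print(expand_index(POSTINGS, SYNONYMS)["lift"])
print(query_with_expansion(POSTINGS, SYNONYMS, "lift"))
```

Both strategies return the same documents. Expanding the index costs space, while expanding the query costs time at query time.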

94

Normalization: accents and diacritics

cliché -> cliche
Zürich -> Zuerich

Normalization: lowercasing

CAT -> cat
Schweiz -> schweiz
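These normalizations can be sketched with the Python standard library. Stripping diacritics through Unicode decomposition turns Zürich into Zurich, so the German convention of writing ue for ü needs its own small table; that table and the function names are assumptions made for this example.

```python
import unicodedata

# German transliteration table, written with escapes to avoid encoding issues:
# a-umlaut, o-umlaut, u-umlaut and sharp s.
GERMAN = str.maketrans({"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss"})


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def transliterate_german(text):
    """Replace lowercase umlauts and sharp s by their two-letter spellings."""
    return text.translate(GERMAN)


print(strip_diacritics("clich\u00e9"))          # cliche with an acute accent
print(transliterate_german("Z\u00fcrich"))      # Zurich with an umlaut
print("CAT".lower(), "Schweiz".lower())
```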

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple
Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the end of the word


54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

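म इस टम9 अछा पढ़ रह ह

Annotations: the first sentence uses "to me" (oblique case) with the verb in the 3rd person. The second sentence has the verb in the 1st person: raha for a male speaker, rahee for a female speaker.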

70

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

Morpheme by morpheme:
eat
go to
want
Past
<Frustration>
inferential/evidential (turns out)
3rd/3rd
also

Also, it turns out she/he wanted to go eat it, but ...

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

81

Token

Token = grouped sequence of characters. Every occurrence counts: "hour" in Documents 1, 2 and 3 yields three tokens.

82

Type

Type = equivalence class of tokens (same character sequence). The three "hour" tokens form a single type.
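The difference between tokens and types can be checked on the four example documents. This is a minimal sketch that assumes whitespace tokenization and case-sensitive comparison; the names `tokenize`, `tokens` and `types` are illustrative.

```python
from collections import Counter

documents = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def tokenize(text: str) -> list[str]:
    """Assumed tokenizer: split on whitespace, nothing else."""
    return text.split()


# Tokens: every occurrence, in order. Types: distinct character sequences.
tokens = [tok for text in documents.values() for tok in tokenize(text)]
types = Counter(tokens)

if __name__ == "__main__":
    print(len(tokens), "tokens,", len(types), "types")
    print("hour occurs", types["hour"], "times but is a single type")
```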

84

Stop words


85

Reuters list

a an and are as at be by for from has he in

is it its of on that the to was were will with
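Applying such a stop list is a simple filter. The sketch below uses the 25 words of the list above; the function name `remove_stop_words` and the choice to compare in lowercase are assumptions made for illustration.

```python
STOP_WORDS = frozenset(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only the tokens whose lowercase form is not in the stop list."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]


if __name__ == "__main__":
    sentence = "Possess it merely That it should come to this"
    print(remove_stop_words(sentence.split()))
```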

86

Trend: from a large list (200-300 terms) to no list at all

87

Equivalence classes of terms (types)

U.S.A., USA

to-day, today

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. → USA

to-day → today

Rules that remove characters are the easy part
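A character-removal rule of this kind fits in one line. The sketch below assumes that periods and hyphens are the characters to drop; the function name `normalize` is illustrative.

```python
def normalize(token: str) -> str:
    """Map a token to its equivalence class by deleting periods and hyphens."""
    return token.replace(".", "").replace("-", "")


if __name__ == "__main__":
    for token in ["U.S.A.", "USA", "to-day", "today"]:
        print(token, "->", normalize(token))
```

Tokens that normalize to the same string fall into the same equivalence class. Deciding when removal is safe is the hard part that a rule like this does not address.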

89

Expansion (rather than equivalence classes)

Query term → terms searched

Windows → Windows

windows → windows, Windows, window

window → window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings of Lift and Elevator are merged, so both terms point to the same expanded list (1 4 5 6). The query "Lift" is answered directly from that list.

Upon querying: the index keeps separate postings for Lift and Elevator. The query "Lift" is expanded to "Lift OR Elevator".
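The two options can be compared in a small sketch. The postings for "lift" and "elevator" and the synonym table are illustrative assumptions; only their union (1 4 5 6) is taken from the slides.

```python
# Hypothetical synonym table and unexpanded index.
SYNONYMS = {"lift": ["elevator"], "elevator": ["lift"]}
index = {"lift": [1, 5], "elevator": [4, 6]}


def expand_index(index: dict, synonyms: dict) -> dict:
    """Expansion upon indexing: store the merged postings under every term."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for synonym in synonyms.get(term, []):
            merged.update(index.get(synonym, []))
        expanded[term] = sorted(merged)
    return expanded


def query_with_expansion(term: str, index: dict, synonyms: dict) -> list:
    """Expansion upon querying: evaluate term OR synonym_1 OR ... at query time."""
    result = set(index.get(term, []))
    for synonym in synonyms.get(term, []):
        result.update(index.get(synonym, []))
    return sorted(result)


if __name__ == "__main__":
    print(expand_index(index, SYNONYMS)["lift"])
    print(query_with_expansion("lift", index, SYNONYMS))
```

Both give the same answer. Expanding the index costs space, while expanding the query costs time on every lookup.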

94

Normalization: accents and diacritics

cliché → cliche

Zürich → Zuerich
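A minimal sketch of both styles of folding, using Python's standard `unicodedata` module. Plain diacritic stripping turns ü into u, so producing "Zuerich" needs a language-specific transliteration table; the table and the function names here are assumptions for illustration.

```python
import unicodedata

# Assumed German transliteration table (applied before stripping accents).
GERMAN_TRANSLITERATION = str.maketrans(
    {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
)


def strip_diacritics(token: str) -> str:
    """Decompose characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_german(token: str) -> str:
    """Transliterate umlauts the German way, then strip remaining accents."""
    composed = unicodedata.normalize("NFC", token)
    return strip_diacritics(composed.translate(GERMAN_TRANSLITERATION))


if __name__ == "__main__":
    print(strip_diacritics("cliché"), strip_diacritics("Zürich"),
          fold_german("Zürich"))
```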

95

Normalization: lowercasing

CAT → cat

Schweiz → schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct → Apple

Apples fall from trees → apples

Machine learning inside

97

Stemming

analysis → analysi
feature → featur
are → ar
easily → easili
visible → visibl
variations → variat
individual → individu

Chop letters off the end of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE          valenci -> valence
(m>0) ANCI -> ANCE          hesitanci -> hesitance
(m>0) IZER -> IZE           digitizer -> digitize
(m>0) ABLI -> ABLE          conformabli -> conformable
(m>0) ALLI -> AL            radicalli -> radical
(m>0) ENTLI -> ENT          differentli -> different
(m>0) ELI -> E              vileli -> vile
(m>0) OUSLI -> OUS          analogousli -> analogous
(m>0) IZATION -> IZE        vietnamization -> vietnamize
(m>0) ATION -> ATE          predication -> predicate
(m>0) ATOR -> ATE           operator -> operate
(m>0) ALISM -> AL           feudalism -> feudal
(m>0) IVENESS -> IVE        decisiveness -> decisive
(m>0) FULNESS -> FUL        hopefulness -> hopeful
(m>0) OUSNESS -> OUS        callousness -> callous
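The rules above can be tried out with a short sketch. It covers only the listed rules, not the full Porter algorithm. The measure m counts the vowel-to-consonant transitions in the stem that remains after removing the suffix, following Porter's [C](VC)^m[V] form. The function names are illustrative assumptions.

```python
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word: str, i: int) -> bool:
    """A consonant is any letter other than a, e, i, o, u,
    and other than y when y follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Number of VC sequences, i.e. m in [C](VC)^m[V]."""
    m = 0
    previous_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m


def apply_rules(word: str) -> str:
    """Pick the longest matching suffix; rewrite it only if m > 0 for the stem."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


if __name__ == "__main__":
    for word in ["valenci", "vietnamization", "hopefulness", "station"]:
        print(word, "->", apply_rules(word))
```

The word "station" is left untouched: its stem "st" has m = 0, so the ATION rule does not fire.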

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Original: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

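The Vauquois Triangle

[Figure: on each side of the triangle, the source and the target language rise from Words through Syntactic Structure and Semantic Structure to the Interlingua at the apex. Translation paths: Direct (between words), Syntactic Transfer, Semantic Transfer.]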

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

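Reminder: Postings (Standard Inverted Index)

term (document frequency) → postings

you (1) → 1
come (3) → 1, 3, 4
most (1) → 1
carefully (1) → 1
upon (1) → 1
your (1) → 1
thine (1) → 2
be (1) → 2
time (1) → 2
Laertes (1) → 2
thy (1) → 2
fair (1) → 2
take (1) → 2
my (1) → 3
hour (3) → 1, 2, 3
is (1) → 3
almost (1) → 3
possess (1) → 4
merely (1) → 4
that (1) → 4
it (1) → 4
should (1) → 4
to (1) → 4
this (1) → 4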
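Such an index can be rebuilt from the four example documents in a few lines. This is a sketch that assumes whitespace tokenization and lowercases every term (so "Laertes" becomes "laertes"); the function name `build_inverted_index` is illustrative.

```python
from collections import defaultdict

documents = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_inverted_index(documents: dict) -> dict:
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        # A set removes duplicates within a document ("it" occurs twice in 4).
        for term in set(documents[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


if __name__ == "__main__":
    index = build_inverted_index(documents)
    for term in ("come", "hour", "it"):
        print(term, len(index[term]), index[term])
```

The document frequency of a term is simply the length of its postings list.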

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
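As a reminder in code, here is a minimal sketch of the linear merge over two sorted postings lists. The function name `intersect` is an assumption.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted postings lists by advancing two pointers."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind: advance it
        else:
            j += 1  # b is behind: advance it
    return result


if __name__ == "__main__":
    list_a = [1, 2, 4, 5, 8, 9, 10, 12]
    list_b = [1, 3, 4, 6, 7, 8, 11, 12]
    print(intersect(list_a, list_b))
```

Each step advances at least one pointer, so the running time is linear in the combined length of the two lists.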

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B so far: 1 3

Magical pointer: List B is now at 10, so List A jumps straight to 10 instead of stepping through 4 to 9

Intersection of A and B after the jump: 1 3 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12, with skips to 4, 7, 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p evenly spaced skips, where p is the number of postings
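A minimal sketch of intersection with skip pointers, assuming postings stored in Python lists and a skip pointer every ⌊√p⌋ positions. The skip pointers are implicit (index arithmetic) rather than stored, and the names `advance` and `intersect_with_skips` are illustrative.

```python
import math


def advance(postings: list[int], pos: int, stride: int, target: int) -> int:
    """Move past postings[pos]. If pos carries a skip pointer, follow skips
    as long as they do not overshoot target; otherwise step by one."""
    if (pos % stride == 0 and pos + stride < len(postings)
            and postings[pos + stride] <= target):
        while pos + stride < len(postings) and postings[pos + stride] <= target:
            pos += stride
        return pos
    return pos + 1


def intersect_with_skips(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted postings lists, skipping where possible."""
    stride_a = max(1, math.isqrt(len(a)))
    stride_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, stride_a, b[j])
        else:
            j = advance(b, j, stride_b, a[i])
    return result


if __name__ == "__main__":
    list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
    list_b = [1, 3, 10, 11, 12, 14]
    print(intersect_with_skips(list_a, list_b))
```

On this example, List A has 11 postings and a stride of 3, so after matching 3 it follows two skips from 4 to 7 to 10 instead of comparing every posting in between.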

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Query: "ETH Zurich"

Look up the single index term: ETH Zurich

Query: "Help ETH Zurich to flexibly react"

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Inconvenient

Query: "Help ETH Zurich to flexibly react"

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

All five biwords occur in the document, but the phrase does not: false positive
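The false positive can be reproduced with a small biword index. This is a sketch assuming whitespace tokenization and lowercasing; the document IDs and function names are illustrative.

```python
def biwords(text: str) -> list[str]:
    """All pairs of consecutive words, e.g. 'eth zurich'."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def build_biword_index(documents: dict) -> dict:
    """Map each biword to the set of IDs of the documents containing it."""
    index = {}
    for doc_id, text in documents.items():
        for biword in biwords(text):
            index.setdefault(biword, set()).add(doc_id)
    return index


def phrase_query(index: dict, phrase: str) -> set:
    """AND together the postings of every biword in the phrase."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


documents = {
    1: "Help ETH Zurich to flexibly react to new challenges "
       "and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}

if __name__ == "__main__":
    index = build_biword_index(documents)
    # Document 2 is returned although it does not contain the phrase.
    print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```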

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

N: Help, ETH, Zurich, introduce, techniques, flexibly, react
X: to, so, as, to

Pattern: NX*N

Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
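Given the tags, the extended biwords follow mechanically. The sketch below hard-codes the tags shown on the slide, since the tagger itself is out of scope, and pairs each N with the next N, ignoring the X words in between (the pattern NX*N). The function name is illustrative.

```python
# Tags as given on the slide (assumed output of a part-of-speech tagger).
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]


def extended_biwords(tagged: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Pair each N word with the next N word, skipping any X words between."""
    kept = [word for word, tag in tagged if tag == "N"]
    return list(zip(kept, kept[1:]))


if __name__ == "__main__":
    for left, right in extended_biwords(tagged):
        print(left, right)
```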

172


Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index terms: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index terms: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative
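The effect of longer index terms can be checked with a sketch that generalizes biwords to sequences of k words. The function names `kwords` and `matches` are assumptions, and the check is done against a single document rather than a full index.

```python
def kwords(text: str, k: int) -> list[str]:
    """All sequences of k consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def matches(document: str, phrase: str, k: int) -> bool:
    """True if every k-word term of the phrase occurs in the document."""
    document_terms = set(kwords(document, k))
    return all(term in document_terms for term in kwords(phrase, k))


document = "Help ETH Zurich to introduce techniques to flexibly react"
query = "Help ETH Zurich to flexibly react"

if __name__ == "__main__":
    print("k=2:", matches(document, query, 2))  # false positive
    print("k=3:", matches(document, query, 3))  # true negative
```

With k = 3 the term "zurich to flexibly" is missing from the document, so the query no longer matches.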

176

Inconvenient

Size of the vocabulary increases exponentially: (#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting = document ID (as before), term frequency, positions

Index

Help → C, 1: 1

ETH → C, 1: 2

Zurich → C, 1: 3

to → C, 3: 4, 7, 11

flexibly → C, 1: 5

react → C, 1: 6

190

Positional index

Query: "ETH Zurich"

ETH → C, 1: 2

Zurich → C, 1: 3

Position of Zurich = position of ETH + 1, so document C matches the phrase
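A minimal sketch of a positional index and of the +1 check for phrase queries, assuming whitespace tokenization, lowercasing and positions counted from 1. The second document D and the function names are illustrative assumptions.

```python
from collections import defaultdict


def build_positional_index(documents: dict) -> dict:
    """term -> {document ID -> list of positions (1-based)}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word][doc_id].append(position)
    return index


def phrase_query(index: dict, phrase: str) -> set:
    """Documents in which the phrase words occur at consecutive positions."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return set()
    hits = set()
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # The word at offset n must appear at position start + n.
            if all(start + offset in index[word].get(doc_id, [])
                   for offset, word in enumerate(words)):
                hits.add(doc_id)
                break
    return hits


documents = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}

if __name__ == "__main__":
    index = build_positional_index(documents)
    print(index["to"]["C"])  # positions of "to" in document C
    print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

The term frequency is the length of the position list. Unlike the biword index, the positional index returns no false positive for document D.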

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to


56

Building an inverted index

Document 1: You come most carefully upon your hour

Document 2: Take thy fair hour Laertes time be thine

Document 3: My hour is almost come

Document 4: Possess it merely That it should come to this

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt
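A minimal sketch of why throwing away punctuation is not trivial, assuming the naive rule "delete every ASCII punctuation character". The function name `strip_punctuation` and the punctuated example inputs are illustrative assumptions, not part of the slides.

```python
import string

def strip_punctuation(text: str) -> str:
    """Naive rule (an assumption for illustration): delete all ASCII punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))

# An e-mail address, a contraction and a name all lose information.
for example in ["name@example.com", "isn't", "Jake O'Neill"]:
    print(repr(example), "->", repr(strip_punctuation(example)))
# 'name@example.com' -> 'nameexamplecom'
# "isn't" -> 'isnt'
# "Jake O'Neill" -> 'Jake ONeill'
```

The same rule also glues hyphenated compounds together, which is why real tokenizers treat such cases individually.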

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Chinese example sentence: the characters are not recoverable from the extracted text]

68

Hindi

मुझे इन्फ़र्मेशन रिट्रीवोल बहुत पसंद है
(मुझे: "to me"; है: 3rd person)

मैं इस टर्म अच्छा पढ़ रहा हूँ
(male speaker: रहा, raha; हूँ: 1st person)

मैं इस टर्म अच्छा पढ़ रही हूँ
(female speaker: रही, rahee)

Oblique case

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morpheme by morpheme:

eat
go to
want
Past
<Frustration>
inferential evidential (turns out)
3rd, 3rd
also

Also, it turns out she/he wanted to go eat it, but

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class (same sequences)
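The difference between tokens and types can be sketched on the four example documents. This assumes whitespace tokenization and case-sensitive comparison; the names `documents`, `tokenize`, `tokens` and `types` are illustrative.

```python
from collections import Counter

documents = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}

def tokenize(text: str) -> list[str]:
    """Whitespace tokenization (a simplifying assumption)."""
    return text.split()

# Every occurrence is a token; identical character sequences form one type.
tokens = [tok for text in documents.values() for tok in tokenize(text)]
types = Counter(tokens)

print(len(tokens), "tokens,", len(types), "types")  # 29 tokens, 24 types
print(types["hour"], types["come"], types["it"])    # 3 3 2
```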

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters's list

a an and are as at be by for from has he in is it its of on that the to was were will with
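A minimal sketch of stop-word removal with the 25-word list above, assuming case-insensitive matching. The names `REUTERS_STOP_WORDS` and `remove_stop_words` are illustrative.

```python
REUTERS_STOP_WORDS = frozenset(
    "a an and are as at be by for from has he in is it its "
    "of on that the to was were will with".split()
)

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens whose lowercased form is in the stop list."""
    return [tok for tok in tokens if tok.lower() not in REUTERS_STOP_WORDS]

tokens = "Possess it merely That it should come to this".split()
print(remove_stop_words(tokens))
# ['Possess', 'merely', 'should', 'come', 'this']
```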

86

Trend

Large list (200-300) → no list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. → USA

to-day → today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows → Windows

windows → windows, Windows, window

window → window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings lists of Lift and Elevator are expanded into each other, and the query Lift is run as is.

Upon querying: the postings lists are left unchanged, and the query Lift is expanded to Lift OR Elevator.

[Figure: postings lists of Lift and Elevator before and after expansion]

94

Normalization: accents and diacritics

cliché → cliche

Zürich → Zuerich

95

Normalization: lowercasing

CAT → cat

Schweiz → schweiz
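A minimal sketch of these normalization rules, using Python's standard `unicodedata` module. The function names and the optional German transliteration table (ü → ue, and so on) are illustrative assumptions. Plain diacritic stripping maps Zürich to zurich; the language-specific table is needed to obtain zuerich.

```python
import unicodedata

# Assumed German transliteration rules, applied before generic stripping.
GERMAN = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def strip_diacritics(term: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(term: str, german: bool = False) -> str:
    """Lowercase, optionally transliterate German umlauts, strip diacritics."""
    term = term.lower()
    if german:
        term = term.translate(GERMAN)
    return strip_diacritics(term)

print(normalize("CAT"))                  # cat
print(normalize("cliché"))               # cliche
print(normalize("Zürich"))               # zurich
print(normalize("Zürich", german=True))  # zuerich
```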

96

Normalization: truecasing

The company Apple has launched a new iProduct → Apple

Apples fall from trees → apples

Machine learning inside

97

Stemming

analysis → analysi
feature → featur
are → ar
easily → easili
visible → visibl
variations → variat
individual → individu

Chop letters of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer

(m>0) ENCI    -> ENCE    valenci        -> valence
(m>0) ANCI    -> ANCE    hesitanci      -> hesitance
(m>0) IZER    -> IZE     digitizer      -> digitize
(m>0) ABLI    -> ABLE    conformabli    -> conformable
(m>0) ALLI    -> AL      radicalli      -> radical
(m>0) ENTLI   -> ENT     differentli    -> different
(m>0) ELI     -> E       vileli         -> vile
(m>0) OUSLI   -> OUS     analogousli    -> analogous
(m>0) IZATION -> IZE     vietnamization -> vietnamize
(m>0) ATION   -> ATE     predication    -> predicate
(m>0) ATOR    -> ATE     operator       -> operate
(m>0) ALISM   -> AL      feudalism      -> feudal
(m>0) IVENESS -> IVE     decisiveness   -> decisive
(m>0) FULNESS -> FUL     hopefulness    -> hopeful
(m>0) OUSNESS -> OUS     callousness    -> callous
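The rules above can be sketched in a few lines. This is only the one step of the Porter stemmer listed here, not the full algorithm. It assumes lowercase input; the measure m counts vowel-consonant sequences in the stem, following Porter's definition. The function names are illustrative.

```python
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]

def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: 'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem: str) -> int:
    """m = number of vowel-to-consonant transitions, i.e. VC groups."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def apply_rules(word: str) -> str:
    """Apply the first matching suffix rule if the remaining stem has m > 0."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

print(apply_rules("valenci"), apply_rules("vietnamization"), apply_rules("ration"))
# valence vietnamize ration
```

The last call shows the (m>0) condition at work: the stem "r" has measure 0, so "ration" is left unchanged.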

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois Triangle

Levels, from bottom to top (on both sides): Words, Syntactic Structure, Semantic Structure, Interlingua

Paths between the two sides: Direct, Syntactic Transfer, Semantic Transfer

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages

that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term        Document frequency   Postings
you         1                    1
come        3                    1 3 4
most        1                    1
carefully   1                    1
upon        1                    1
your        1                    1
thine       1                    2
be          1                    2
time        1                    2
Laertes     1                    2
thy         1                    2
fair        1                    2
take        1                    2
my          1                    3
hour        3                    1 2 3
is          1                    3
almost      1                    3
possess     1                    4
merely      1                    4
that        1                    4
it          1                    4
should      1                    4
to          1                    4
this        1                    4
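A minimal sketch of how such postings lists can be built from the four example documents, assuming whitespace tokenization and lowercasing. The names `documents` and `build_index` are illustrative.

```python
from collections import defaultdict

documents = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}

def build_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # A set removes duplicates: "it" occurs twice in document 4.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)

index = build_index(documents)
print(index["come"], index["hour"], index["it"])
# [1, 3, 4] [1, 2, 3] [4]
```

The document frequency of a term is simply the length of its postings list.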

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
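The intersection of two sorted postings lists can be sketched as a linear merge with one pointer per list. The function name `intersect` is illustrative.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted postings lists in O(len(a) + len(b))."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return result

print(intersect([1, 2, 4, 5, 8, 9, 10, 12], [1, 3, 4, 6, 7, 8, 11, 12]))
# [1, 4, 8, 12]
```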

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Magical pointer: after matching 3, jump directly to 10 in List A

Intersection of A and B so far: 1 3 10

127

In practice

List: 1 2 3 4 5 6 7 8 9 10 12, with skip pointers to 4, 7 and 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skips, where p is the number of postings
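A minimal sketch of intersection with skip pointers. It assumes the pointers are not stored explicitly but sit at every position that is a multiple of the skip span, with a span of about √p for a list of p postings. The names `advance` and `intersect_with_skips` are illustrative.

```python
import math

def advance(postings: list[int], pos: int, target: int, span: int) -> int:
    """Move pos forward, following skip pointers while they do not overshoot target."""
    has_skip = pos % span == 0
    if has_skip and pos + span < len(postings) and postings[pos + span] <= target:
        while pos + span < len(postings) and postings[pos + span] <= target:
            pos += span
        return pos
    return pos + 1

def intersect_with_skips(a: list[int], b: list[int]) -> list[int]:
    """Merge-intersect two sorted lists, skipping ahead where it is safe."""
    span_a = max(1, math.isqrt(len(a)))
    span_b = max(1, math.isqrt(len(b)))
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, b[j], span_a)
        else:
            j = advance(b, j, a[i], span_b)
    return result

print(intersect_with_skips([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14], [1, 3, 10, 11, 12]))
# [1, 3, 10, 12]
```

On this input the pointer in the first list jumps from 4 to 7 to 10 without visiting the postings in between. A skip is only followed when its target does not exceed the value being searched for, so no match can be missed.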

133

Phrase search

134

Phrase queries

"ETH Zürich"

Quotes = phrase search

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Query: "ETH Zurich"

Query: "Help ETH Zurich to flexibly react"

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Inconvenient

Query: "Help ETH Zurich to flexibly react"

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
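The false positive can be reproduced with a small sketch, assuming whitespace tokenization. The helper names `ngrams` and `matches` are illustrative; n = 2 corresponds to a biword index.

```python
def ngrams(text: str, n: int = 2) -> list[str]:
    """All sequences of n consecutive words (n = 2 gives biwords)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def matches(query: str, document: str, n: int = 2) -> bool:
    """True if every n-gram of the query occurs somewhere in the document."""
    doc_grams = set(ngrams(document, n))
    return all(gram in doc_grams for gram in ngrams(query, n))

query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(matches(query, document))  # True, although the phrase is not in the document
print(query in document)         # False
```

Every biword of the query occurs in the document, but not contiguously, so the conjunction of biwords matches a document that does not contain the phrase.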

166

Part-of-speech tagging

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

NX*N

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react


173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Inconvenient

Size of the vocabulary increases exponentially: (#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Positional posting: document ID (as before), term frequency, positions

Help → C, 1: 1

ETH → C, 1: 2

Zurich → C, 1: 3

to → C, 3: 4 7 11

flexibly → C, 1: 5

react → C, 1: 6

190

Positional index

Query: "ETH Zurich"

ETH → C, 1: 2

Zurich → C, 1: 3

Match: the position of Zurich is the position of ETH + 1
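A minimal sketch of a positional index and of the "+1" check for phrase queries. It assumes whitespace tokenization and positions counted from 1. The names `build_positional_index` and `phrase_query` and the document ID "D" are illustrative.

```python
from collections import defaultdict

def build_positional_index(docs: dict[str, str]) -> dict[str, dict[str, list[int]]]:
    """term -> {document ID -> list of positions (starting at 1)}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index

def phrase_query(index, phrase: str) -> list[str]:
    """Documents where the k-th query term occurs at start position + k."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return []
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        starts = index[terms[0]][doc_id]
        if any(all(s + k in index[t][doc_id] for k, t in enumerate(terms)) for s in starts):
            hits.append(doc_id)
    return hits

docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)
print(index["to"]["C"])                                          # [4, 7, 11]
print(phrase_query(index, "ETH Zurich"))                         # ['C', 'D']
print(phrase_query(index, "Help ETH Zurich to flexibly react"))  # ['C']
```

Unlike a biword index, the positional check rejects document D for the long phrase, because its terms do not occur at consecutive positions.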

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this faster?

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips: many comparisons, waste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips: few comparisons, not many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

In practice: √p evenly spaced skips, where p is the number of postings
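
A possible sketch of intersection with skip pointers, assuming roughly √p evenly spaced skips per list. The helper names `build_skips` and `intersect_with_skips` and the dictionary representation of the pointers are assumptions made for this example; real indexes store skips inside the postings structure.

```python
import math


def build_skips(postings):
    """Return {index: target index} with a skip span of about sqrt(p)."""
    p = len(postings)
    step = int(math.sqrt(p))
    if step < 2:
        return {}
    return {i: i + step for i in range(0, p - step, step)}


def intersect_with_skips(a, b):
    """Two-pointer intersection that follows a skip when it cannot overshoot."""
    skips_a, skips_b = build_skips(a), build_skips(b)
    result = []
    i, j = 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Skip only if the target is still <= b[j]; everything jumped
            # over is then strictly smaller than b[j] and cannot match.
            if i in skips_a and a[skips_a[i]] <= b[j]:
                i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                j = skips_b[j]
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12]
print(build_skips(list_a))                   # {0: 3, 3: 6, 6: 9}
print(intersect_with_skips(list_a, list_b))  # [1, 3, 10, 12]
```

With 11 postings the skip span is 3, so the skip targets in list A are the postings 4, 7 and 10.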

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Document: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (one term per pair of consecutive words):

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Query: "ETH Zurich"

The query is itself a biword, so it is looked up directly in the index.

Query: "Help ETH Zurich to flexibly react"

Intersect: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

157

Inconvenient

Query: "Help ETH Zurich to flexibly react"

Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains all five biwords but not the phrase itself: false positive
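
The false positive can be reproduced with a small biword index. This is a sketch under simple assumptions (lowercasing, whitespace tokenization, hypothetical document IDs 1 and 2); the function names are chosen for the example.

```python
from collections import defaultdict


def biwords(text):
    """Return the pairs of consecutive words in text, each as one term."""
    words = text.lower().split()
    return [f"{w1} {w2}" for w1, w2 in zip(words, words[1:])]


def build_biword_index(docs):
    """Map each biword to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in biwords(text):
            index[term].add(doc_id)
    return index


def phrase_query(index, phrase):
    """Answer a phrase query as the AND of all its biwords."""
    postings = [index.get(term, set()) for term in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
hits = phrase_query(index, "Help ETH Zurich to flexibly react")
print(sorted(hits))  # [1, 2]; document 2 does not contain the phrase
```

Document 2 is returned because each of the five biwords occurs somewhere in it, even though they are not contiguous.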

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern NXN: two N words, possibly separated by X words, form one biword

Resulting biwords:

Help ETH
ETH Zurich
Zurich introduce
introduce techniques
techniques flexibly
flexibly react

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index:

Help ETH Zurich
ETH Zurich to
Zurich to flexibly
to flexibly react
flexibly react to

Intersect: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index:

Help ETH Zurich to
ETH Zurich to flexibly
Zurich to flexibly react
to flexibly react to

Intersect: Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

Rewritten as: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Inconvenient

Size of the vocabulary increases exponentially with the phrase length n: (#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting = document ID (as before), term frequency, positions

Index:

Help -> C, 1: 1
ETH -> C, 1: 2
Zurich -> C, 1: 3
to -> C, 3: 4 7 11
flexibly -> C, 1: 5
react -> C, 1: 6

Query: "ETH Zurich"

ETH -> C, 1: 2
Zurich -> C, 1: 3

Match: in document C, Zurich occurs at the position of ETH + 1
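
A minimal sketch of a positional index and of the "+1" check used for phrase queries. It assumes whitespace tokenization, lowercasing and positions counted from 1; document D and the function names are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Return the documents where the query words occur at consecutive positions."""
    words = phrase.lower().split()
    if not words:
        return set()
    # Candidate start positions per document, taken from the first word.
    candidates = {d: set(ps) for d, ps in index.get(words[0], {}).items()}
    for offset, word in enumerate(words[1:], start=1):
        postings = index.get(word, {})
        remaining = {}
        for doc_id, starts in candidates.items():
            positions = set(postings.get(doc_id, []))
            # Keep a start only if this word sits exactly `offset` positions later.
            kept = {s for s in starts if s + offset in positions}
            if kept:
                remaining[doc_id] = kept
        candidates = remaining
    return set(candidates)


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)
print(index["to"]["C"])                                                   # [4, 7, 11]
print(sorted(phrase_query(index, "ETH Zurich")))                          # ['C', 'D']
print(sorted(phrase_query(index, "Help ETH Zurich to flexibly react")))   # ['C']
```

The term frequency of a posting is simply the length of its position list, and the document that was a false positive for the biword index is no longer returned.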

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

58

Building an inverted index

This is not trivial

Throw away punctuation

name@example.com

isn't

Jake O'Neill

the learn-it-all-by-heart methodology

website vs. web site

Kanton Basel Stadt

59

Corner cases

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Credits: H. Schütze, LMU

60

Corner cases numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Example sentence in Chinese script]

68

Hindi

मुझे इन्फ़र्मेशन रिट्रीवोल बहुत पसंद है

मैं इस टर्म अच्छा पढ़ रहा हूँ

मैं इस टर्म अच्छा पढ़ रही हूँ

69

Hindi

मुझे इन्फ़र्मेशन रिट्रीवोल बहुत पसंद है
मुझे: "to me" (oblique case); है: 3rd person

मैं इस टर्म अच्छा पढ़ रहा हूँ
मैं: "I"; रहा (raha): male speaker; हूँ: 1st person

मैं इस टर्म अच्छा पढ़ रही हूँ
रही (rahee): female speaker

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morpheme by morpheme: eat, go to, want, Past, <Frustration>, inferential/evidential (turns out), 3rd/3rd, also

Also, it turns out she/he wanted to go eat it, but ...

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class (same sequences)
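
The difference between tokens and types can be made concrete with a short sketch. The tokenization rule used here (runs of ASCII letters and digits) is a deliberately naive assumption, and only three of the slide's example lines are used.

```python
import re


def tokenize(text):
    """Naive tokenizer: every run of letters or digits is one token."""
    return re.findall(r"[A-Za-z0-9]+", text)


texts = [
    "You come most carefully upon your hour",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

tokens = [token for text in texts for token in tokenize(text)]
types = set(tokens)  # identical character sequences collapse into one type

print(len(tokens))  # 21
print(len(types))   # 17
```

The words "come", "hour" and "it" occur more than once as tokens but count once each as types.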

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters's list

a an and are as at be by for from has he in

is it its of on that the to was were will with
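
Filtering with such a list can be sketched in a few lines. The name `remove_stop_words` and the choice to compare lowercased tokens are assumptions made for this example.

```python
# The 25 stop words listed on the slide.
STOP_WORDS = set(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)


def remove_stop_words(tokens):
    """Drop every token whose lowercased form is in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = "Possess it merely That it should come to this".split()
print(remove_stop_words(tokens))
# ['Possess', 'merely', 'should', 'come', 'this']
```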

86

Trend

Large list (200-300) -> No list at all

87

Equivalence classes of terms (types)

U.S.A., USA

to-day, today

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings lists of Lift and Elevator are expanded with each other's documents, so the unchanged query Lift also returns the Elevator documents

Upon querying: the index is left as is, and the query Lift is expanded to Lift OR Elevator

[Figure: postings lists of Lift and Elevator before and after expansion]
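
The two options can be compared with a small sketch. The synonym table, the toy documents and their IDs are hypothetical; the point is only that both strategies return the same result for the query Lift, at different costs (a larger index versus a longer query).

```python
# Hypothetical synonym table for the example.
SYNONYMS = {
    "lift": {"lift", "elevator"},
    "elevator": {"lift", "elevator"},
}


def build_index(docs, expand=False):
    """Build term -> set of doc IDs; optionally expand terms at indexing time."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            terms = SYNONYMS.get(word, {word}) if expand else {word}
            for term in terms:
                index.setdefault(term, set()).add(doc_id)
    return index


def query(index, term, expand=False):
    """Look up one term; optionally expand it to an OR over its synonyms."""
    terms = SYNONYMS.get(term, {term}) if expand else {term}
    result = set()
    for t in terms:
        result |= index.get(t, set())
    return result


docs = {
    1: "the lift is broken",
    4: "take the elevator",
    5: "lift maintenance",
    6: "elevator music",
}

plain_index = build_index(docs)
expanded_index = build_index(docs, expand=True)

print(sorted(query(plain_index, "lift")))               # [1, 5]
print(sorted(query(expanded_index, "lift")))            # [1, 4, 5, 6]
print(sorted(query(plain_index, "lift", expand=True)))  # [1, 4, 5, 6]
```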

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz
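
Both normalization steps can be sketched with the Python standard library. Note the assumption: removing combining marks turns "Zürich" into "Zurich", whereas the German-specific rewrite to "Zuerich" would need an extra, language-dependent rule that is not implemented here.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(token):
    """Remove accents and diacritics, then lowercase."""
    return strip_diacritics(token).lower()


print(normalize("cliché"))   # cliche
print(normalize("CAT"))      # cat
print(normalize("Schweiz"))  # schweiz
print(normalize("Zürich"))   # zurich
```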

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis feature are easily visible variations individual

analysi featur ar easili visibl variat individu

Chop letters of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE        valenci -> valence
(m>0) ANCI -> ANCE        hesitanci -> hesitance
(m>0) IZER -> IZE         digitizer -> digitize
(m>0) ABLI -> ABLE        conformabli -> conformable
(m>0) ALLI -> AL          radicalli -> radical
(m>0) ENTLI -> ENT        differentli -> different
(m>0) ELI -> E            vileli -> vile
(m>0) OUSLI -> OUS        analogousli -> analogous
(m>0) IZATION -> IZE      vietnamization -> vietnamize
(m>0) ATION -> ATE        predication -> predicate
(m>0) ATOR -> ATE         operator -> operate
(m>0) ALISM -> AL         feudalism -> feudal
(m>0) IVENESS -> IVE      decisiveness -> decisive
(m>0) FULNESS -> FUL      hopefulness -> hopeful
(m>0) OUSNESS -> OUS      callousness -> callous
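
How such rules fire can be sketched as follows. This covers only the fifteen rules listed on the slide, not the full Porter stemmer; the function names are chosen for the example, and m is computed as the number of vowel-consonant sequences in the stem that remains after removing the suffix.

```python
import re


def cv_form(stem):
    """Map each letter to 'v' (vowel) or 'c' (consonant); y after a consonant is a vowel."""
    form = []
    for ch in stem:
        if ch in "aeiou":
            form.append("v")
        elif ch == "y" and form and form[-1] == "c":
            form.append("v")
        else:
            form.append("c")
    return "".join(form)


def measure(stem):
    """Porter's m: the number of VC sequences when the stem is read as [C](VC)^m[V]."""
    return len(re.findall(r"v+c+", cv_form(stem)))


RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def apply_rules(word):
    """Apply the rule with the longest matching suffix, if its stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


print(apply_rules("valenci"))         # valence
print(apply_rules("vietnamization"))  # vietnamize
print(apply_rules("hopefulness"))     # hopeful
```

Trying the longest suffix first matters for words such as "vietnamization", which ends in both IZATION and ATION.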

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

Tokenize

You come most carefully upon your hour
Take thy fair hour Laertes time be thine
My hour is almost come
Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class of tokens (same sequence of characters)
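The difference between tokens and types can be checked on the four lines above. This is a minimal sketch that assumes whitespace tokenization and case-sensitive comparison; the names `LINES`, `tokenize`, `tokens` and `types` are illustrative and not part of the lecture.

```python
from collections import Counter

# The four example lines, one per document.
LINES = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]


def tokenize(text: str) -> list[str]:
    """Split on whitespace: every grouped sequence of characters is one token."""
    return text.split()


# Tokens keep duplicates; types are the distinct character sequences.
tokens = [tok for line in LINES for tok in tokenize(line)]
types = Counter(tokens)

print(len(tokens), "tokens,", len(types), "types")
print("come:", types["come"], "hour:", types["hour"], "it:", types["it"])
```

Words such as come and hour occur several times as tokens but count only once as a type.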

Stop words

Reuters's list

a an and are as at be by for from has he in is it its of on that the to was were will with
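Filtering with such a list might look as follows. This is only a sketch: it assumes tokens are compared in lowercase, and `STOP_WORDS` and `remove_stop_words` are hypothetical names.

```python
# The 25 stop words of the list above.
STOP_WORDS = frozenset(
    "a an and are as at be by for from has he in is it its of on "
    "that the to was were will with".split()
)


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop every token whose lowercased form is on the stop list."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]


line = "Possess it merely That it should come to this"
filtered = remove_stop_words(line.split())
print(filtered)
```

Note that a phrase query made mostly of stop words would lose most of its terms under this filter.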

Trend

From a large list (200-300) to no list at all

Equivalence classes of terms (types)

to-day, today
U.S.A., USA
window, windows, Windows
Zurich, Zürich, Zuerich, Züri

Normalization

U.S.A. -> USA
to-day -> today

Rules that remove characters are the easy part
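A character-removal rule is indeed short to write. The sketch below assumes that only periods and hyphens are removed; the function name `normalize` and the input spelling U.S.A. are illustrative assumptions.

```python
def normalize(token: str) -> str:
    """Remove periods and hyphens so that spelling variants map to the same term."""
    return token.replace(".", "").replace("-", "")


print(normalize("U.S.A."))  # same term as "USA"
print(normalize("to-day"))  # same term as "today"
```

Rules that must add or change characters, or that depend on context, are harder to get right than this.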

Expansion (rather than equivalence classes)

Windows -> Windows
windows -> windows, Windows, window
window -> window, windows

Introduces asymmetry

When to expand

Upon indexing: the postings lists of Lift and Elevator are expanded, so that each one also covers the documents of the other term. The query Lift is left as it is.

Upon querying: the index is left as it is and the query is expanded: Lift -> Lift OR Elevator.

(Figure: postings lists of Lift and Elevator before and after expansion.)
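The two options can be compared in a small sketch. The postings lists below are hypothetical document IDs chosen for illustration, and `SYNONYMS`, `expand_index` and `query_with_expansion` are assumed names, not part of the lecture.

```python
# Hypothetical postings lists (document IDs chosen for illustration only).
index = {"lift": [1, 5, 41], "elevator": [6, 41]}
SYNONYMS = {"lift": ["elevator"], "elevator": ["lift"]}


def union(a: list[int], b: list[int]) -> list[int]:
    """Sorted union of two postings lists."""
    return sorted(set(a) | set(b))


def expand_index(index: dict, synonyms: dict) -> dict:
    """Expansion upon indexing: each postings list absorbs its synonyms' postings."""
    expanded = {}
    for term, postings in index.items():
        merged = list(postings)
        for syn in synonyms.get(term, []):
            merged = union(merged, index.get(syn, []))
        expanded[term] = merged
    return expanded


def query_with_expansion(index: dict, synonyms: dict, term: str) -> list[int]:
    """Expansion upon querying: evaluate `term OR synonym OR ...` on the plain index."""
    result = list(index.get(term, []))
    for syn in synonyms.get(term, []):
        result = union(result, index.get(syn, []))
    return result


at_index_time = expand_index(index, SYNONYMS)["lift"]
at_query_time = query_with_expansion(index, SYNONYMS, "lift")
print(at_index_time, at_query_time)
```

Both variants return the same documents. Expanding upon indexing stores longer postings lists, while expanding upon querying does the extra union work for every query.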

Normalization: accents and diacritics

cliché -> cliche
Zürich -> Zuerich
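One possible way to implement both mappings with the Python standard library is sketched below. Generic diacritic stripping turns ü into u, so the Zuerich spelling needs a language-specific table; `strip_diacritics` and `GERMAN_UMLAUTS` are illustrative names, and the table covers only the lowercase umlauts as an assumption.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Language-specific transliteration (German convention): a-umlaut, o-umlaut, u-umlaut.
GERMAN_UMLAUTS = str.maketrans({"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue"})

cliche = "clich\u00e9"   # cliche with acute accent on the e
zurich = "Z\u00fcrich"   # Zurich with umlaut on the u

print(strip_diacritics(cliche))
print(strip_diacritics(zurich))
print(zurich.translate(GERMAN_UMLAUTS))
```

Which of the two results for Zürich is preferable depends on how users of the collection are expected to type their queries.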

Normalization: lowercasing

CAT -> cat
Schweiz -> schweiz

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple
Apples fall from trees -> apples

Machine learning inside

Stemming

Chop letters off the end of the word

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE         valenci -> valence
(m>0) ANCI -> ANCE         hesitanci -> hesitance
(m>0) IZER -> IZE          digitizer -> digitize
(m>0) ABLI -> ABLE         conformabli -> conformable
(m>0) ALLI -> AL           radicalli -> radical
(m>0) ENTLI -> ENT         differentli -> different
(m>0) ELI -> E             vileli -> vile
(m>0) OUSLI -> OUS         analogousli -> analogous
(m>0) IZATION -> IZE       vietnamization -> vietnamize
(m>0) ATION -> ATE         predication -> predicate
(m>0) ATOR -> ATE          operator -> operate
(m>0) ALISM -> AL          feudalism -> feudal
(m>0) IVENESS -> IVE       decisiveness -> decisive
(m>0) FULNESS -> FUL       hopefulness -> hopeful
(m>0) OUSNESS -> OUS       callousness -> callous
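The rules above can be tried out with a short sketch. It covers only this one group of rules, not the full Porter algorithm, and it assumes that the longest matching suffix decides and that m counts the vowel-consonant groups of the stem; all function names are illustrative.

```python
VOWELS = set("aeiou")

# (suffix, replacement) pairs of the rules listed above; all require m > 0.
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]
# Try longer suffixes first, so that IZATION wins over ATION.
RULES.sort(key=lambda rule: len(rule[0]), reverse=True)


def is_consonant(word: str, i: int) -> bool:
    """A 'y' counts as a consonant only at the start or after a vowel."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """m = number of vowel-to-consonant transitions in the stem."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1
        prev_is_vowel = not cons
    return m


def apply_rules(word: str) -> str:
    """Apply the first (longest) matching rule if the remaining stem has m > 0."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for example in ["valenci", "vietnamization", "hopefulness", "operator"]:
    print(example, "->", apply_rules(example))
```

Several inputs end in i rather than y because an earlier step of the stemmer has already rewritten a final y.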

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

The Vauquois Triangle

(Figure: on each side of the triangle, the levels Words, Syntactic Structure and Semantic Structure, with Interlingua at the top; the two sides are connected by Direct, Syntactic Transfer and Semantic Transfer.)

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

Lemmatization or Stemming

Lemmatization does not help, and can even degrade performance, for English documents.

Lemmatization can help with languages that have more morphology.

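Skip lists

Reminder: Postings (Standard Inverted Index)

term (document frequency) -> postings

you (1) -> 1
come (3) -> 1, 3, 4
most (1) -> 1
carefully (1) -> 1
upon (1) -> 1
your (1) -> 1
thine (1) -> 2
be (1) -> 2
time (1) -> 2
Laertes (1) -> 2
thy (1) -> 2
fair (1) -> 2
take (1) -> 2
my (1) -> 3
hour (3) -> 1, 2, 3
is (1) -> 3
almost (1) -> 3
possess (1) -> 4
merely (1) -> 4
that (1) -> 4
it (1) -> 4
should (1) -> 4
to (1) -> 4
this (1) -> 4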
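Such an index can be rebuilt from the four example lines. The sketch assumes the lines are documents 1 to 4 in the order shown earlier in the lecture and that terms are lowercased; `DOCS` and `build_index` are illustrative names.

```python
from collections import defaultdict

# Assumption: the four example lines are documents 1 to 4.
DOCS = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each lowercased term to the sorted list of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


index = build_index(DOCS)
print(len(index), "terms")
print("come:", index["come"], "hour:", index["hour"])
```

The document frequency of a term is simply the length of its postings list, for example `len(index["come"])`.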

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12
List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
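As a reminder of how the two lists are walked in parallel, here is a minimal sketch of the merge-style intersection of two sorted postings lists; the function name `intersect` is illustrative.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted postings lists with one pointer per list."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document in both lists: keep it and advance both pointers.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

The loop ends as soon as one list is exhausted, so the number of steps is at most the sum of the two list lengths.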

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12
List B: 1 3 10 11 12 14

Intersection of A and B: 1 3 10 12

How could we make this faster?

After 1 and 3 have matched, List B moves on to 10. A magical pointer would let List A jump straight to 10 instead of stepping through every entry in between.

In practice

1 2 3 4 5 6 7 8 9 10 12
Skip pointers to 4, 7 and 10

Short skips: many comparisons, waste of space

Long skips: few comparisons, not many real opportunities to skip

In practice: √p skips, where p is the number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
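The following sketch shows one way to intersect with skip pointers. It assumes a skip pointer on every k-th entry with k the integer square root of the list length, and it represents the pointers implicitly by index arithmetic rather than as stored pointers; the function name is illustrative.

```python
import math


def intersect_with_skips(a: list[int], b: list[int]) -> list[int]:
    """Merge-intersect two sorted lists, following skip pointers when they do not overshoot."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            target = i + skip_a
            # Entries at multiples of skip_a carry a skip pointer.
            if i % skip_a == 0 and target < len(a) and a[target] <= b[j]:
                i = target
            else:
                i += 1
        else:
            target = j + skip_b
            if j % skip_b == 0 and target < len(b) and b[target] <= a[i]:
                j = target
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print(intersect_with_skips(list_a, list_b))
```

With 11 postings in List A, the skip distance is 3, so the skip pointers lead to 4, 7 and 10. A skip is only taken when its target is not larger than the current entry of the other list, so no common document can be jumped over.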

Phrase search

Phrase queries

"ETH Zürich"
Quotes = phrase search

With a posting list

ETH: 1 2 4 5 8 9 10
Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

Phrase search approaches

Biword indices
Positional indices

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to, ...

Bi-word indices

Query: "ETH Zurich" is looked up directly as the biword ETH Zurich.

Query: "Help ETH Zurich to flexibly react" is split into biwords, and their postings lists are intersected:

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Inconvenient

Query: Help ETH Zurich to flexibly react
Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

Every biword of the query occurs in the document, although the phrase itself does not: false positive
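The false positive can be reproduced in a few lines. This is a sketch with illustrative names (`biwords`, `matches_biwords`, `contains_phrase`); the plain substring test stands in for a real phrase check and is only adequate for this example.

```python
def biwords(text: str) -> set[str]:
    """All pairs of consecutive words of the text."""
    words = text.split()
    return {" ".join(pair) for pair in zip(words, words[1:])}


doc = "Help ETH Zurich to introduce techniques to flexibly react"
query = "Help ETH Zurich to flexibly react"

# The biword query matches if every query biword occurs in the document.
matches_biwords = biwords(query) <= biwords(doc)
# Ground truth: does the document contain the phrase as a whole?
contains_phrase = query in doc

print(matches_biwords, contains_phrase)
```

The document is returned by the biword query although it does not contain the phrase.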

Part-of-speech tagging

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
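A sketch of how the extended biwords could be derived from the tags follows. The tag assigned to each word is an assumption read off the slide (X for to, so, as; N for everything else), and `TAGGED` and `extended_biwords` are illustrative names.

```python
# Assumed tagging of the example sentence: N words are kept, X words are skipped.
TAGGED = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]


def extended_biwords(tagged: list[tuple[str, str]]) -> list[str]:
    """Pair each N word with the next N word, skipping the X words in between."""
    content = [word for word, tag in tagged if tag == "N"]
    return [f"{first} {second}" for first, second in zip(content, content[1:])]


result = extended_biwords(TAGGED)
print(result)
```

Because only the two tags N and X occur, dropping the X words and pairing neighbours is equivalent to matching N, any number of X, N.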


Phrase indices (3)

Query: Help ETH Zurich to flexibly react
Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to, ...
Intersect: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Phrase indices (4)

Query: Help ETH Zurich to flexibly react
Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to, ...
Intersect: Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Improves on false positive issue

Query: Help ETH Zurich to flexibly react
Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative
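The effect of the phrase length can be checked with a generalised sketch; `ngrams` and `matches` are illustrative names and n is the number of words per indexed phrase.

```python
def ngrams(text: str, n: int) -> set[str]:
    """All sequences of n consecutive words of the text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


doc = "Help ETH Zurich to introduce techniques to flexibly react"
query = "Help ETH Zurich to flexibly react"

# For each phrase length, does every query phrase occur in the document?
matches = {n: ngrams(query, n) <= ngrams(doc, n) for n in (2, 3, 4)}
print(matches)
```

With two-word phrases the document still matches; with three or four words the phrase Zurich to flexibly is missing from the document, so it is correctly rejected.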

Inconvenient

Size of the vocabulary increases exponentially: (#Terms)^n

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (term -> document ID, term frequency: positions)

Help -> C, 1: 1
ETH -> C, 1: 2
Zurich -> C, 1: 3
to -> C, 3: 4, 7, 11
flexibly -> C, 1: 5
react -> C, 1: 6
...

Positional posting = document ID (as before), term frequency, positions

Positional index

Query: "ETH Zurich"

ETH -> C, 1: 2
Zurich -> C, 1: 3

Match: in document C, the position of Zurich is the position of ETH +1
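A positional index and the +1 check can be sketched as follows. The sketch assumes positions are counted from 1 and that the index is a nested dictionary; `build_positional_index` and `phrase_query` are illustrative names, not a reference implementation.

```python
from collections import defaultdict

DOCS = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}


def build_positional_index(docs: dict[str, str]) -> dict[str, dict[str, list[int]]]:
    """term -> {document ID: [positions]}, with positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)


def phrase_query(index: dict, phrase: str) -> list[str]:
    """Documents in which the k-th query term occurs at start position + k."""
    terms = phrase.split()
    hits = []
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            if all(start + k in index.get(term, {}).get(doc_id, [])
                   for k, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


index = build_positional_index(DOCS)
print(index["to"]["C"])                   # positions of "to" in document C
print(phrase_query(index, "ETH Zurich"))  # Zurich directly follows ETH
print(phrase_query(index, "Zurich ETH"))  # wrong order, no match
```

The term frequency is the length of the position list, so it does not need to be stored separately in this representation.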

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

Corner cases: numbers

3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Credits: H. Schütze, LMU

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (each entry is a positional posting: document ID as before, term frequency, positions):

term      document ID  term frequency  positions
Help      C            1               1
ETH       C            1               2
Zurich    C            1               3
to        C            3               4 7 11
flexibly  C            1               5
react     C            1               6

190

Positional index

Query: "ETH Zurich"

term      document ID  term frequency  positions
ETH       C            1               2
Zurich    C            1               3

Both terms occur in document C, and the position of Zurich (3) is the position of ETH (2) + 1: the phrase matches.
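The "+1" check can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and 1-based positions as on the slide; the function names are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """term -> {document ID -> list of 1-based positions}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Return (document ID, start position) for every occurrence of the phrase."""
    words = phrase.split()
    if not words or any(word not in index for word in words):
        return []
    matches = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # The k-th word of the phrase must occur at position start + k.
            if all(start + k in index[word].get(doc_id, [])
                   for k, word in enumerate(words)):
                matches.append((doc_id, start))
    return matches


docs = {"C": "Help ETH Zurich to flexibly react to new challenges "
             "and to set new accents in the future"}
index = build_positional_index(docs)
print(index["to"]["C"])                   # positions of "to" in document C
print(phrase_query(index, "ETH Zurich"))  # ETH at 2, Zurich at 2 + 1
```

The term frequency shown on the slide is simply the length of the position list, so it does not need to be stored separately in this sketch.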

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

61

English

I would like a coffee
You would like a coffee
He would like a coffee
We would like a coffee
You would like a coffee
They would like a coffee

62

English

I want a coffee
You want a coffee
He wants a coffee
We want a coffee
You want a coffee
They want a coffee

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Example sentence in Chinese characters, not recoverable from the extracted text]

68

Hindi

मुझे इन्फ़ॉर्मेशन रिट्रीवल बहुत पसंद है
(To me: मुझे is in the oblique case; the verb है is 3rd person)

मैं इस टर्म अच्छा पढ़ रहा हूँ
(Male speaker: raha; the verb हूँ is 1st person)

मैं इस टर्म अच्छा पढ़ रही हूँ
(Female speaker: rahee)

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morphemes, in order: eat, go to, want, past, <frustration>, inferential evidential (turns out), 3rd person subject and 3rd person object, also

"Also, it turns out she/he wanted to go eat it, but..."

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

1  You come most carefully upon your hour
2  Take thy fair hour, Laertes; time be thine
3  My hour is almost come
4  Possess it merely. That it should come to this

Token = grouped sequence of characters

Type = equivalence class of tokens (same sequence of characters)

84

Stop words

Reuters's list

a an and are as at be by for from has he in
is it its of on that the to was were will with
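Tokens, types and stop-word removal on the four example documents can be sketched as below. This is an illustration under simplifying assumptions: tokens are maximal runs of ASCII letters, and they are lowercased before types are counted (which is already a normalization step). The names `tokenize` and `REUTERS_STOP_WORDS` are hypothetical.

```python
import re

# The 25 stop words of the list above.
REUTERS_STOP_WORDS = set(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)


def tokenize(text):
    """Lowercase the text and return its maximal runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


docs = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour, Laertes; time be thine",
    3: "My hour is almost come",
    4: "Possess it merely. That it should come to this",
}

tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
types = {token for doc_tokens in tokens.values() for token in doc_tokens}
content_tokens = {
    doc_id: [t for t in doc_tokens if t not in REUTERS_STOP_WORDS]
    for doc_id, doc_tokens in tokens.items()
}

print(sum(len(doc_tokens) for doc_tokens in tokens.values()), "tokens")
print(len(types), "types")
print(content_tokens[4])
```

The token count includes repetitions ("hour" occurs in three documents, "it" twice in document 4), whereas each type is counted once.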

86

Trend

Large list (200-300 terms) -> no list at all

87

Equivalence classes of terms (types)

to-day, today
U.S.A., USA
window, windows, Windows
Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA
to-day -> today

Rules that remove characters are the easy part.

89

Expansion (rather than equivalence classes)

Query term -> terms to look up

Windows -> Windows
windows -> windows, Windows, window
window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings list of each term is expanded with the postings of its equivalent terms (Lift, Elevator). The query "Lift" is then looked up as is.

Upon querying: the index is left unchanged and the query is expanded instead: Lift -> Lift OR Elevator.
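Both options can be sketched side by side. The postings below are hypothetical document IDs chosen for this illustration, and `EQUIVALENTS`, `expand_index` and `query_with_expansion` are made-up names.

```python
# Hypothetical postings lists (document IDs are made up for this sketch).
index = {"lift": [1, 5, 6], "elevator": [4]}

# Hypothetical expansion table: which terms each term expands to.
EQUIVALENTS = {"lift": ["elevator"], "elevator": ["lift"]}


def expand_index(index, equivalents):
    """Expansion upon indexing: merge the postings of equivalent terms."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for other in equivalents.get(term, []):
            merged.update(index.get(other, []))
        expanded[term] = sorted(merged)
    return expanded


def query_with_expansion(index, equivalents, term):
    """Expansion upon querying: evaluate term OR each of its equivalents."""
    result = set(index.get(term, []))
    for other in equivalents.get(term, []):
        result.update(index.get(other, []))
    return sorted(result)


at_index_time = expand_index(index, EQUIVALENTS)["lift"]
at_query_time = query_with_expansion(index, EQUIVALENTS, "lift")
print(at_index_time, at_query_time)
```

Both give the same answer here. The trade-off is that index-time expansion makes postings lists longer, while query-time expansion makes every query more expensive to evaluate.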

94

Normalization: accents and diacritics

cliché -> cliche
Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat
Schweiz -> schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple
Apples fall from trees -> apples

Machine learning inside
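Accent removal and lowercasing can be sketched with Python's standard `unicodedata` module. Note that plain diacritic stripping would turn Zürich into Zurich, so the sketch assumes an explicit transliteration table for German umlauts to obtain Zuerich; that table and the function names are assumptions of this illustration. Truecasing is not covered, since it needs a trained model.

```python
import unicodedata

# Assumed transliteration table for German; applied before generic stripping.
GERMAN_TRANSLITERATION = {
    "\u00e4": "ae",  # ä
    "\u00f6": "oe",  # ö
    "\u00fc": "ue",  # ü
    "\u00c4": "Ae",  # Ä
    "\u00d6": "Oe",  # Ö
    "\u00dc": "Ue",  # Ü
    "\u00df": "ss",  # ß
}


def strip_diacritics(token):
    """Transliterate German umlauts, then drop all remaining combining marks."""
    token = unicodedata.normalize("NFC", token)  # compose, so the table applies
    for char, replacement in GERMAN_TRANSLITERATION.items():
        token = token.replace(char, replacement)
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(token):
    """Strip diacritics, then lowercase."""
    return strip_diacritics(token).lower()


print(strip_diacritics("clich\u00e9"), strip_diacritics("Z\u00fcrich"))
print(normalize("CAT"), normalize("Schweiz"))
```

Lowercasing everything would also map the company Apple to apple, which is the loss of information that truecasing tries to avoid.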

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the end of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer

(m>0) ENCI    -> ENCE   valenci        -> valence
(m>0) ANCI    -> ANCE   hesitanci      -> hesitance
(m>0) IZER    -> IZE    digitizer      -> digitize
(m>0) ABLI    -> ABLE   conformabli    -> conformable
(m>0) ALLI    -> AL     radicalli      -> radical
(m>0) ENTLI   -> ENT    differentli    -> different
(m>0) ELI     -> E      vileli         -> vile
(m>0) OUSLI   -> OUS    analogousli    -> analogous
(m>0) IZATION -> IZE    vietnamization -> vietnamize
(m>0) ATION   -> ATE    predication    -> predicate
(m>0) ATOR    -> ATE    operator       -> operate
(m>0) ALISM   -> AL     feudalism      -> feudal
(m>0) IVENESS -> IVE    decisiveness   -> decisive
(m>0) FULNESS -> FUL    hopefulness    -> hopeful
(m>0) OUSNESS -> OUS    callousness    -> callous
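The rules listed above can be applied with a short sketch. It covers only this subset of rules, not the full Porter stemmer; the function names are hypothetical. The condition (m>0) refers to the measure m of the stem that remains after removing the suffix, that is, the number of vowel-consonant sequences in it.

```python
# Only the rules shown above: (suffix, replacement), each with condition m > 0.
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u, or y after a consonant."""
    char = word[i]
    if char in "aeiou":
        return False
    if char == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number of vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m


def apply_rules(word):
    """Apply the rule with the longest matching suffix if the stem has m > 0."""
    matching = [(suffix, repl) for suffix, repl in RULES if word.endswith(suffix)]
    if not matching:
        return word
    suffix, replacement = max(matching, key=lambda rule: len(rule[0]))
    stem = word[:-len(suffix)]
    return stem + replacement if measure(stem) > 0 else word


for example in ["valenci", "vietnamization", "operator", "hopefulness"]:
    print(example, "->", apply_rules(example))
```

Choosing the longest matching suffix matters for words such as vietnamization, which ends in both IZATION and ATION.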

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Original: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

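The Vauquois Triangle

[Figure: a triangle with Interlingua at the apex. On each side, from bottom to top: Words, Syntactic Structure, Semantic Structure. The two sides are connected by direct translation (words), syntactic transfer and semantic transfer.]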

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents.

Lemmatization can help with languages that have more morphology.

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

term       doc. frequency  postings
you        1               1
come       3               1 3 4
most       1               1
carefully  1               1
upon       1               1
your       1               1
take       1               2
thy        1               2
fair       1               2
hour       3               1 2 3
Laertes    1               2
time       1               2
be         1               2
thine      1               2
my         1               3
is         1               3
almost     1               3
possess    1               4
it         1               4
merely     1               4
that       1               4
should     1               4
to         1               4
this       1               4

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12
List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
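As a reminder, the intersection of two sorted postings lists can be sketched as a linear merge with one pointer per list. The function name `intersect` is chosen for this illustration.

```python
def intersect(list_a, list_b):
    """Intersect two sorted lists of document IDs with a linear merge."""
    result = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])  # same document in both lists
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1  # advance the pointer on the smaller document ID
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.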

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

List B is now at 10 while list A is at 4. A magical pointer lets list A jump directly to 10.

127

In practice

1 2 3 4 5 6 7 8 9 10 12, with skip pointers to 4, 7 and 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p evenly spaced skip pointers, where p is the number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
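Intersection with skip pointers can be sketched as follows, assuming about √p evenly spaced skip pointers per list of p postings. Storing the skips as a dictionary from list index to target index is a simplification chosen for this illustration (a real index would store them inside the postings list), and the function names are hypothetical.

```python
import math


def skip_targets(postings):
    """Map a list index to the index its skip pointer leads to (about sqrt(p) skips)."""
    p = len(postings)
    step = math.isqrt(p)
    if step < 2:
        return {}
    return {i: i + step for i in range(0, p - step, step)}


def intersect_with_skips(list_a, list_b):
    """Linear merge that follows a skip pointer whenever its target is not too far."""
    skips_a, skips_b = skip_targets(list_a), skip_targets(list_b)
    result = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            if i in skips_a and list_a[skips_a[i]] <= list_b[j]:
                # Everything before the skip target is smaller than list_b[j].
                while i in skips_a and list_a[skips_a[i]] <= list_b[j]:
                    i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and list_b[skips_b[j]] <= list_a[i]:
                while j in skips_b and list_b[skips_b[j]] <= list_a[i]:
                    j = skips_b[j]
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

On the 11-element list 1 2 3 4 5 6 7 8 9 10 12, this spacing gives skip pointers that lead to 4, 7 and 10.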

133

Phrase search

Phrase queries

"ETH Zürich": quotes = phrase search

With a posting list

ETH:    1 2 4 5 8 9 10
Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms.

Phrase search approaches

Biword indices
Positional indices

142

Bi-word indices

Document: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (one entry per pair of consecutive words):
Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to
(and so on for the rest of the sentence)

Query "ETH Zurich": look up the biword ETH Zurich directly.

Query "Help ETH Zurich to flexibly react": split the query into biwords and intersect their postings lists:
"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

157

Drawback

Query: "Help ETH Zurich to flexibly react"

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains every biword of the query but not the phrase itself: false positive.

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react
N    N   N      X  N         N          X  X  X  N        N

N X* N

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

63

Swedish

Jag skulle vilja ha en kaffe
Du skulle vilja ha en kaffe
Han skulle vilja ha en kaffe
Vi skulle vilja ha en kaffe
Ni skulle vilja ha en kaffe
De skulle vilja ha en kaffe

64

German

Ich möchte einen Kaffee
Du möchtest einen Kaffee
Er möchte einen Kaffee
Wir möchten einen Kaffee
Ihr möchtet einen Kaffee
Sie möchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese

[Chinese example sentence; characters not recoverable]

68

Hindi

मुझे इन्फ़ॉर्मेशन रिट्रीवल बहुत पसंद है
मुझे: "to me"; है: 3rd person

मैं इस टर्म अच्छा पढ़ रहा हूँ (male speaker: raha)
मैं इस टर्म अच्छा पढ़ रही हूँ (female speaker: rahee)
हूँ: 1st person

Oblique case

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morpheme by morpheme:
eat
go to
want
Past
<Frustration>
inferential evidential (turns out)
3rd/3rd
also

"Also, it turns out she/he wanted to go eat it, but ..."

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour
Take thy fair hour Laertes time be thine
My hour is almost come
Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class (same sequences)

Stop words

Reuters's list:
a an and are as at be by for from has he in is it its of on that the to was were will with
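The three notions above (token, type, stop word) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: tokens are taken to be maximal runs of ASCII letters, types are compared case-sensitively, and the function names are made up for this example.

```python
import re

# The 25 stop words of the Reuters list shown on the slide.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has",
    "he", "in", "is", "it", "its", "of", "on", "that", "the", "to", "was",
    "were", "will", "with",
}

def tokenize(text):
    """Token = grouped sequence of characters (here: a run of letters)."""
    return re.findall(r"[A-Za-z]+", text)

def types(tokens):
    """Type = equivalence class of identical tokens."""
    return set(tokens)

def remove_stop_words(tokens):
    """Drop tokens whose lowercased form is on the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

documents = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

tokens = [t for d in documents for t in tokenize(d)]
print(len(tokens), "tokens,", len(types(tokens)), "types")
print(remove_stop_words(tokenize(documents[3])))
```

The four lines contain more tokens than types because "hour", "come" and "it" occur several times.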

86

Trend

Large list (200-300) -> No list at all

87

Equivalence classes of terms (types)

U.S.A., USA
to-day, today
window, windows, Windows
Zurich, Zürich, Zuerich, Züri

Normalization

U.S.A. -> USA
to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Query term -> terms searched
Windows -> Windows
windows -> windows, Windows, window
window -> window, windows

Introduces asymmetry
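A small Python sketch may make the asymmetry concrete. The expansion table is the one from the slide; the helper names `expand` and `reaches` are invented for this illustration.

```python
# Expansion table from the slide: each query term maps to the terms searched.
EXPANSION = {
    "Windows": ["Windows"],
    "windows": ["windows", "Windows", "window"],
    "window": ["window", "windows"],
}

def expand(term):
    """Return the list of index terms searched for a query term."""
    return EXPANSION.get(term, [term])

def reaches(query_term, indexed_term):
    """True if a query for query_term also looks up indexed_term."""
    return indexed_term in expand(query_term)

# A query for "windows" finds documents containing "Windows",
# but a query for "Windows" does not find documents containing "windows".
print(reaches("windows", "Windows"), reaches("Windows", "windows"))
```

With equivalence classes the relation would be symmetric by construction; an expansion table lets each direction be decided separately.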

90

When to expand

Index before expansion

Lift -> 1 5
Elevator -> 41 6

Query: Lift

Upon indexing
Expansion is applied to the postings lists: the postings of Lift and Elevator are merged when the index is built, and the query Lift is looked up as is.

Upon querying
The index is left unchanged and expansion is applied to the query: Lift becomes Lift OR Elevator.
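The two strategies can be compared with a short Python sketch. The postings are the ones read off the slide (Lift in documents 1 and 5, Elevator in 41 and 6, which is an assumption about the garbled figure); the equivalence table and function names are hypothetical.

```python
# Toy index as read from the slide (assumption: document 6 contains "Elevator").
INDEX = {"lift": [1, 5], "elevator": [6, 41]}
# Hypothetical symmetric equivalence table.
EQUIVALENTS = {"lift": ["elevator"], "elevator": ["lift"]}

def expand_at_indexing(index, equivalents):
    """Build an index in which each list also holds the postings of its equivalents."""
    expanded = {}
    for term, postings in index.items():
        docs = set(postings)
        for other in equivalents.get(term, []):
            docs.update(index.get(other, []))
        expanded[term] = sorted(docs)
    return expanded

def search_with_query_expansion(index, term, equivalents):
    """Leave the index alone and evaluate term OR equivalent_1 OR ..."""
    docs = set(index.get(term, []))
    for other in equivalents.get(term, []):
        docs.update(index.get(other, []))
    return sorted(docs)

index_time = expand_at_indexing(INDEX, EQUIVALENTS)["lift"]
query_time = search_with_query_expansion(INDEX, "lift", EQUIVALENTS)
print(index_time, query_time)
```

Both routes return the same documents; the difference is where the cost is paid: larger postings lists for expansion upon indexing, more list lookups per query for expansion upon querying.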

94

Normalization: accents and diacritics

cliché -> cliche
Zürich -> Zuerich

Normalization: lowercasing

CAT -> cat
Schweiz -> schweiz

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple
Apples fall from trees -> apples

Machine learning inside
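Accent folding and lowercasing can be sketched with the Python standard library. This is a minimal illustration, not the lecture's implementation: the German transliteration table (ü -> ue and so on) and the function name `fold_accents` are assumptions, and truecasing is left out because it needs a trained model.

```python
import unicodedata

# Assumed transliteration table for German umlauts and sharp s.
GERMAN = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})

def fold_accents(token, german=False):
    """Remove diacritics; optionally transliterate German umlauts first."""
    token = unicodedata.normalize("NFC", token)
    if german:
        token = token.translate(GERMAN)
    decomposed = unicodedata.normalize("NFD", token)
    # Combining marks (accents) are separate code points after NFD; drop them.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_accents("cliché"))               # accent removed
print(fold_accents("Zürich", german=True))  # umlaut transliterated
print("CAT".lower(), "Schweiz".lower())     # lowercasing
```

Without `german=True`, Zürich would fold to Zurich instead of Zuerich, which is one reason normalization rules tend to be language-specific.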

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI    -> ENCE   valenci        -> valence
(m>0) ANCI    -> ANCE   hesitanci      -> hesitance
(m>0) IZER    -> IZE    digitizer      -> digitize
(m>0) ABLI    -> ABLE   conformabli    -> conformable
(m>0) ALLI    -> AL     radicalli      -> radical
(m>0) ENTLI   -> ENT    differentli    -> different
(m>0) ELI     -> E      vileli         -> vile
(m>0) OUSLI   -> OUS    analogousli    -> analogous
(m>0) IZATION -> IZE    vietnamization -> vietnamize
(m>0) ATION   -> ATE    predication    -> predicate
(m>0) ATOR    -> ATE    operator       -> operate
(m>0) ALISM   -> AL     feudalism      -> feudal
(m>0) IVENESS -> IVE    decisiveness   -> decisive
(m>0) FULNESS -> FUL    hopefulness    -> hopeful
(m>0) OUSNESS -> OUS    callousness    -> callous
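The rules above can be tried out with a small Python sketch. It covers only the rules listed on this slide, not the full Porter algorithm; the measure m counts vowel-consonant sequences in the stem that remains once the suffix is removed, and all function names are chosen for this example.

```python
# (suffix, replacement) pairs copied from the slide.
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]

def is_consonant(word, i):
    """Porter's definition: 'y' is a consonant only at the start or after a vowel."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m in [C](VC)^m[V]: number of vowel-to-consonant transitions."""
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m

def apply_rules(word):
    """Apply the rule with the longest matching suffix, if its condition m > 0 holds."""
    for suffix, replacement in sorted(RULES, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

for w in ["valenci", "vietnamization", "operator", "hopefulness", "station"]:
    print(w, "->", apply_rules(w))
```

Trying the longest suffix first matters for words such as vietnamization, which ends in both IZATION and ATION. The condition m > 0 keeps short words such as station intact, since the remaining stem "st" contains no vowel-consonant sequence.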

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

The Vauquois Triangle

[Figure: the Vauquois triangle. On both sides the levels rise from Words through Syntactic Structure to Semantic Structure, with Interlingua at the apex; the crossings are labelled Direct, Syntactic Transfer and Semantic Transfer.]

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

Lemmatization or Stemming

Lemmatization does not help, and can even degrade performance, for English documents.

Lemmatization can help with languages that have more morphology.

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

term        doc. freq.   postings
you         1            1
come        3            1 3 4
most        1            1
carefully   1            1
upon        1            1
your        1            1
thine       1            2
be          1            2
time        1            2
Laertes     1            2
thy         1            2
fair        1            2
take        1            2
my          1            3
hour        3            1 2 3
is          1            3
almost      1            3
possess     1            4
merely      1            4
that        1            4
it          1            4
should      1            4
to          1            4
this        1            4
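As a reminder of how such an index comes about, here is a minimal Python sketch that builds it from the four example lines. The document IDs 1 to 4 follow the order of the lines; whitespace tokenization and lowercasing are simplifying assumptions, and `build_index` is a name chosen for this example.

```python
from collections import defaultdict

DOCS = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}

def build_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # A set removes duplicates: a document is posted once per term.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)

index = build_index(DOCS)
for term in ("come", "hour", "it"):
    # The document frequency is simply the length of the postings list.
    print(term, len(index[term]), index[term])
```

Because documents are processed in increasing ID order, every postings list comes out sorted, which is what the intersection algorithm relies on.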

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12
List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
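The merge-style intersection can be written down in a few lines of Python. This is a sketch of the standard two-pointer walk over sorted lists; the function name is chosen for this example.

```python
def intersect(a, b):
    """Intersect two sorted postings lists in O(len(a) + len(b)) comparisons."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document in both lists: keep it and advance both pointers.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot occur in b any more
        else:
            j += 1  # b[j] cannot occur in a any more
    return result

list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

The walk stops as soon as either list reaches its end, since no further matches are possible.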

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

How could we make this faster?

Magical pointer: after the matches on 1 and 3, List A jumps straight to 10 instead of stepping through 4 5 6 7 8 9.

127

In practice

1 2 3 4 5 6 7 8 9 10 12
Skip pointers to 4, 7 and 10

Too short skips: many comparisons, waste of space
Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skips, where p = number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
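A possible Python sketch of intersection with skip pointers follows. It assumes a skip pointer at every position that is a multiple of the span, with a span of roughly √p for a list of p postings; storing the pointers implicitly as array offsets is a simplification, and the helper names are made up.

```python
import math

def skip_span(postings):
    """Distance covered by one skip: about sqrt(p) for p postings."""
    return max(1, math.isqrt(len(postings)))

def advance(postings, pos, span, target):
    """Follow the skip pointer at pos if it does not overshoot target, else step by one."""
    has_skip = pos % span == 0 and pos + span < len(postings)
    if has_skip and postings[pos + span] <= target:
        return pos + span
    return pos + 1

def intersect_with_skips(a, b):
    span_a, span_b = skip_span(a), skip_span(b)
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, span_a, b[j])
        else:
            j = advance(b, j, span_b, a[i])
    return result

list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

On these lists, after the match on 3 the pointer in List A moves from 4 to 7 and then to 10 in two skips rather than six single steps. A skip is only taken when its target is not larger than the current value in the other list, so no match can be jumped over.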

133

Phrase search

Phrase queries

"ETH Zürich"
Quotes = phrase search

With a posting list

ETH:    1 2 4 5 8 9 10
Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

Phrase search approaches

Biword indices
Positional indices

142

Bi-word indices

Document: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Query: ETH Zurich
Look up the bi-word ETH Zurich directly.

Query: Help ETH Zurich to flexibly react
Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect the postings lists of these bi-words.

157

Drawback

Query: Help ETH Zurich to flexibly react
Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains every bi-word of the query, but not the phrase itself.

False positive
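The bi-word lookup and its false positive can be reproduced with a short Python sketch. Document IDs 1 and 2 and the function names are hypothetical; tokenization is by whitespace only.

```python
from collections import defaultdict

def biwords(text):
    """All pairs of consecutive words, e.g. 'Help ETH', 'ETH Zurich', ..."""
    words = text.split()
    return [f"{first} {second}" for first, second in zip(words, words[1:])]

def build_biword_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index

def phrase_candidates(index, phrase):
    """AND together the postings of every bi-word of the phrase."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return sorted(set.intersection(*postings)) if postings else []

DOCS = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
QUERY = "Help ETH Zurich to flexibly react"

index = build_biword_index(DOCS)
candidates = phrase_candidates(index, QUERY)
true_matches = [d for d in candidates if QUERY in DOCS[d]]
print(candidates, true_matches)
```

Document 2 is returned as a candidate although it does not contain the phrase, so the result of a bi-word index has to be treated as a superset of the true matches for phrases longer than two words.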

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react
N    N   N      X  N         N          X  X  X  N        N

Pattern: NX*N

Help ETH
ETH Zurich
Zurich introduce
introduce techniques
techniques flexibly
flexibly react
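Given the tags, extracting the extended bi-words is a small exercise. The sketch below assumes the N/X tags are already available (no real tagger is called) and uses a made-up function name.

```python
def extended_biwords(tagged):
    """Pair each N word with the next N word, skipping any X words in between (NX*N)."""
    content_words = [word for word, tag in tagged if tag == "N"]
    return [f"{a} {b}" for a, b in zip(content_words, content_words[1:])]

# Tags copied from the slide.
TAGGED = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]

for biword in extended_biwords(TAGGED):
    print(biword)
```

Dropping the X words before pairing means that "to" and "so as to" no longer produce index entries of their own.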

172

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index

Help ETH Zurich
ETH Zurich to
Zurich to flexibly
to flexibly react
flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index

Help ETH Zurich to
ETH Zurich to flexibly
Zurich to flexibly react
to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

Improves on false positive issue

Query: Help ETH Zurich to flexibly react
Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

Drawback

Size of the vocabulary increases exponentially: (number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting = document ID (as before), term frequency, positions

Index

term       document ID   term frequency   positions
Help       C             1                1
ETH        C             1                2
Zurich     C             1                3
to         C             3                4 7 11
flexibly   C             1                5
react      C             1                6

Query: ETH Zurich

ETH      C   1   2
Zurich   C   1   3

ETH is at position 2 and Zurich at position 2 + 1 = 3 in the same document, so C matches the phrase.
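The "+1" check generalizes to longer phrases: the k-th word of the phrase must occur k positions after the first one. Here is a minimal Python sketch; document D, the data layout (term -> document ID -> positions) and the function names are assumptions for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions]}; the term frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index

def phrase_search(index, phrase):
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return []
    # Documents that contain every term of the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            # Term number k must sit exactly k positions after the first term.
            if all(start + k in index[term][doc_id] for k, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits

DOCS = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(DOCS)
print(index["to"]["C"])
print(phrase_search(index, "ETH Zurich"))
print(phrase_search(index, "Help ETH Zurich to flexibly react"))
```

Unlike the bi-word index, the positional index rejects document D for the long query, because "flexibly" does not directly follow the first "to" there. The price is that every posting now carries a list of positions.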

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B so far: 1 3

Magical pointer: jump directly to 10 in List A

Intersection of A and B after the jump: 1 3 10

127

In practice

Postings list: 1 2 3 4 5 6 7 8 9 10 12

Skip pointers: 4 7 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, but not many real opportunities to skip

In practice: √p skip pointers, where p is the number of postings (for example p = 16 for the list 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16)
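
The "magical pointer" idea can be sketched as an intersection with skip pointers. The sketch assumes skip pointers are simulated on plain Python lists, placed every ⌊√p⌋ positions on a list of p postings (a common heuristic); the function name and this layout are assumptions, not the lecture's implementation.

```python
import math


def intersect_with_skips(a, b):
    """Intersect two sorted postings lists, following skip pointers when safe."""
    skip_a = max(1, math.isqrt(len(a)))  # skip length for list a
    skip_b = max(1, math.isqrt(len(b)))  # skip length for list b
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # A skip pointer exists only at positions that are multiples of skip_a.
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                while i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                    i += skip_a  # the skip target is still not past b[j]
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                while j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                    j += skip_b
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print(intersect_with_skips(list_a, list_b))  # [1, 3, 10, 12]
```

A skip is only taken when its target does not exceed the current value in the other list, so no common document ID can be jumped over.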

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Query: ETH Zurich

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

153

Bi-word indices

Query: Help ETH Zurich to flexibly react

Rewritten query: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect the postings lists of these biwords

157

Drawback

Query: Help ETH Zurich to flexibly react

Rewritten query: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive: the document contains all five biwords of the query, but not the phrase itself
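
The false positive can be reproduced with a small biword index. This is a sketch under simple assumptions: tokens are split on whitespace and lowercased, and the document IDs "C" and "D" and all function names are hypothetical.

```python
from collections import defaultdict


def biwords(text):
    """Return the consecutive word pairs of a text."""
    tokens = text.lower().split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index


def biword_phrase_query(index, phrase):
    """AND together the postings of all biwords of the phrase."""
    parts = biwords(phrase)
    if not parts:
        return []
    result = set(index.get(parts[0], set()))
    for bw in parts[1:]:
        result &= index.get(bw, set())
    return sorted(result)


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
# "D" matches every biword but does not contain the phrase: a false positive.
print(biword_phrase_query(index, "Help ETH Zurich to flexibly react"))  # ['C', 'D']
```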

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help (N), ETH (N), Zurich (N), to (X), introduce (N), techniques (N), so (X), as (X), to (X), flexibly (N), react (N)

Pattern: NX*N

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
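
A minimal sketch of how the extended biwords could be derived once every token carries an N or X tag. The tags are supplied by hand here (no tagger is run), and the function name is an assumption.

```python
def extended_biwords(tagged_tokens):
    """Pair each N token with the next N token, ignoring the X tokens in between."""
    kept = [token for token, tag in tagged_tokens if tag == "N"]
    return [f"{a} {b}" for a, b in zip(kept, kept[1:])]


# Hand-assigned tags, following the slide: N is kept, X is skipped.
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
print(extended_biwords(tagged))
```

Because only N and X tags occur, two N tokens separated by any number of X tokens are exactly neighbours in the filtered list.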

172

Bi-word indices

Query: Help ETH Zurich to flexibly react

Intersect: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

Intersect: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

Intersect: "Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Rewritten query: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Drawback

Size of the vocabulary increases exponentially: (number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Positional posting: document ID (as before), term frequency, positions

Help: C, 1, [1]

ETH: C, 1, [2]

Zurich: C, 1, [3]

to: C, 3, [4, 7, 11]

flexibly: C, 1, [5]

react: C, 1, [6]

190

Positional index

Query: ETH Zurich

ETH: C, 1, [2]

Zurich: C, 1, [3]

The position of Zurich is the position of ETH +1, so the two terms are adjacent in C
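
The +1 check can be sketched with a small positional index. Assumptions: whitespace tokenization, lowercasing, positions counted from 1 as on the slides, and a hypothetical second document "D"; the function names are illustrative only.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split(), start=1):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Return the documents in which the terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    hits = []
    for doc_id in sorted(candidates):
        # The k-th term must appear exactly k positions after the first term.
        if any(
            all(start + k in postings[k][doc_id] for k in range(1, len(terms)))
            for start in postings[0][doc_id]
        ):
            hits.append(doc_id)
    return hits


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)
print(index["to"]["C"])                                            # [4, 7, 11]
print(phrase_query(index, "ETH Zurich"))                           # ['C', 'D']
print(phrase_query(index, "Help ETH Zurich to flexibly react"))    # ['C']
```

Unlike the biword index, the positional index rejects document "D" for the longer phrase, because "flexibly" does not directly follow the first "to" there.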

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

65

German

Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha tänkt du heggish en kaffee welle trinke

67

Chinese


68

Hindi

मुझे इन्फ़ॉर्मेशन रिट्रीवल बहुत पसंद है

मैं इस टर्म अच्छा पढ़ रहा हूँ

मैं इस टर्म अच्छा पढ़ रही हूँ

मुझे: to me (oblique case)

है: 3rd person

हूँ: 1st person

रहा (raha): male speaker

रही (rahee): female speaker

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morpheme by morpheme: eat, go to, want, past, <frustration>, inferential evidential (turns out), 3rd/3rd, also

Also, it turns out she/he wanted to go eat it, but ...

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class (same sequences)
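
The difference between tokens and types can be sketched by counting both on the four example lines. The second line is read here as "Take thy fair hour Laertes time be thine", and tokenization is plain whitespace splitting, which is an assumption for illustration.

```python
LINES = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# Tokens: every occurrence counts.
tokens = [token for line in LINES for token in line.split()]

# Types: identical character sequences fall into the same class.
types = set(tokens)

print(len(tokens))  # 29
print(len(types))   # 24
```

The words "hour", "come" and "it" occur more than once, which is why there are fewer types than tokens.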

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters's list

a an and are as at be by for from has he in is it its of on that the to was were will with
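
Stop-word removal with this list can be sketched as a simple filter. Lowercasing before the lookup and the function name are assumptions for the example.

```python
REUTERS_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
    "has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
    "to", "was", "were", "will", "with",
}


def remove_stop_words(text):
    """Lowercase the text and drop every token found in the stop list."""
    return [t for t in text.lower().split() if t not in REUTERS_STOP_WORDS]


print(remove_stop_words("Possess it merely That it should come to this"))
# ['possess', 'merely', 'should', 'come', 'this']
```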

86

Trend

From a large list (200-300) to no list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. → USA

to-day → today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows → Windows

windows → windows, Windows, window

window → window, windows

Introduces asymmetry

90

When to expand

Postings before expansion: Lift: 1 5; Elevator: 1 4 6

Upon indexing: the query Lift is left as is, and the index is expanded so that both terms share the same postings (Lift: 1 4 5 6; Elevator: 1 4 5 6)

Upon querying: the index is left as is (Lift: 1 5; Elevator: 1 4 6), and the query Lift is expanded to Lift OR Elevator
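
The two expansion strategies can be compared in a short sketch. The postings values and the synonym table below are illustrative assumptions, and both function names are hypothetical.

```python
# Illustrative postings and synonym table (assumptions for this sketch).
INDEX = {"lift": [1, 5], "elevator": [1, 4, 6]}
SYNONYMS = {"lift": ["elevator"], "elevator": ["lift"]}


def expand_index(index, synonyms):
    """Expansion upon indexing: each term also receives its synonyms' postings."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for synonym in synonyms.get(term, []):
            merged |= set(index.get(synonym, []))
        expanded[term] = sorted(merged)
    return expanded


def query_with_expansion(index, synonyms, term):
    """Expansion upon querying: evaluate term OR synonym on the original index."""
    result = set(index.get(term, []))
    for synonym in synonyms.get(term, []):
        result |= set(index.get(synonym, []))
    return sorted(result)


print(expand_index(INDEX, SYNONYMS)["lift"])            # [1, 4, 5, 6]
print(query_with_expansion(INDEX, SYNONYMS, "lift"))    # [1, 4, 5, 6]
```

Both strategies return the same documents; the trade-off is a larger index in the first case against more work per query in the second.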

94

Normalization: accents and diacritics

cliché → cliche

Zürich → Zuerich

95

Normalization: lowercasing

CAT → cat

Schweiz → schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct → Apple

Apples fall from trees → apples

Machine learning inside
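
The accent and lowercasing rules can be sketched with the standard `unicodedata` module. The German umlaut table (ü becomes ue, and so on) is an assumption added to reproduce the Zuerich example; truecasing is not covered, since it needs a trained model.

```python
import unicodedata

# Assumed transliteration table for German letters (not from the slides).
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}


def strip_diacritics(text):
    """Transliterate German umlauts, then drop all remaining combining marks."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(GERMAN.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(token):
    """Strip diacritics, then lowercase."""
    return strip_diacritics(token).lower()


print(strip_diacritics("cliché"))   # cliche
print(strip_diacritics("Zürich"))   # Zuerich
print(normalize("Schweiz"))         # schweiz
```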

97

Stemming

analysis → analysi

feature → featur

are → ar

easily → easili

visible → visibl

variations → variat

individual → individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer

(m>0) ENCI -> ENCE: valenci -> valence
(m>0) ANCI -> ANCE: hesitanci -> hesitance
(m>0) IZER -> IZE: digitizer -> digitize
(m>0) ABLI -> ABLE: conformabli -> conformable
(m>0) ALLI -> AL: radicalli -> radical
(m>0) ENTLI -> ENT: differentli -> different
(m>0) ELI -> E: vileli -> vile
(m>0) OUSLI -> OUS: analogousli -> analogous
(m>0) IZATION -> IZE: vietnamization -> vietnamize
(m>0) ATION -> ATE: predication -> predicate
(m>0) ATOR -> ATE: operator -> operate
(m>0) ALISM -> AL: feudalism -> feudal
(m>0) IVENESS -> IVE: decisiveness -> decisive
(m>0) FULNESS -> FUL: hopefulness -> hopeful
(m>0) OUSNESS -> OUS: callousness -> callous
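
The rules above can be sketched as a single suffix-rewriting step. This covers only the rules listed on the slide, not the full Porter stemmer; the measure m counts vowel-consonant sequences in the stem, and the function names are assumptions.

```python
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u, or y after a consonant."""
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    previous_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m


def apply_rules(word):
    """Rewrite the longest matching suffix if the remaining stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


print(apply_rules("vietnamization"))  # vietnamize
print(apply_rules("operator"))        # operate
```

Trying the longest suffix first matters for words such as vietnamization, which ends in both IZATION and ATION.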

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

Levels, from bottom to top: Words (direct), Syntactic Structure (syntactic transfer), Semantic Structure (semantic transfer), Interlingua

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term (document frequency): postings

you (1): 1
come (3): 1 3 4
most (1): 1
carefully (1): 1
upon (1): 1
your (1): 1
thine (1): 2
be (1): 2
time (1): 2
Laertes (1): 2
thy (1): 2
fair (1): 2
take (1): 2
my (1): 3
hour (3): 1 2 3
is (1): 3
almost (1): 3
possess (1): 4
merely (1): 4
that (1): 4
it (1): 4
should (1): 4
to (1): 4
this (1): 4
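
A standard inverted index over the four example lines can be sketched as follows. Assumptions: the lines are numbered 1 to 4 as documents, tokens are split on whitespace and lowercased (so Laertes becomes laertes), and the function name is hypothetical.

```python
from collections import defaultdict

DOCS = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # A set removes duplicates, so each document appears once per term.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(DOCS)
print(index["come"])  # [1, 3, 4]
print(index["hour"])  # [1, 2, 3]
```

The document frequency of a term is simply the length of its postings list, for example `len(index["come"])`.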

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

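The Vauquois Triangle

[Figure: the Vauquois triangle. On both the source and the target side, the levels rise from words to syntactic structure to semantic structure, with the interlingua at the top. Paths between the two sides: direct, syntactic transfer, semantic transfer.]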

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term       Document frequency  Postings
you        1                   1
come       3                   1 3 4
most       1                   1
carefully  1                   1
upon       1                   1
your       1                   1
thine      1                   2
be         1                   2
time       1                   2
Laertes    1                   2
thy        1                   2
fair       1                   2
take       1                   2
my         1                   3
hour       3                   1 2 3
is         1                   3
almost     1                   3
possess    1                   4
merely     1                   4
that       1                   4
it         1                   4
should     1                   4
to         1                   4
this       1                   4
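As a reminder of how such postings come about, here is a minimal sketch that builds a standard inverted index from the four example lines used in this lecture. Lowercasing every token and splitting on whitespace are simplifying assumptions, and `build_index` is a name chosen here for illustration.

```python
from collections import defaultdict

# The four example documents of the lecture, keyed by document ID.
DOCS = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_index(docs):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # A set removes duplicates: each document appears at most once per term.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


index = build_index(DOCS)
print(index["come"])  # postings list of "come"
print(index["hour"])  # postings list of "hour"
```

The document frequency of a term is simply the length of its postings list, for example `len(index["come"])`.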

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
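The intersection algorithm recalled here can be sketched as a two-pointer merge over sorted postings lists. The function name `intersect` is chosen for illustration; the two lists are the ones from the slide.

```python
def intersect(a, b):
    """Intersect two sorted postings lists by walking both with one pointer each."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document ID in both lists: keep it and advance both pointers.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each step advances at least one pointer, so the running time is linear in the total length of the two lists.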

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B, built up one match at a time: 1, then 1 3, then 1 3 10, then 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B so far: 1 3

Magical pointer: once 3 has been matched, the next entry of List B is 10, so a pointer in List A that jumps directly to 10 avoids stepping through 4 to 9. The intersection then continues with 10.

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip pointers: 4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips: many comparisons, waste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips: few comparisons, not many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

√p skip pointers, where p is the number of postings
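The trade-off between short and long skips can be explored with a sketch of intersection with skip pointers. It assumes evenly spaced skip pointers whose span is about the square root of the list length, and it simulates them with index arithmetic instead of storing real pointers. The function name `intersect_with_skips` is made up here.

```python
import math


def intersect_with_skips(a, b):
    """Intersect two sorted lists, following a skip pointer when it does not overshoot."""
    span_a = max(1, math.isqrt(len(a)))  # skip span of about sqrt(p) postings
    span_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Positions that are multiples of the span carry a skip pointer.
            if i % span_a == 0 and i + span_a < len(a) and a[i + span_a] <= b[j]:
                i += span_a  # everything skipped is smaller than b[j]
            else:
                i += 1
        else:
            if j % span_b == 0 and j + span_b < len(b) and b[j + span_b] <= a[i]:
                j += span_b
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print(intersect_with_skips(list_a, list_b))
```

A skip is only taken when the target is still less than or equal to the other list's current entry, so no common document ID can be jumped over.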

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

138

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (one entry per pair of consecutive words, added one pair at a time):

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Query: ETH Zurich

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Inconvenient

Query: Help ETH Zurich to flexibly react

Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

Every biword of the query occurs in the document, although the phrase itself does not.

False positive
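The false positive can be reproduced with a few lines. This is a minimal sketch that treats a document as the set of its biwords and answers a phrase query by checking that every query biword is present; `biwords` and `biword_match` are illustrative names, and whitespace tokenization is an assumption.

```python
def biwords(text):
    """Return the set of pairs of consecutive words in the text."""
    words = text.split()
    return {f"{a} {b}" for a, b in zip(words, words[1:])}


def biword_match(query, document):
    """A document matches when it contains every biword of the query (AND semantics)."""
    return biwords(query) <= biwords(document)


query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(biword_match(query, document))  # the biword index reports a match
print(query in document)              # the phrase itself is not in the document
```

The two answers differ: the conjunction of biwords is satisfied, yet the exact phrase never occurs, which is the false positive shown on the slide.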

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NX*N

Resulting biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
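A small sketch of how such extended biwords could be produced from a tagged sentence. It assumes the tags are already given, using N and X as on the slide, and it pairs each N word with the next N word while ignoring any X words in between. The function name `extended_biwords` is hypothetical.

```python
def extended_biwords(tagged):
    """Pair each N-tagged word with the next N-tagged word, skipping X-tagged words."""
    content = [word for word, tag in tagged if tag == "N"]
    return [f"{a} {b}" for a, b in zip(content, content[1:])]


# Tags copied from the slide: N for the words that are kept, X for the ones skipped.
tagged_sentence = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]

for pair in extended_biwords(tagged_sentence):
    print(pair)
```

Dropping the X words first and then pairing neighbours is equivalent to matching every N X* N sequence in the sentence.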

172

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Rewritten as: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Rewritten as: Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Rewritten as: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Inconvenient

Size of the vocabulary increases exponentially with the phrase length n: up to (number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

A positional posting holds the document ID (as before), the term frequency, and the positions of the term in the document.

Index

Term      Document ID  Term frequency  Positions
Help      C            1               1
ETH       C            1               2
Zurich    C            1               3
to        C            3               4 7 11
flexibly  C            1               5
react     C            1               6

190

Positional index

Query: ETH Zurich

ETH: document C, term frequency 1, position 2

Zurich: document C, term frequency 1, position 3

The position of Zurich is the position of ETH +1, so the phrase occurs in document C.
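The position check can be sketched as follows. This is a minimal positional index under simplifying assumptions: whitespace tokenization, 1-based positions as on the slides, and no stored term frequency because it equals the length of the position list. The second document D and the function names are added here purely for illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {document ID -> list of 1-based positions}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Return the documents where the terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            # The k-th term must sit exactly k positions after the first one.
            if all(start + k in index[term][doc_id] for k, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",  # hypothetical second document
}
index = build_positional_index(docs)
print(index["to"]["C"])
print(phrase_query(index, "ETH Zurich"))
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

Unlike the biword index, the longer query no longer matches document D, because `flexibly` does not directly follow `to` at the required position there.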

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

67

Chinese

[Example sentence in Chinese; the characters were lost in text extraction.]

68

Hindi

मुझे इन्फ़ॉर्मेशन रिट्रीवल बहुत पसंद है

मैं इस टर्म अच्छा पढ़ रहा हूँ

मैं इस टर्म अच्छा पढ़ रही हूँ

69

Hindi

मुझे इन्फ़ॉर्मेशन रिट्रीवल बहुत पसंद है

मैं इस टर्म अच्छा पढ़ रहा हूँ (raha)

मैं इस टर्म अच्छा पढ़ रही हूँ (rahee)

Annotations: To me; I; male speaker; female speaker; 3rd person; 1st person; oblique case

70

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morpheme by morpheme: eat, go to, want, past, <frustration>, inferential/evidential (turns out), 3rd/3rd, also

Also, it turns out she/he wanted to go eat it, but ...

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class of tokens (same sequence of characters)
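The difference between tokens and types can be made concrete by counting both on the four example lines. This is a sketch under the assumption that tokens are obtained by splitting on whitespace and that two tokens belong to the same type when they are exactly the same character sequence (case-sensitive).

```python
LINES = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# Tokens: every occurrence counts, in order of appearance.
tokens = [token for line in LINES for token in line.split()]

# Types: one representative per distinct character sequence.
types = set(tokens)

print(len(tokens), "tokens")
print(len(types), "types")
print(tokens.count("hour"), "tokens of the type 'hour'")
```

Words such as `hour`, `come` and `it` occur several times as tokens but contribute a single type each, which is why there are fewer types than tokens.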

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters' list

a, an, and, are, as, at, be, by, for, from, has, he, in

is, it, its, of, on, that, the, to, was, were, will, with
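Filtering with this stop list takes only a few lines. The sketch below assumes tokens are lowercased before the lookup; the function name `remove_stop_words` is chosen for illustration.

```python
# The 25 stop words listed on the slide.
STOP_WORDS = set(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)


def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop every token found in the stop list."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


print(remove_stop_words("Possess it merely That it should come to this"))
```

Note that a phrase made up mostly of stop words, such as "to be or not to be", would lose most of its tokens under this filter.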

86

Trend

From a large list (200-300) to no list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

90

When to expand?

Upon indexing: the postings lists of Lift and Elevator are each expanded to cover both terms, so the query Lift is answered from a single list.

Upon querying: the index is left unchanged and the query Lift is expanded to Lift OR Elevator.

[Figure: postings lists of Lift and Elevator before and after expansion; the document IDs were garbled in extraction.]
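Both options can be compared in a short sketch. The document IDs below are hypothetical, since the ones on the slide did not survive extraction, and the synonym table and function names are assumptions made for illustration.

```python
SYNONYMS = {"lift": {"elevator"}, "elevator": {"lift"}}

# Hypothetical postings lists (document IDs invented for this example).
POSTINGS = {"lift": [1, 5], "elevator": [4, 6]}


def expand_index(postings, synonyms):
    """Expansion upon indexing: each term's list also covers its synonyms' documents."""
    expanded = {}
    for term, doc_ids in postings.items():
        merged = set(doc_ids)
        for synonym in synonyms.get(term, ()):
            merged |= set(postings.get(synonym, []))
        expanded[term] = sorted(merged)
    return expanded


def expand_query(term, synonyms):
    """Expansion upon querying: rewrite the term into an OR over its synonyms."""
    return [term] + sorted(synonyms.get(term, ()))


def search_or(postings, terms):
    """Evaluate an OR query as the union of the postings lists."""
    result = set()
    for term in terms:
        result |= set(postings.get(term, []))
    return sorted(result)


print(expand_index(POSTINGS, SYNONYMS)["lift"])              # lookup in the expanded index
print(search_or(POSTINGS, expand_query("lift", SYNONYMS)))   # Lift OR Elevator on the original index
```

Both routes return the same documents; expanding at indexing time costs space in the index, while expanding at query time costs an extra union per query.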

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich
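One possible way to implement this normalization uses Python's standard `unicodedata` module. The mapping of German umlauts to `ae`, `oe`, `ue` is an assumption made here to reproduce the Zuerich example; a plain accent-stripping pass alone would yield Zurich instead. The function name `strip_diacritics` is illustrative.

```python
import unicodedata

# Assumed German-specific transliteration, applied before generic accent stripping.
GERMAN = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def strip_diacritics(text):
    """Transliterate umlauts, then drop all remaining combining marks."""
    # NFC first, so that umlauts are single code points the table can match.
    text = unicodedata.normalize("NFC", text).translate(GERMAN)
    # NFD splits accented letters into base letter plus combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("cliché"))
print(strip_diacritics("Zürich"))
```

Whether `ü` should become `ue` or `u` depends on the language of the collection, which is one reason such rules are harder than they look.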

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

apples

Machine learning inside


68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expand

Upon indexing: the query stays as typed (Lift); the index is expanded, so that the postings lists of Lift and Elevator each hold the union of both lists.

Upon querying: the index is left untouched; the query Lift is expanded to Lift OR Elevator.
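Both options return the same documents and differ only in where the work is done. The following sketch assumes a tiny synonym table and made-up postings lists (document IDs 1, 5 and 4, 6 are illustrative values, as are all function names).

```python
# Hypothetical synonym table and postings lists, for illustration only.
SYNONYMS = {"lift": {"lift", "elevator"}, "elevator": {"lift", "elevator"}}
INDEX = {"lift": [1, 5], "elevator": [4, 6]}

def merged_postings(index, term):
    """Union of the postings of a term and of all its synonyms."""
    docs = set()
    for synonym in SYNONYMS.get(term, {term}):
        docs.update(index.get(synonym, []))
    return sorted(docs)

def expand_index(index):
    """Expansion upon indexing: every term stores the merged postings."""
    return {term: merged_postings(index, term) for term in index}

def query_expanded_index(expanded, term):
    """Query against an expanded index: plain lookup, no query rewriting."""
    return expanded.get(term, [])

def query_with_expansion(index, term):
    """Expansion upon querying: 'lift' becomes 'lift OR elevator'."""
    return merged_postings(index, term)

expanded = expand_index(INDEX)
print(query_expanded_index(expanded, "lift"))
print(query_with_expansion(INDEX, "lift"))
```

Expanding upon indexing makes the index larger but keeps queries cheap; expanding upon querying keeps the index small and lets the synonym table change without re-indexing.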

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz
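A minimal sketch of these two normalization steps, using Python's standard unicodedata module. The German transliteration table (ü to ue and so on) is an assumption added for the Zürich example; plain accent stripping alone would give Zurich rather than Zuerich. The function names are illustrative.

```python
import unicodedata

# Assumed transliteration table for German umlauts and sharp s.
GERMAN_TRANSLITERATION = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss",
}

def strip_diacritics(text, transliterate_german=False):
    """Remove accents; optionally transliterate German umlauts first."""
    text = unicodedata.normalize("NFC", text)
    if transliterate_german:
        text = "".join(GERMAN_TRANSLITERATION.get(ch, ch) for ch in text)
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(token, transliterate_german=False):
    """Strip diacritics, then lowercase."""
    return strip_diacritics(token, transliterate_german).lower()

print(strip_diacritics("cliché"))
print(strip_diacritics("Zürich", transliterate_german=True))
print(normalize("CAT"), normalize("Schweiz"))
```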

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis, feature, are, easily, visible, variations, individual

analysi, featur, ar, easili, visibl, variat, individu

Chop letters of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE        valenci -> valence
(m>0) ANCI -> ANCE        hesitanci -> hesitance
(m>0) IZER -> IZE         digitizer -> digitize
(m>0) ABLI -> ABLE        conformabli -> conformable
(m>0) ALLI -> AL          radicalli -> radical
(m>0) ENTLI -> ENT        differentli -> different
(m>0) ELI -> E            vileli -> vile
(m>0) OUSLI -> OUS        analogousli -> analogous
(m>0) IZATION -> IZE      vietnamization -> vietnamize
(m>0) ATION -> ATE        predication -> predicate
(m>0) ATOR -> ATE         operator -> operate
(m>0) ALISM -> AL         feudalism -> feudal
(m>0) IVENESS -> IVE      decisiveness -> decisive
(m>0) FULNESS -> FUL      hopefulness -> hopeful
(m>0) OUSNESS -> OUS      callousness -> callous
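The rules above can be tried out with a short sketch. It covers only the fifteen rules listed on this slide, not the full Porter stemmer, and assumes lowercase input. The measure m counts vowel-run/consonant-run pairs in the remaining stem; the names measure and apply_rules are chosen for the illustration.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """Porter's definition: 'y' is a vowel when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Porter's m: number of (vowel run, consonant run) pairs in the stem."""
    pattern = ""
    for i in range(len(stem)):
        kind = "c" if is_consonant(stem, i) else "v"
        if not pattern or pattern[-1] != kind:
            pattern += kind  # collapse runs: 'trouble' -> 'cvcv'
    return pattern.count("vc")

RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]
# Try the longest suffix first, so 'ization' wins over 'ation'.
RULES.sort(key=lambda rule: len(rule[0]), reverse=True)

def apply_rules(word):
    """Apply the first (longest) matching rule if the stem has m > 0."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

for example in ["valenci", "vietnamization", "hopefulness"]:
    print(example, "->", apply_rules(example))
```

The condition m > 0 prevents a rule from firing when almost nothing would be left of the word.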

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois Triangle

Interlingua

Semantic Structure -> Semantic Structure: semantic transfer

Syntactic Structure -> Syntactic Structure: syntactic transfer

Words -> Words: direct

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages

that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term (document frequency): postings

you (1): 1
come (3): 1 3 4
most (1): 1
carefully (1): 1
upon (1): 1
your (1): 1
hour (3): 1 2 3
take (1): 2
thy (1): 2
fair (1): 2
Laertes (1): 2
time (1): 2
be (1): 2
thine (1): 2
my (1): 3
is (1): 3
almost (1): 3
possess (1): 4
it (1): 4
merely (1): 4
that (1): 4
should (1): 4
to (1): 4
this (1): 4

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
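As a reminder of how the two pointers move, here is a minimal sketch of the merge-style intersection of two sorted postings lists. The function name intersect is chosen for the example.

```python
def intersect(a, b):
    """Intersect two sorted postings lists with one pointer per list."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document ID in both lists: keep it, advance both.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind: advance the pointer in a
        else:
            j += 1  # b is behind: advance the pointer in b
    return result

list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.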

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10

Magical pointer: skips ahead in List A directly to 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip pointers to 4, 7, 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skips, where p is the number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
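A sketch of intersection with skip pointers follows. It assumes skip pointers placed every floor(√p) positions in a list of p postings and simulates them with index arithmetic instead of storing them in the list; the function names are illustrative.

```python
import math

def skip_target(i, step, length):
    """Index reached by the skip pointer at position i, or None if absent."""
    if i % step == 0 and i + step < length:
        return i + step
    return None

def intersect_with_skips(a, b):
    """Merge-style intersection that follows a skip when it cannot overshoot."""
    step_a = max(1, math.isqrt(len(a)))
    step_b = max(1, math.isqrt(len(b)))
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            target = skip_target(i, step_a, len(a))
            if target is not None and a[target] <= b[j]:
                i = target  # safe: nothing skipped over can equal b[j]
            else:
                i += 1
        else:
            target = skip_target(j, step_b, len(b))
            if target is not None and b[target] <= a[i]:
                j = target
            else:
                j += 1
    return result

list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

A skip is only taken when the posting it lands on is not larger than the current posting of the other list, so no common document ID can be jumped over.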

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Query: "ETH Zurich"

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Query: "Help ETH Zurich to flexibly react"

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

157

Inconvenient

Query: "Help ETH Zurich to flexibly react"

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
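The false positive can be reproduced with a small biword index. This is a sketch under simple assumptions: whitespace tokenization, no normalization, and two documents with made-up IDs 1 and 2; the function names are illustrative.

```python
from collections import defaultdict

def biwords(text):
    """All pairs of consecutive words, each pair joined into one term."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

def build_biword_index(docs):
    """Map every biword to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index

def phrase_query(index, phrase):
    """AND together the postings of all biwords of the phrase."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "Help ETH Zurich to flexibly react to new challenges "
       "and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
print(sorted(phrase_query(index, "Help ETH Zurich to flexibly react")))
```

Document 2 is returned although it does not contain the phrase: every biword of the query occurs somewhere in it, just not contiguously.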

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

NX*N

Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
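Extracting the NX*N pairs amounts to dropping the X-tagged words and pairing up the remaining neighbours. The sketch below assumes the tags are already given (no tagger is run) and uses the N/X tagging of the example sentence; the name extended_biwords is chosen for the illustration.

```python
def extended_biwords(tagged):
    """Pairs of N words separated only by X words (pattern N X* N).

    tagged: list of (word, tag) pairs with tag 'N' or 'X'.
    """
    content_words = [word for word, tag in tagged if tag == "N"]
    return [f"{a} {b}" for a, b in zip(content_words, content_words[1:])]

tagged_sentence = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
print(extended_biwords(tagged_sentence))
```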

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

"Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Inconvenient

Size of the vocabulary increases exponentially

(#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help -> C, 1: 1

ETH -> C, 1: 2

Zurich -> C, 1: 3

to -> C, 3: 4 7 11

flexibly -> C, 1: 5

react -> C, 1: 6

Positional posting: document ID (as before), term frequency, positions

190

Positional index

Query: "ETH Zurich"

ETH -> C, 1: 2

Zurich -> C, 1: 3

+1: the position of Zurich is the position of ETH plus one
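A phrase query on a positional index checks that the positions of consecutive query words differ by exactly one. The sketch below assumes whitespace tokenization, 1-based positions, and a second document with the made-up ID D added for contrast; the function names are illustrative.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions]}; the term frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index

def phrase_query(index, phrase):
    """Documents in which the words of the phrase occur at p, p+1, p+2, ..."""
    words = phrase.split()
    if not words or any(word not in index for word in words):
        return set()
    hits = set()
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            if all(start + offset in index[word].get(doc_id, [])
                   for offset, word in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)
print(index["to"]["C"])
print(sorted(phrase_query(index, "Help ETH Zurich to flexibly react")))
```

Unlike the biword index, document D is correctly rejected for the long phrase, and the vocabulary stays the same size as in a standard inverted index; the cost is larger postings.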

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morpheme by morpheme: eat + go to + want + past <frustration> + inferential/evidential (turns out) + 3rd/3rd + also

Translation: "Also, it turns out she/he wanted to go eat it, but ..."

Source: Polysynthetic Language: Central Siberian Yupik, W. J. De Reuse

Tokenize

You come most carefully upon your hour
Take thy fair hour, Laertes; time be thine
My hour is almost come
Possess it merely. That it should come to this

Token = grouped sequence of characters (each occurrence counts separately)

Type = equivalence class of tokens with the same sequence of characters (for example, "hour" occurs three times above but is a single type)
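The difference between tokens and types can be checked on the four lines above. This is a minimal sketch: the tokenizer (maximal runs of ASCII letters) and the names `LINES`, `tokenize`, and `type_counts` are assumptions made for illustration, not something the slides prescribe.

```python
import re
from collections import Counter

# The four example lines, with punctuation already removed.
LINES = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

def tokenize(text):
    """Naive tokenizer (assumption): a token is a maximal run of ASCII letters."""
    return re.findall(r"[A-Za-z]+", text)

# Every occurrence is a token.
tokens = [tok for line in LINES for tok in tokenize(line)]

# Tokens with exactly the same character sequence form one type.
type_counts = Counter(tokens)

print(len(tokens), "tokens,", len(type_counts), "types")
print("occurrences of 'hour':", type_counts["hour"])
```

No case folding is applied here, so "That" and "that" would be different types; merging them is the job of normalization.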

Stop words

Reuters's list: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with

Trend: from a large list (200-300 terms) to no list at all
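Filtering with the list above is a set lookup. A small sketch, assuming tokens are compared case-insensitively; the names `REUTERS_STOP_WORDS` and `remove_stop_words` are hypothetical.

```python
# The 25 stop words of the list above.
REUTERS_STOP_WORDS = frozenset(
    "a an and are as at be by for from has he in is it its of on "
    "that the to was were will with".split()
)

def remove_stop_words(tokens, stop_words=REUTERS_STOP_WORDS):
    """Drop every token whose lowercased form is in the stop list."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

line = "Possess it merely That it should come to this".split()
print(remove_stop_words(line))
```

With no stop list at all, the function is simply called with an empty set, and every token is kept.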

Equivalence classes of terms (types)

{U.S.A., USA}
{to-day, today}
{window, windows, Windows}
{Zurich, Zürich, Zuerich, Züri}

Normalization

U.S.A. -> USA
to-day -> today

Rules that remove characters are the easy part.

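Expansion (rather than equivalence classes)

Query term -> terms searched
Windows -> Windows
windows -> windows, Windows, window
window -> window, windows

Introduces asymmetry: a query for windows also matches Windows, but a query for Windows does not match windows.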

When to expand

Original postings:
Lift -> 1, 5
Elevator -> 1, 4, 6

Upon indexing: the expansion is applied to the index. Both terms point to the merged postings, and the query Lift is looked up unchanged.
Lift -> 1, 4, 5, 6
Elevator -> 1, 4, 5, 6

Upon querying: the index is left unchanged, and the expansion is applied to the query.
Lift is rewritten to Lift OR Elevator.
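The two options can be compared side by side. This is a sketch under assumptions: the postings and the synonym table are illustrative values, and `expand_index` and `search_with_query_expansion` are hypothetical helper names.

```python
# Illustrative postings (assumed values) and a symmetric synonym table.
INDEX = {"lift": [1, 5], "elevator": [1, 4, 6]}
SYNONYMS = {"lift": ["elevator"], "elevator": ["lift"]}

def union(a, b):
    """Merge two postings lists into one sorted list without duplicates."""
    return sorted(set(a) | set(b))

def expand_index(index, synonyms):
    """Expansion upon indexing: each term stores the postings of its synonyms too."""
    expanded = {}
    for term, postings in index.items():
        merged = sorted(postings)
        for other in synonyms.get(term, []):
            merged = union(merged, index.get(other, []))
        expanded[term] = merged
    return expanded

def search_with_query_expansion(index, synonyms, term):
    """Expansion upon querying: evaluate `term OR synonym OR ...` on the plain index."""
    result = sorted(index.get(term, []))
    for other in synonyms.get(term, []):
        result = union(result, index.get(other, []))
    return result

at_index_time = expand_index(INDEX, SYNONYMS)["lift"]
at_query_time = search_with_query_expansion(INDEX, SYNONYMS, "lift")
print(at_index_time, at_query_time)
```

Both routes return the same documents. Expanding the index costs space and a rebuild whenever the synonym table changes, while expanding the query costs an extra union at search time.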

Normalization: accents and diacritics

cliché -> cliche
Zürich -> Zuerich

Normalization: lowercasing

CAT -> cat
Schweiz -> schweiz

Normalization: truecasing

"The company Apple has launched a new iProduct" -> Apple
"Apples fall from trees" -> apples

Machine learning inside
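The character-level rules (removing periods and hyphens, stripping diacritics, lowercasing) fit in a few lines. A minimal sketch using only the standard library; `strip_diacritics` and `normalize` are hypothetical names. Note the assumption that diacritics are simply dropped, so Zürich becomes zurich rather than the German transliteration Zuerich. Truecasing is not covered, since it needs a trained model.

```python
import unicodedata

def strip_diacritics(token):
    """Decompose accented characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(token):
    """Map a token to the representative of its equivalence class."""
    token = token.replace(".", "").replace("-", "")  # U.S.A. -> USA, to-day -> today
    token = strip_diacritics(token)                  # cliché -> cliche
    return token.lower()                             # CAT -> cat

for raw in ["U.S.A.", "to-day", "cliché", "Zürich", "CAT", "Schweiz"]:
    print(raw, "->", normalize(raw))
```

Because lowercasing is applied unconditionally, Windows and windows end up in the same class, which is exactly the loss of information that expansion or truecasing tries to avoid.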

Stemming

analysis, feature, are, easily, visible, variations, individual
-> analysi, featur, ar, easili, visibl, variat, individu

Chop letters off the end of the word.

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

Condition  Rule                Example
(m>0)      ENCI    -> ENCE     valenci        -> valence
(m>0)      ANCI    -> ANCE     hesitanci      -> hesitance
(m>0)      IZER    -> IZE      digitizer      -> digitize
(m>0)      ABLI    -> ABLE     conformabli    -> conformable
(m>0)      ALLI    -> AL       radicalli      -> radical
(m>0)      ENTLI   -> ENT      differentli    -> different
(m>0)      ELI     -> E        vileli         -> vile
(m>0)      OUSLI   -> OUS      analogousli    -> analogous
(m>0)      IZATION -> IZE      vietnamization -> vietnamize
(m>0)      ATION   -> ATE      predication    -> predicate
(m>0)      ATOR    -> ATE      operator       -> operate
(m>0)      ALISM   -> AL       feudalism      -> feudal
(m>0)      IVENESS -> IVE      decisiveness   -> decisive
(m>0)      FULNESS -> FUL      hopefulness    -> hopeful
(m>0)      OUSNESS -> OUS      callousness    -> callous

m is the measure of the stem that remains once the suffix is removed: roughly, the number of vowel-consonant sequences it contains.
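The rules above can be applied mechanically. The sketch below covers only the fifteen rules listed here, not the complete Porter algorithm; `measure` follows Porter's definition of m (number of vowel-run to consonant-run transitions in the stem), and the names `RULES`, `is_consonant`, `measure`, and `apply_rules` are chosen for illustration.

```python
# (suffix, replacement) pairs from the slide, all guarded by the condition m > 0.
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]
# Try longer suffixes first, so that "ization" wins over "ation".
RULES_BY_LENGTH = sorted(RULES, key=lambda rule: len(rule[0]), reverse=True)

def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u, or a y that follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Porter's m: how many times a run of vowels is followed by a run of consonants."""
    m = 0
    previous_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m

def apply_rules(word):
    """Apply the longest matching rule if the remaining stem has m > 0."""
    word = word.lower()
    for suffix, replacement in RULES_BY_LENGTH:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

for example in ["valenci", "vietnamization", "operator", "callousness", "ration"]:
    print(example, "->", apply_rules(example))
```

The last example shows the role of the condition: in "ration" the stem left after removing ATION is "r", whose measure is 0, so the word is left unchanged by these rules.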

Porter Stemmer

Comparison with the Lovins and Paice stemmers. Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

[Figure: the Vauquois triangle. Levels from bottom to top, on both sides: words, syntactic structure, semantic structure, with interlingua at the apex. Horizontal paths: direct, syntactic transfer, semantic transfer.]

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

Lemmatization or Stemming

Lemmatization does not help, and can even degrade performance, for English documents.

Lemmatization can help with languages that have more morphology.

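Skip lists

Reminder: Postings (Standard Inverted Index)

Term (document frequency) -> postings

you (1) -> 1
come (3) -> 1, 3, 4
most (1) -> 1
carefully (1) -> 1
upon (1) -> 1
your (1) -> 1
thine (1) -> 2
be (1) -> 2
time (1) -> 2
Laertes (1) -> 2
thy (1) -> 2
fair (1) -> 2
take (1) -> 2
my (1) -> 3
hour (3) -> 1, 2, 3
is (1) -> 3
almost (1) -> 3
possess (1) -> 4
merely (1) -> 4
that (1) -> 4
it (1) -> 4
should (1) -> 4
to (1) -> 4
this (1) -> 4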

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12
List B: 1 3 4 6 7 8 11 12

Both lists are scanned in parallel until the end of one of them is reached.

Intersection of A and B: 1 4 8 12
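The reminder above corresponds to the usual linear merge of two sorted postings lists. A minimal sketch (the function name `intersect` is an assumption):

```python
def intersect(a, b):
    """Intersect two sorted postings lists with one pointer per list."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document ID in both lists: keep it and advance both pointers.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot be in b any more
        else:
            j += 1  # b[j] cannot be in a any more
    return result

list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.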

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12
List B: 1 3 10 11 12 14

Intersection of A and B: 1 3 10 12

After the match on 3, the pointer on B is at 10, while the pointer on A steps through 4, 5, 6, 7, 8 and 9, one posting at a time, before it reaches 10.

Another example

How could we make this faster?

List A: 1 2 3 4 5 6 7 8 9 10 12
List B: 1 3 10 11 12 14

Intersection of A and B so far: 1 3

Magical pointer: with the pointer on B at 10, a pointer that takes A directly to 10 skips the postings in between, and 10 is added to the intersection.

In practice

Postings list: 1 2 3 4 5 6 7 8 9 10 12
Skip pointer targets: 4 7 10

Short skips: many comparisons, waste of space.

Long skips: few comparisons, but not many real opportunities to skip.

In practice: skips of length √p, where p is the number of postings.

Example: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 (p = 16, so skips of length 4)
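A sketch of intersection with skip pointers, assuming evenly spaced skips of length floor(√p) stored as a map from position to target position. The names `build_skips` and `intersect_with_skips` are hypothetical; real systems typically store the pointers inside the postings list itself.

```python
import math

def build_skips(postings):
    """Map a position to the position floor(sqrt(p)) further on, at evenly spaced positions."""
    p = len(postings)
    step = math.isqrt(p)
    if step < 2:
        return {}  # a skip of length 1 is just the normal next pointer
    return {i: i + step for i in range(0, p - step, step)}

def intersect_with_skips(a, b):
    """Linear merge that follows a skip pointer whenever the target does not overshoot."""
    skips_a, skips_b = build_skips(a), build_skips(b)
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            if i in skips_a and a[skips_a[i]] <= b[j]:
                i = skips_a[i]  # jump: everything in between is smaller than b[j]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                j = skips_b[j]
            else:
                j += 1
    return result

list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
skip_targets = [list_a[t] for t in build_skips(list_a).values()]
print(skip_targets)
print(intersect_with_skips(list_a, list_b))
```

On list A (11 postings, skip length 3) the skip targets are 4, 7 and 10. After the match on 3, the pointer on A goes 4 -> 7 -> 10 in two jumps instead of six single steps.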

Phrase search

Phrase queries

"ETH Zürich" (quotes = phrase search)

With a posting list

ETH: 1 2 4 5 8 9 10
Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms: the intersection (1, 4, 8) only tells us that both terms occur in these documents, not that they are adjacent.

Phrase search approaches

Biword indices
Positional indices

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (every pair of consecutive words is a term):

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to
(and so on for the rest of the sentence)

Query: "ETH Zurich"
Looked up directly as the biword ETH Zurich.

Query: "Help ETH Zurich to flexibly react"
Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react
The postings of these biwords are then intersected.
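A biword index can be sketched as a dictionary from word pairs to sets of document IDs. The document IDs 1 and 2, the second document, and the helper names below are assumptions made for illustration.

```python
def biwords(text):
    """All pairs of consecutive words in a text, as 'w1 w2' strings."""
    words = text.split()
    return [f"{w1} {w2}" for w1, w2 in zip(words, words[1:])]

def build_biword_index(docs):
    """Map each biword to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index.setdefault(biword, set()).add(doc_id)
    return index

def biword_phrase_query(index, phrase):
    """Rewrite the phrase as an AND of its biwords and intersect their postings."""
    result = None
    for biword in biwords(phrase):
        postings = index.get(biword, set())
        result = postings if result is None else result & postings
    return sorted(result) if result else []

DOCS = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(DOCS)
print(biword_phrase_query(index, "ETH Zurich"))
print(biword_phrase_query(index, "Help ETH Zurich to flexibly react"))
```

The second query returns both documents, although document 2 does not contain the phrase: all five biwords occur in it, just not contiguously. A phrase of a single word has no biwords, and this sketch returns an empty result for it.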

Inconvenient

Query: "Help ETH Zurich to flexibly react"
Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains all five biwords, so it is returned although it does not contain the phrase.

False positive

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NX*N

Extended biwords:

Help ETH
ETH Zurich
Zurich introduce
introduce techniques
techniques flexibly
flexibly react

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index:

Help ETH Zurich
ETH Zurich to
Zurich to flexibly
to flexibly react
flexibly react to

Rewritten as: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react, then intersect.

Phrase indices (4)

Index:

Help ETH Zurich to
ETH Zurich to flexibly
Zurich to flexibly react
to flexibly react to

Rewritten as: Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react, then intersect.

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"
Rewritten as: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

The document does not contain Zurich to flexibly, so it is no longer returned.

True negative

Inconvenient

Size of the vocabulary increases exponentially with the phrase length n: up to (#Terms)^n entries.

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting: for each term, the document ID (as before), the term frequency, and the positions of the term in the document.

Index (term -> document ID, term frequency: positions)

Help -> C, 1: 1
ETH -> C, 1: 2
Zurich -> C, 1: 3
to -> C, 3: 4, 7, 11
flexibly -> C, 1: 5
react -> C, 1: 6

Query: "ETH Zurich"

ETH -> C, 1: 2
Zurich -> C, 1: 3

Both terms occur in document C, and the position of Zurich (3) equals the position of ETH (2) + 1, so C matches the phrase.
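A positional index and the "+1" check for phrase queries can be sketched as follows. Positions are counted from 1, as on the slide. The second document D and the function names are assumptions for illustration, and the linear membership test on position lists is kept simple rather than efficient.

```python
def build_positional_index(docs):
    """Map term -> {document ID -> list of positions}; term frequency is the list length."""
    index = {}
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index

def positional_phrase_query(index, phrase):
    """Return the documents in which the terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return []
    matches = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # The k-th term of the phrase must sit at position start + k.
            if all(start + k in index[term].get(doc_id, [])
                   for k, term in enumerate(terms)):
                matches.append(doc_id)
                break
    return sorted(matches)

DOCS = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(DOCS)
print(index["to"]["C"])
print(positional_phrase_query(index, "ETH Zurich"))
print(positional_phrase_query(index, "Help ETH Zurich to flexibly react"))
```

Unlike an index on pairs of consecutive words, the positional check rejects document D for the longer phrase, because "flexibly" is not at the position right after "to" that follows "Zurich".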



72

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Glosses, morpheme by morpheme: eat, go to, want, past, <frustration>, inferential evidential (turns out), 3rd 3rd, also

Also, it turns out she/he wanted to go eat it, but...

80

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

81

Token

Token = grouped sequence of characters

82

Type

Type = equivalence class of tokens (same sequence of characters)

84

Stop words

85

Reuters' list

a an and are as at be by for from has he in

is it its of on that the to was were will with
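
The three notions above (token, type, stop word) can be illustrated with a minimal sketch. The tokenization rule (maximal runs of ASCII letters) and the function names are assumptions made for this example, not something the slides prescribe.

```python
import re

# The 25-word stop list shown on the slide.
STOP_WORDS = set(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)

def tokenize(text):
    """Split text into tokens: maximal runs of letters (a deliberately naive rule)."""
    return re.findall(r"[A-Za-z]+", text)

def types(tokens):
    """A type is the class of all tokens with the same character sequence."""
    return set(tokens)

def remove_stop_words(tokens):
    """Drop tokens whose lowercased form is in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

line = "Possess it merely That it should come to this"
tokens = tokenize(line)
print(len(tokens))                # 9 tokens
print(len(types(tokens)))         # 8 types, because "it" occurs twice
print(remove_stop_words(tokens))  # ['Possess', 'merely', 'should', 'come', 'this']
```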

86

Trend

From a large list (200-300) to no list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings lists of Lift and Elevator are expanded; the query stays Lift

Upon querying: the postings lists stay unchanged; the query Lift is expanded to Lift OR Elevator

[Figure: postings lists of Lift and Elevator, before and after expansion]
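
The two options can be compared with a small sketch. The expansion table and the document IDs below are illustrative assumptions, not the values from the slide figure; the table is asymmetric on purpose (lift expands to elevator, but not the other way around).

```python
# Hypothetical expansion table: query term -> terms to search for.
EXPANSIONS = {"lift": ["lift", "elevator"]}

def union(lists):
    """Sorted union of several postings lists."""
    return sorted(set().union(*lists))

def search_expanding_query(index, term):
    """Expansion upon querying: the index is untouched, the query becomes an OR."""
    return union(index.get(t, []) for t in EXPANSIONS.get(term, [term]))

def expand_index(index):
    """Expansion upon indexing: each term stores the union of its expansions' postings."""
    terms = set(index) | set(EXPANSIONS)
    return {t: union(index.get(e, []) for e in EXPANSIONS.get(t, [t])) for t in terms}

index = {"lift": [1, 5], "elevator": [4, 6]}   # illustrative document IDs
expanded = expand_index(index)

print(search_expanding_query(index, "lift"))   # [1, 4, 5, 6]
print(expanded["lift"])                        # [1, 4, 5, 6]
print(expanded["elevator"])                    # [4, 6]
```

Both strategies return the same documents; expanding the index costs space and indexing time, while expanding the query costs time on every lookup.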

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside
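
A minimal sketch of the rule-based normalizations above (diacritics and lowercasing). The German transliteration table is an assumption added so that Zürich maps to zuerich as on the slide; plain Unicode decomposition alone would give zurich. Truecasing is left out because it needs a trained model.

```python
import unicodedata

# Hypothetical transliteration table for German (an assumption for this sketch).
GERMAN_UMLAUTS = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss",
})

def strip_diacritics(token):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(token, german=False):
    """Optionally transliterate umlauts, then strip diacritics and lowercase."""
    if german:
        token = token.translate(GERMAN_UMLAUTS)
    return strip_diacritics(token).lower()

print(normalize("cliché"))               # cliche
print(normalize("Zürich", german=True))  # zuerich
print(normalize("Zürich"))               # zurich
print(normalize("CAT"))                  # cat
```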

97

Stemming

analysis -> analysi

feature -> featur

are -> ar

easily -> easili

visible -> visibl

variations -> variat

individual -> individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE        valenci -> valence
(m>0) ANCI -> ANCE        hesitanci -> hesitance
(m>0) IZER -> IZE         digitizer -> digitize
(m>0) ABLI -> ABLE        conformabli -> conformable
(m>0) ALLI -> AL          radicalli -> radical
(m>0) ENTLI -> ENT        differentli -> different
(m>0) ELI -> E            vileli -> vile
(m>0) OUSLI -> OUS        analogousli -> analogous
(m>0) IZATION -> IZE      vietnamization -> vietnamize
(m>0) ATION -> ATE        predication -> predicate
(m>0) ATOR -> ATE         operator -> operate
(m>0) ALISM -> AL         feudalism -> feudal
(m>0) IVENESS -> IVE      decisiveness -> decisive
(m>0) FULNESS -> FUL      hopefulness -> hopeful
(m>0) OUSNESS -> OUS      callousness -> callous
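
The rules above can be applied with a short sketch. It covers only the fifteen rules listed on this slide, not the whole Porter algorithm; m is Porter's measure of the stem (the number of vowel-to-consonant transitions), and the function names are chosen for this example.

```python
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]

def is_consonant(word, i):
    """Porter's definition: not a, e, i, o, u, and not a 'y' that follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m = number of vowel-to-consonant transitions in the stem."""
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m

def apply_rules(word):
    """Apply the rule with the longest matching suffix, if the stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

print(apply_rules("vietnamization"))  # vietnamize
print(apply_rules("hopefulness"))     # hopeful
print(apply_rules("ration"))          # ration (stem "r" has m = 0)
```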

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

[Figure: the Vauquois triangle. Levels on each side, from bottom to top: words, syntactic structure, semantic structure, with the interlingua at the apex. Transfers between the two sides: direct, syntactic, semantic.]

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help, and can even degrade performance, for English documents

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term        Document frequency   Postings
you         1                    1
come        3                    1 3 4
most        1                    1
carefully   1                    1
upon        1                    1
your        1                    1
thine       1                    2
be          1                    2
time        1                    2
Laertes     1                    2
thy         1                    2
fair        1                    2
take        1                    2
my          1                    3
hour        3                    1 2 3
is          1                    3
almost      1                    3
possess     1                    4
merely      1                    4
that        1                    4
it          1                    4
should      1                    4
to          1                    4
this        1                    4
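
As a reminder of how such postings come about, here is a minimal sketch that builds the inverted index from the four example lines, numbered 1 to 4. Splitting on whitespace and lowercasing every token are simplifying assumptions of this sketch.

```python
from collections import defaultdict

DOCS = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}

def build_index(docs):
    """Map each term to the sorted list of IDs of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            postings[token.lower()].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_index(DOCS)
print(index["come"])       # [1, 3, 4]
print(index["hour"])       # [1, 2, 3]
print(len(index["come"]))  # document frequency: 3
```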

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
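
The intersection of two sorted postings lists can be sketched as a linear merge with one pointer per list. The function name is chosen for this example.

```python
def intersect(a, b):
    """Intersect two sorted lists of document IDs in O(len(a) + len(b))."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])   # same document in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                # advance the pointer on the smaller ID
        else:
            j += 1
    return result

list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))  # [1, 4, 8, 12]
```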

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B, built up step by step: 1, then 1 3, then 1 3 10, then 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: jump in List A directly to 10

Next match: 10

127

In practice

List: 1 2 3 4 5 6 7 8 9 10 12

Skip pointers: 4 7 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, but not many real opportunities to skip

In practice: √p skips, where p is the number of postings (example list: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16)
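
A minimal sketch of intersection with skip pointers, assuming a skip span of about √p on each list. Here the skips are simulated by index arithmetic on Python lists; a real index would store the pointers inside the postings list.

```python
import math

def advance(postings, pos, span, target):
    """Follow the skip pointer at pos if there is one and it does not overshoot target."""
    has_skip = pos % span == 0 and pos + span < len(postings)
    if has_skip and postings[pos + span] <= target:
        return pos + span
    return pos + 1

def intersect_with_skips(a, b):
    """Intersect two sorted postings lists, skipping ahead where it is safe."""
    span_a = max(1, int(math.sqrt(len(a))))
    span_b = max(1, int(math.sqrt(len(b))))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, span_a, b[j])
        else:
            j = advance(b, j, span_b, a[i])
    return result

list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))  # [1, 3, 10, 12]
```

A skip is safe because the lists are sorted: every posting jumped over is smaller than the skip target, which is itself at most the ID being looked for.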

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Document: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to, ...

Query "ETH Zurich": look up the biword ETH Zurich

Query "Help ETH Zurich to flexibly react": Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Drawback

Query: "Help ETH Zurich to flexibly react"

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains every biword of the query, but not the phrase itself

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
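
A minimal sketch of the extended biwords, assuming the N/X tags are already available. The tags are copied from the slide and no real part-of-speech tagger is called. Matching N X* N amounts to pairing each N word with the next N word.

```python
# (token, tag) pairs as on the slide.
TAGGED = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]

def extended_biwords(tagged):
    """Pair each N token with the next N token, skipping any X tokens in between."""
    content = [token for token, tag in tagged if tag == "N"]
    return [f"{left} {right}" for left, right in zip(content, content[1:])]

print(extended_biwords(TAGGED))
# ['Help ETH', 'ETH Zurich', 'Zurich introduce', 'introduce techniques',
#  'techniques flexibly', 'flexibly react']
```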

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to, ...

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to, ...

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

True negative
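
The biword index and the longer phrase indices above can be sketched with a single parameter k (k = 2 for biwords, k = 3 for the phrase index of size 3). The document IDs 1 and 2 and the function names are assumptions made for this example.

```python
from collections import defaultdict

def kwords(tokens, k):
    """All sequences of k consecutive tokens, joined into one vocabulary entry."""
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

def build_kword_index(docs, k=2):
    """Map each k-word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for entry in kwords(text.split(), k):
            index[entry].add(doc_id)
    return index

def phrase_candidates(index, phrase, k=2):
    """AND together the postings of all k-words of the phrase."""
    parts = kwords(phrase.split(), k)
    if not parts:
        return set()
    result = set(index.get(parts[0], set()))
    for part in parts[1:]:
        result &= index.get(part, set())
    return result

DOCS = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
QUERY = "Help ETH Zurich to flexibly react"

print(phrase_candidates(build_kword_index(DOCS, 2), QUERY, 2))  # {1, 2}: document 2 is a false positive
print(phrase_candidates(build_kword_index(DOCS, 3), QUERY, 3))  # {1}: document 2 is now a true negative
```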

176

Drawback

Size of the vocabulary increases exponentially: (#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting: document ID (as before), term frequency, positions

Index:

Help: C, 1, positions 1
ETH: C, 1, positions 2
Zurich: C, 1, positions 3
to: C, 3, positions 4 7 11
flexibly: C, 1, positions 5
react: C, 1, positions 6

190

Positional index

Query: "ETH Zurich"

ETH: C, 1, positions 2

Zurich: C, 1, positions 3

The position of Zurich is the position of ETH +1
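
A minimal sketch of a positional index and of a phrase query on it. Positions start at 1 as on the slides; the function names and whitespace tokenization are assumptions made for this example. The term frequency is simply the length of the position list.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {document ID: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.split(), start=1):
            index[token].setdefault(doc_id, []).append(position)
    return index

def phrase_query(index, phrase):
    """Return the documents in which the terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return []
    # Candidate documents contain every term of the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        starts = index[terms[0]][doc_id]
        # The k-th term of the phrase must sit at start + k.
        if any(all(start + k in index[term][doc_id] for k, term in enumerate(terms))
               for start in starts):
            hits.append(doc_id)
    return hits

DOCS = {"C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future"}
index = build_positional_index(DOCS)

print(index["to"]["C"])                   # [4, 7, 11], term frequency 3
print(phrase_query(index, "ETH Zurich"))  # ['C']
print(phrase_query(index, "Zurich ETH"))  # []
```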

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Drawback

Size of the vocabulary increases exponentially with the phrase length n: |Terms|^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Positional posting = document ID (as before), term frequency, positions

Term      Document ID  Term frequency  Positions
Help      C            1               1
ETH       C            1               2
Zurich    C            1               3
to        C            3               4 7 11
flexibly  C            1               5
react     C            1               6

190

Positional index

Query: ETH Zurich|

Term      Document ID  Term frequency  Positions
ETH       C            1               2
Zurich    C            1               3

Match: in document C, Zurich occurs at the position of ETH + 1.
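A minimal sketch of a positional index and of the "+1" check for phrase queries, assuming whitespace tokenization and positions counted from 1 as on the slides. The function names are made up for this example, and the term frequency is simply the length of each position list.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map term -> {doc_id: [positions]}, with positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Doc IDs in which the words of the phrase occur at consecutive positions."""
    terms = phrase.split()
    hits = []
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            # The k-th following word must sit at position start + k.
            if all(start + offset in index.get(term, {}).get(doc_id, [])
                   for offset, term in enumerate(terms[1:], start=1)):
                hits.append(doc_id)
                break
    return hits


documents = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_positional_index(documents)
result = phrase_query(index, "ETH Zurich")
```

The posting of "to" in document C carries the positions 4, 7 and 11, and the query "ETH Zurich" matches because 3 = 2 + 1.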

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

74

Polysynthetic Languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Glosses, in order: eat, go to, want, Past, <Frustration>, inferential evidential (turns out), 3rd, 3rd, also

Translation: Also, it turns out she/he wanted to go eat it, but ...

80

Tokenize

You come most carefully upon your hour
Take thy fair hour, Laertes; time be thine
My hour is almost come
Possess it merely. That it should come to this

Token = grouped sequence of characters

Type = equivalence class of tokens (same sequence of characters)
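The difference between tokens and types can be made concrete on the four lines above. This is a sketch under two assumptions: the punctuation of the second and fourth line is restored by hand, and a token is taken to be a maximal run of ASCII letters.

```python
import re

lines = [
    "You come most carefully upon your hour",
    "Take thy fair hour, Laertes; time be thine",
    "My hour is almost come",
    "Possess it merely. That it should come to this",
]


def tokenize(text):
    """Split text into tokens, here simply maximal runs of letters."""
    return re.findall(r"[A-Za-z]+", text)


tokens = [token for line in lines for token in tokenize(line)]
types = set(tokens)  # one representative per class of identical tokens
```

The four lines contain more tokens than types, because "hour", "come" and "it" occur several times.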

84

Stop words

Reuters stop list:
a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with

86

Trend

Large list (200-300) -> No list at all

87

Equivalence classes of terms (types)

today, to-day
U.S.A., USA
window, windows, Windows
Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA
to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows
windows -> windows, Windows, window
window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings lists of Lift and Elevator are merged, so that each of the two terms points to the documents containing either one. The query Lift| is left unchanged.

Upon querying: the postings lists are left unchanged, and the query Lift| is expanded to Lift OR Elevator.

(Figure: postings lists of Lift and Elevator before and after expansion.)
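The two options can be contrasted in a short sketch. The postings below are hypothetical values chosen for illustration, not the ones on the slide, and the synonym table and function names are assumptions as well.

```python
# Hypothetical postings and synonym table, for illustration only.
postings = {"lift": [1, 5], "elevator": [4, 6]}
synonyms = {"lift": ["elevator"], "elevator": ["lift"]}


def expand_index(postings, synonyms):
    """Expansion upon indexing: each term also gets the postings of its synonyms."""
    expanded = {}
    for term, docs in postings.items():
        merged = set(docs)
        for other in synonyms.get(term, []):
            merged.update(postings.get(other, []))
        expanded[term] = sorted(merged)
    return expanded


def expand_query(term, postings, synonyms):
    """Expansion upon querying: evaluate term OR synonym_1 OR synonym_2 ..."""
    result = set(postings.get(term, []))
    for other in synonyms.get(term, []):
        result.update(postings.get(other, []))
    return sorted(result)


at_index_time = expand_index(postings, synonyms)["lift"]
at_query_time = expand_query("lift", postings, synonyms)
```

Both variants return the same documents for the query. Expanding upon indexing pays with longer postings lists, while expanding upon querying pays with a longer query.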

94

Normalization: accents and diacritics

cliché -> cliche
Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat
Schweiz -> schweiz
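Removing diacritics and lowercasing can be combined in one small function. This is a sketch, not a complete normalizer: the German transliteration table is an assumption that reproduces the Zuerich example, and all other accents are dropped through Unicode decomposition with the standard `unicodedata` module.

```python
import unicodedata

# Assumed transliteration table for German umlauts and sharp s.
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}


def normalize(token):
    """Transliterate umlauts, strip remaining diacritics, then lowercase."""
    token = unicodedata.normalize("NFC", token)  # compose first so the table applies
    token = "".join(GERMAN.get(ch, ch) for ch in token)
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


examples = {word: normalize(word) for word in ["cliché", "Zürich", "CAT", "Schweiz"]}
```

Lowercasing everything loses the distinction that truecasing tries to keep, so in practice the last step would be a design decision of its own.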

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple
Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer

(m>0) ENCI    -> ENCE   valenci        -> valence
(m>0) ANCI    -> ANCE   hesitanci      -> hesitance
(m>0) IZER    -> IZE    digitizer      -> digitize
(m>0) ABLI    -> ABLE   conformabli    -> conformable
(m>0) ALLI    -> AL     radicalli      -> radical
(m>0) ENTLI   -> ENT    differentli    -> different
(m>0) ELI     -> E      vileli         -> vile
(m>0) OUSLI   -> OUS    analogousli    -> analogous
(m>0) IZATION -> IZE    vietnamization -> vietnamize
(m>0) ATION   -> ATE    predication    -> predicate
(m>0) ATOR    -> ATE    operator       -> operate
(m>0) ALISM   -> AL     feudalism      -> feudal
(m>0) IVENESS -> IVE    decisiveness   -> decisive
(m>0) FULNESS -> FUL    hopefulness    -> hopeful
(m>0) OUSNESS -> OUS    callousness    -> callous
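The rules above can be applied with a short sketch. It covers only this one group of rules, not the full Porter stemmer, and it assumes lowercase input. The condition (m>0) refers to the measure m of the remaining stem, that is, the number of vowel-consonant sequences in it; the function names are made up for this example.

```python
def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number m of vowel-consonant sequences in a stem of the form [C](VC)^m[V]."""
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m


RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]
# Try the longest suffix first, so that "ization" wins over "ation".
RULES.sort(key=lambda rule: len(rule[0]), reverse=True)


def apply_rules(word):
    """Rewrite the longest matching suffix if the remaining stem has m > 0."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


stemmed = [apply_rules(w) for w in ["valenci", "vietnamization", "hopefulness"]]
```

Only the longest matching suffix is tried, and a word is left unchanged when the stem is too short to satisfy m > 0.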

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

(Figure: the Vauquois Triangle, with the levels Words, Syntactic Structure, Semantic Structure and Interlingua on each side, connected by Direct, Syntactic and Semantic transfer.)

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents.

Lemmatization can help with languages that have more morphology.

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term       Document frequency  Postings
you        1                   1
come       3                   1 3 4
most       1                   1
carefully  1                   1
upon       1                   1
your       1                   1
thine      1                   2
be         1                   2
time       1                   2
Laertes    1                   2
thy        1                   2
fair       1                   2
take       1                   2
my         1                   3
hour       3                   1 2 3
is         1                   3
almost     1                   3
possess    1                   4
merely     1                   4
that       1                   4
it         1                   4
should     1                   4
to         1                   4
this       1                   4
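The postings above can be rebuilt from the four lines of the tokenization example with a few lines of Python. This sketch assumes lowercasing as the only normalization and document IDs 1 to 4 in the order of the lines.

```python
import re
from collections import defaultdict

documents = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour, Laertes; time be thine",
    3: "My hour is almost come",
    4: "Possess it merely. That it should come to this",
}


def build_index(documents):
    """Map each lowercased term to the sorted list of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index[token].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


index = build_index(documents)
document_frequency = {term: len(postings) for term, postings in index.items()}
```

Because a document is treated as a set of words, "it" appears only once in the postings of document 4 although it occurs twice in the text.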

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12
List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
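As a reminder in code, the intersection walks both sorted lists with one pointer each and always advances the pointer that sits on the smaller document ID. A minimal sketch:

```python
def intersect(a, b):
    """Intersect two sorted postings lists in O(len(a) + len(b)) steps."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
common = intersect(list_a, list_b)
```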

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: jump in List A directly to 10, the next candidate from List B.

127

In practice

1 2 3 4 5 6 7 8 9 10 12
Skip pointers: 4 7 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skip pointers, where p is the number of postings
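The intersection with skip pointers can be sketched as follows. The sketch assumes that skip pointers sit at every position that is a multiple of the integer square root of the list length, and it represents them by index arithmetic rather than by stored pointers.

```python
import math


def intersect_with_skips(a, b):
    """Intersect two sorted postings lists, following skip pointers when possible."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Follow a skip only if its target does not overshoot b[j].
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
common = intersect_with_skips(list_a, list_b)
```

On these lists, after matching 3 the pointer in List A jumps from 4 to 7 and then to 10 instead of visiting every posting in between.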

133

Phrase search

134

Phrase queries


136

With a posting list

"ETH Zürich" (quotes = phrase search)

ETH:    1 2 4 5 8 9 10
Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices
Positional indices

142

Bi-word indices

Document: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index:
Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

151

Bi-word indices

Query: ETH Zurich|

Index:
Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

153

Bi-word indices

Query: Help ETH Zurich to flexibly react|

Index:
Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Intersect: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

157

Drawback

Query: Help ETH Zurich to flexibly react|

Rewritten as: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react


75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Morphemes, in order: eat, go to, want, past, <frustration>, inferential evidential ("turns out"), 3rd, 3rd, also

Translation: "Also, it turns out she/he wanted to go eat it, but ..."

Tokenize

You come most carefully upon your hour
Take thy fair hour Laertes time be thine
My hour is almost come
Possess it merely That it should come to this

Token

Token = grouped sequence of characters

Type

Type = equivalence class of tokens (same sequence of characters)

Stop words

Reuters's list

a an and are as at be by for from has he in is it its of on that the to was were will with

Trend

Large list (200-300) -> no list at all
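To make the distinction between tokens, types, and stop words concrete, here is a minimal Python sketch over the four lines above. The regular-expression tokenizer, the lowercasing of types, and the names `tokenize` and `REUTERS_STOP_WORDS` are assumptions made for this illustration, not something the slides prescribe.

```python
import re

# Stop word list as shown on the slide (Reuters's list).
REUTERS_STOP_WORDS = set(
    "a an and are as at be by for from has he in is it its of on "
    "that the to was were will with".split()
)

def tokenize(text):
    """Split a text into tokens (assumption: a token is a run of letters)."""
    return re.findall(r"[A-Za-z]+", text)

documents = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# Tokens: every occurrence counts.
tokens = [token for doc in documents for token in tokenize(doc)]

# Types: equivalence classes of tokens (assumption: case is folded).
types = {token.lower() for token in tokens}

# Types that survive stop word removal.
content_types = types - REUTERS_STOP_WORDS

print(len(tokens), len(types), len(content_types))
```

With no stop word list at all (the trend mentioned above), `content_types` would simply equal `types`.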

Equivalence classes of terms (types)

to-day, today
U.S.A., USA
window, windows, Windows
Zurich, Zürich, Zuerich, Züri

Normalization

U.S.A. -> USA
to-day -> today

Rules that remove characters are the easy part

Expansion (rather than equivalence classes)

Windows -> Windows
windows -> windows, Windows, window
window -> window, windows

Introduces asymmetry

When to expand

Upon indexing

Query: Lift

Expansion: the postings lists of Lift and Elevator are merged in the index, so the unmodified query already retrieves the documents of both terms.

Upon querying

Query: Lift

Expansion: Lift OR Elevator

The index keeps separate postings lists for Lift and Elevator.

[Figure: postings lists of Lift and Elevator before and after expansion]
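The two expansion strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the document IDs, the `equivalents` mapping, and both function names are hypothetical and do not come from the slides.

```python
# Hypothetical postings lists (document IDs chosen for illustration only).
index = {"lift": [1, 5], "elevator": [4, 6]}

# Hypothetical expansion table: which other terms each term expands to.
equivalents = {"lift": ["elevator"], "elevator": ["lift"]}

def expand_at_indexing(index, equivalents):
    """Return a new index in which each postings list also contains
    the postings of the terms it expands to."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for other in equivalents.get(term, []):
            merged |= set(index.get(other, []))
        expanded[term] = sorted(merged)
    return expanded

def search_with_query_expansion(index, term, equivalents):
    """Leave the index untouched and evaluate term OR its expansions."""
    result = set(index.get(term, []))
    for other in equivalents.get(term, []):
        result |= set(index.get(other, []))
    return sorted(result)

expanded_index = expand_at_indexing(index, equivalents)
print(expanded_index["lift"])
print(search_with_query_expansion(index, "lift", equivalents))
```

Both strategies return the same documents for the query; expanding upon indexing costs space in the index, expanding upon querying costs time per query. An asymmetric table such as `{"lift": ["elevator"]}` expands Lift but leaves Elevator alone.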

Normalization: accents and diacritics

cliché -> cliche
Zürich -> Zuerich

Normalization: lowercasing

CAT -> cat
Schweiz -> schweiz

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple
Apples fall from trees -> apples

Machine learning inside
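The accent and lowercasing rules above could be sketched as follows in Python. The function names and the `german` flag are assumptions made for this example. Note that generic Unicode decomposition turns "ü" into "u", so the slide's "Zuerich" needs a language-specific rule, which is modelled here by a small hypothetical umlaut table. Truecasing is not covered, since it requires a trained model.

```python
import unicodedata

# Hypothetical German-specific transliteration table (illustration only).
GERMAN_UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}

def strip_diacritics(token):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(token, german=False):
    """Remove accents and diacritics, then lowercase."""
    if german:
        token = "".join(GERMAN_UMLAUTS.get(ch, ch) for ch in token)
    return strip_diacritics(token).lower()

print(normalize("cliché"))
print(normalize("Zürich"))
print(normalize("Zürich", german=True))
print(normalize("CAT"))
```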

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the word

Porter Stemmer

https://tartarus.org/martin/PorterStemmer

(m>0) ENCI    -> ENCE    valenci        -> valence
(m>0) ANCI    -> ANCE    hesitanci      -> hesitance
(m>0) IZER    -> IZE     digitizer      -> digitize
(m>0) ABLI    -> ABLE    conformabli    -> conformable
(m>0) ALLI    -> AL      radicalli      -> radical
(m>0) ENTLI   -> ENT     differentli    -> different
(m>0) ELI     -> E       vileli         -> vile
(m>0) OUSLI   -> OUS     analogousli    -> analogous
(m>0) IZATION -> IZE     vietnamization -> vietnamize
(m>0) ATION   -> ATE     predication    -> predicate
(m>0) ATOR    -> ATE     operator       -> operate
(m>0) ALISM   -> AL      feudalism      -> feudal
(m>0) IVENESS -> IVE     decisiveness   -> decisive
(m>0) FULNESS -> FUL     hopefulness    -> hopeful
(m>0) OUSNESS -> OUS     callousness    -> callous

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret
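The (m>0) rules listed above can be tried out with a small Python sketch. It covers only the rules shown on the slide, not the full Porter algorithm. The helper names are assumptions for this example, and the measure m is computed in the usual Porter sense, as the number of vowel-consonant sequences in the stem.

```python
# Only the rules listed on the slide: (suffix, replacement), each guarded by m > 0.
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]

def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u,
    or a y that follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the vowel-consonant sequences: the m in [C](VC)^m[V]."""
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m

def apply_rules(word):
    """Apply the rule with the longest matching suffix, if its stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

for example in ["valenci", "vietnamization", "hopefulness", "deli"]:
    print(example, "->", apply_rules(example))
```

The last example stays unchanged: "deli" ends in ELI, but the remaining stem "d" has m = 0, so the condition blocks the rule.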

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

The Vauquois triangle

[Figure: the Vauquois triangle. On each side, the levels rise from words through syntactic structure to semantic structure, with the interlingua at the top. The two sides are connected by direct, syntactic, and semantic transfer.]

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

Lemmatization or stemming

Lemmatization does not help, and can even degrade performance, for English documents.

Lemmatization can help with languages that have more morphology.

Skip lists

Reminder: postings (standard inverted index)

Term        Document frequency   Postings
you         1                    1
come        3                    1 3 4
most        1                    1
carefully   1                    1
upon        1                    1
your        1                    1
thine       1                    2
be          1                    2
time        1                    2
Laertes     1                    2
thy         1                    2
fair        1                    2
take        1                    2
my          1                    3
hour        3                    1 2 3
is          1                    3
almost      1                    3
possess     1                    4
merely      1                    4
that        1                    4
it          1                    4
should      1                    4
to          1                    4
this        1                    4

Reminder: intersection algorithm

List A: 1 2 4 5 8 9 10 12
List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
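As a reminder of how the two sorted postings lists are merged, here is a minimal Python sketch of the linear intersection. The function name `intersect` is chosen for this example.

```python
def intersect(a, b):
    """Intersect two sorted postings lists with one cursor per list."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document ID in both lists: keep it, advance both cursors.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the cursor pointing at the smaller ID
        else:
            j += 1
    return result

list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each step advances at least one cursor, so the running time is linear in the total length of the two lists.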

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

How could we make this faster?

After 3 has been matched, List B moves on to 10, but List A still has to step through 4 5 6 7 8 9 one posting at a time.

Magical pointer: List A jumps directly to 10.

In practice

List: 1 2 3 4 5 6 7 8 9 10 12
Skip pointers to: 4 7 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skip pointers, where p is the number of postings

List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
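A possible way to sketch intersection with skip pointers in Python is shown below. It is an illustration under stated assumptions: the postings are plain lists, a skip pointer is simulated by index arithmetic (one pointer every ⌊√p⌋ postings, starting at the first one), and the names `skip_span` and `intersect_with_skips` are made up for this example.

```python
import math

def skip_span(p):
    """Distance covered by one skip pointer in a list of p postings (about sqrt(p))."""
    return max(1, math.isqrt(p))

def intersect_with_skips(a, b):
    """Intersect two sorted postings lists, following a skip pointer
    whenever its target does not overshoot the other list's current ID."""
    span_a, span_b = skip_span(len(a)), skip_span(len(b))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Skip pointers sit at positions 0, span_a, 2 * span_a, ...
            if i % span_a == 0 and i + span_a < len(a) and a[i + span_a] <= b[j]:
                i += span_a
            else:
                i += 1
        else:
            if j % span_b == 0 and j + span_b < len(b) and b[j + span_b] <= a[i]:
                j += span_b
            else:
                j += 1
    return result

list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

On this example the span of List A is 3, so after matching 3 the cursor moves from 4 to 7 to 10 in two skips instead of six single steps.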

Phrase search

Phrase queries

"ETH Zürich"

Quotes = phrase search

With a posting list

ETH: 1 2 4 5 8 9 10
Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms.

Phrase search approaches

Biword indices
Positional indices

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Query: "ETH Zurich"

Query: "Help ETH Zurich to flexibly react"

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

Inconvenient

Query: "Help ETH Zurich to flexibly react"

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
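The false positive can be reproduced with a small Python sketch of a bi-word index. The document IDs 1 and 2, the whitespace tokenization, and the function names are assumptions made for this illustration.

```python
def biwords(text):
    """Return the set of adjacent word pairs of a text."""
    words = text.split()
    return {" ".join(pair) for pair in zip(words, words[1:])}

def build_biword_index(documents):
    """Map each bi-word to the set of IDs of the documents containing it."""
    index = {}
    for doc_id, text in documents.items():
        for biword in biwords(text):
            index.setdefault(biword, set()).add(doc_id)
    return index

def phrase_candidates(index, phrase):
    """Intersect the postings of all bi-words of the phrase.
    Phrases of fewer than two words yield no bi-words and hence no candidates."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return set.intersection(*postings) if postings else set()

documents = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(documents)

query = "Help ETH Zurich to flexibly react"
candidates = phrase_candidates(index, query)
print(candidates)

# Document 2 is returned although it does not contain the phrase.
false_positives = {d for d in candidates if query not in documents[d]}
print(false_positives)
```

Document 2 contains every bi-word of the query, just not contiguously, which is why the intersection alone cannot rule it out.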

Part-of-speech tagging

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NX*N

Help ETH
ETH Zurich
Zurich introduce
introduce techniques
techniques flexibly
flexibly react
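Extracting the extended bi-words from the tagged sentence above can be sketched as follows in Python. The tags are copied from the slide, whereas the function name `extended_biwords` and the list-of-pairs input format are assumptions for this example. No actual part-of-speech tagger is invoked.

```python
def extended_biwords(tagged_words):
    """Pair each N word with the next N word, ignoring any X words in between
    (the NX*N pattern)."""
    pairs = []
    previous_n = None
    for word, tag in tagged_words:
        if tag == "N":
            if previous_n is not None:
                pairs.append(previous_n + " " + word)
            previous_n = word
    return pairs

tagged_sentence = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
print(extended_biwords(tagged_sentence))
```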

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich
ETH Zurich to
Zurich to flexibly
to flexibly react
flexibly react to

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich to
ETH Zurich to flexibly
Zurich to flexibly react
to flexibly react to

"Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

Inconvenient

Size of the vocabulary increases exponentially: (#Terms)^n

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (term: document ID, term frequency, positions)

Help: C, 1, [1]
ETH: C, 1, [2]
Zurich: C, 1, [3]
to: C, 3, [4, 7, 11]
flexibly: C, 1, [5]
react: C, 1, [6]

Positional posting = document ID (as before) + term frequency + positions

Query: "ETH Zurich"

ETH: C, 1, [2]
Zurich: C, 1, [3]

The position of Zurich is the position of ETH +1, so the phrase matches.
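A positional index and the "+1" check can be sketched in Python as follows. Positions start at 1 as on the slide. The nested-dictionary layout, the second document "D", and the function names are assumptions made for this illustration.

```python
def build_positional_index(documents):
    """Map term -> document ID -> list of positions (starting at 1).
    The term frequency is the length of the position list."""
    index = {}
    for doc_id, text in documents.items():
        for position, term in enumerate(text.split(), start=1):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index

def phrase_search(index, phrase):
    """Return the documents in which the terms of the phrase occur
    at consecutive positions (each term one position after the previous)."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return []
    matches = []
    for doc_id, start_positions in index[terms[0]].items():
        for start in start_positions:
            if all(start + offset in index[term].get(doc_id, [])
                   for offset, term in enumerate(terms)):
                matches.append(doc_id)
                break
    return sorted(matches)

documents = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",  # hypothetical second document
}
index = build_positional_index(documents)

print(index["to"]["C"])
print(phrase_search(index, "ETH Zurich"))
print(phrase_search(index, "Help ETH Zurich to flexibly react"))
```

Unlike the bi-word index, the positional index rejects document D for the long query, and the vocabulary stays the same size; the cost is the extra space for the positions.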

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

Part-of-speech tagging

Sentence: Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NXN (the X words between two N words are skipped)

Extended bi-words: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
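A small sketch of how the extended bi-words could be derived from a tagged sentence. The tags are written by hand here, following the N/X labels of the slide; a real system would obtain them from a part-of-speech tagger. The function name `extended_biwords` is hypothetical.

```python
def extended_biwords(tagged):
    """Pair each N word with the next N word, skipping the X words in between."""
    kept = [word for word, tag in tagged if tag == "N"]
    return [f"{a} {b}" for a, b in zip(kept, kept[1:])]


# Hand-assigned tags, following the N/X labels on the slide.
tagged_sentence = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
pairs = extended_biwords(tagged_sentence)
print(pairs)
```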


Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Three-word phrases in the index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

Rewritten query: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect the postings lists of these phrases.

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Four-word phrases in the index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

Rewritten query: "Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect the postings lists of these phrases.

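Improves on the false positive issue

Query: "Help ETH Zurich to flexibly react"

Rewritten query: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative: the document does not contain "Zurich to flexibly", so it is no longer returned.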
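The effect of the phrase length can be checked with a short sketch. The helper names `kwords` and `may_contain_phrase` are assumptions; the check is done directly on one document rather than through a full index, to keep the example short.

```python
def kwords(text, k):
    """Return all runs of k consecutive words in the text."""
    words = text.split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def may_contain_phrase(doc, phrase, k):
    """True if every k-word run of the phrase occurs somewhere in the document."""
    doc_runs = set(kwords(doc, k))
    return all(run in doc_runs for run in kwords(phrase, k))


doc = "Help ETH Zurich to introduce techniques to flexibly react"
query = "Help ETH Zurich to flexibly react"

print(may_contain_phrase(doc, query, 2))  # bi-words: the document still matches
print(may_contain_phrase(doc, query, 3))  # three-word phrases: it no longer matches
```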

Drawback

Size of the vocabulary increases exponentially: (number of terms)^n for phrases of n words

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Each positional posting holds the document ID (as before), the term frequency, and the positions.

Help: C, 1, [1]

ETH: C, 1, [2]

Zurich: C, 1, [3]

to: C, 3, [4, 7, 11]

flexibly: C, 1, [5]

react: C, 1, [6]
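A minimal sketch of how such positional postings could be built. Positions are counted from 1, as on the slide; the function name `build_positional_index` and the nested-dictionary layout are assumptions, not the course's data structure.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {document ID -> list of 1-based positions of the term}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
positional_index = build_positional_index(docs)

for term in ["Help", "ETH", "Zurich", "to", "flexibly", "react"]:
    positions = positional_index[term]["C"]
    # Print: term, document ID, term frequency, positions.
    print(term, "C", len(positions), positions)
```

The term frequency does not need to be stored separately in this layout, since it is the length of the position list.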

Positional index

Query: "ETH Zurich"

ETH: C, 1, [2]

Zurich: C, 1, [3]

The position of Zurich is the position of ETH +1, so document C matches the phrase.
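The "+1" check generalizes to longer phrases: the i-th word of the phrase must appear i positions after the first word. The sketch below assumes a hypothetical index layout (term -> document ID -> positions) and a hypothetical function name `phrase_match`.

```python
def phrase_match(index, phrase):
    """Return the IDs of documents in which the terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            # The term at offset i must sit at position start + i.
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


# Hypothetical excerpt of the positional index for document C.
index = {
    "ETH": {"C": [2]},
    "Zurich": {"C": [3]},
    "to": {"C": [4, 7, 11]},
    "flexibly": {"C": [5]},
    "react": {"C": [6]},
}
print(phrase_match(index, "ETH Zurich"))
print(phrase_match(index, "Zurich ETH"))
```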

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

Polysynthetic languages: Siberian Yupik

neghyaghtughyugumayaghpetaallu

Glosses: eat, go to, want, past, <frustration>, inferential evidential ("turns out"), 3rd/3rd person, also

"Also, it turns out she/he wanted to go eat it, but ..."

Tokenize

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class of tokens (same sequences)
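The difference between tokens and types can be counted on the four lines above. This sketch assumes the simplest possible tokenizer (splitting on whitespace) and compares sequences case-sensitively; both are illustration choices, not the course's definition of tokenization.

```python
lines = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",
    "My hour is almost come",
    "Possess it merely That it should come to this",
]

# Tokens: every occurrence counts.
tokens = [token for line in lines for token in line.split()]

# Types: identical character sequences are merged into one class.
types = set(tokens)

print(len(tokens), "tokens")
print(len(types), "types")
```

Repeated words such as "hour", "come" and "it" contribute several tokens but only one type each.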

Stop words

Reuters' list

a an and are as at be by for from has he in is it its of on that the to was were will with

Trend: from a large list (200-300 words) to no list at all
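A minimal sketch of stop word removal with the 25-word list above. Lowercasing before the lookup and splitting on whitespace are assumptions of this example; the name `remove_stop_words` is hypothetical.

```python
REUTERS_STOP_WORDS = set(
    "a an and are as at be by for from has he in is it its of on "
    "that the to was were will with".split()
)


def remove_stop_words(text):
    """Split on whitespace, lowercase, and drop the tokens found in the stop list."""
    return [t for t in text.lower().split() if t not in REUTERS_STOP_WORDS]


print(remove_stop_words("Possess it merely That it should come to this"))
```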

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

Expansion (rather than equivalence classes)

Query term Windows -> search for: Windows

Query term windows -> search for: windows, Windows, window

Query term window -> search for: window, windows

Introduces asymmetry

When to expand

Upon indexing: when a document containing "Lift" is indexed (document 6 in the example), it is added to the postings lists of both Lift and Elevator.

Upon querying: the index is left unexpanded, and the query "Lift" is expanded to "Lift OR Elevator".

[Figure: postings lists of Lift and Elevator before and after expansion, for both variants.]
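The two variants can be contrasted in a short sketch. The synonym table, the function names and the starting postings are hypothetical and chosen only for illustration; they are not read off the slide.

```python
# Hypothetical synonym table.
SYNONYMS = {"Lift": ["Elevator"], "Elevator": ["Lift"]}


def index_with_expansion(index, doc_id, terms):
    """Expansion upon indexing: file the document under each term and its synonyms."""
    for term in terms:
        for key in [term] + SYNONYMS.get(term, []):
            postings = index.setdefault(key, [])
            if doc_id not in postings:
                postings.append(doc_id)
                postings.sort()


def expand_query(term):
    """Expansion upon querying: rewrite the term as a disjunction with its synonyms."""
    return " OR ".join([term] + SYNONYMS.get(term, []))


# Hypothetical postings before document 6 is added.
index = {"Lift": [1, 5], "Elevator": [4]}
index_with_expansion(index, 6, ["Lift"])
print(index)
print(expand_query("Lift"))
```

Expanding upon indexing makes the postings lists longer, while expanding upon querying makes each query more expensive to evaluate.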

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz
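The normalization rules (lowercasing, removing characters, removing diacritics) can be combined in one sketch. The transliteration table for German umlauts is an assumption needed to obtain "Zuerich" rather than "Zurich"; the function name `normalize` and the choice of which characters to remove are also hypothetical.

```python
import unicodedata

# Assumed transliteration for German, applied before generic accent removal.
GERMAN_TRANSLITERATION = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def normalize(token):
    """Lowercase, transliterate umlauts, strip remaining diacritics, drop '.' and '-'."""
    token = unicodedata.normalize("NFC", token.lower())
    token = "".join(GERMAN_TRANSLITERATION.get(ch, ch) for ch in token)
    # Decompose accented letters and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", token)
    token = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return token.replace(".", "").replace("-", "")


for raw in ["cliché", "Zürich", "CAT", "Schweiz", "to-day"]:
    print(raw, "->", normalize(raw))
```

Note that this sketch always lowercases, so it maps "Zürich" to "zuerich"; a system that keeps case would skip the first step.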

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

Stemming

analysis -> analysi

feature -> featur

are -> ar

easily -> easili

visible -> visibl

variations -> variat

individual -> individu

Chop letters off the word

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE          valenci -> valence
(m>0) ANCI -> ANCE          hesitanci -> hesitance
(m>0) IZER -> IZE           digitizer -> digitize
(m>0) ABLI -> ABLE          conformabli -> conformable
(m>0) ALLI -> AL            radicalli -> radical
(m>0) ENTLI -> ENT          differentli -> different
(m>0) ELI -> E              vileli -> vile
(m>0) OUSLI -> OUS          analogousli -> analogous
(m>0) IZATION -> IZE        vietnamization -> vietnamize
(m>0) ATION -> ATE          predication -> predicate
(m>0) ATOR -> ATE           operator -> operate
(m>0) ALISM -> AL           feudalism -> feudal
(m>0) IVENESS -> IVE        decisiveness -> decisive
(m>0) FULNESS -> FUL        hopefulness -> hopeful
(m>0) OUSNESS -> OUS        callousness -> callous
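A sketch of how the rules listed above could be applied. It covers only these fifteen rules, not the full Porter stemmer. The measure m counts vowel-consonant sequences in the stem that remains after removing the suffix; choosing the longest matching suffix is an assumption of this sketch, and the function names are hypothetical.

```python
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_vowel(word, i):
    """a, e, i, o, u are vowels; y is a vowel only when preceded by a consonant."""
    if word[i] in "aeiou":
        return True
    return word[i] == "y" and i > 0 and not is_vowel(word, i - 1)


def measure(stem):
    """Count the vowel-to-consonant transitions (the m of the rule conditions)."""
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        vowel = is_vowel(stem, i)
        if previous_was_vowel and not vowel:
            m += 1
        previous_was_vowel = vowel
    return m


def apply_rules(word):
    """Rewrite the longest matching suffix if the remaining stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for word in ["valenci", "vietnamization", "operator", "ration"]:
    print(word, "->", apply_rules(word))
```

The word "ration" is left unchanged because the stem "r" has m = 0, which is what the condition (m>0) is there to prevent.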

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

The Vauquois Triangle

[Figure: the Vauquois triangle. Levels on each side, from bottom to top: Words, Syntactic Structure, Semantic Structure, with Interlingua at the apex. Paths between the two sides: Direct, Syntactic Transfer, Semantic Transfer.]

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

Lemmatization or Stemming

Lemmatization does not help, and can even degrade performance, for English documents.

Lemmatization can help with languages that have more morphology.

Skip lists

Reminder: Postings (Standard Inverted Index)

Term         Document frequency   Postings
you          1                    1
come         3                    1 3 4
most         1                    1
carefully    1                    1
upon         1                    1
your         1                    1
thine        1                    2
be           1                    2
time         1                    2
Laertes      1                    2
thy          1                    2
fair         1                    2
take         1                    2
my           1                    3
hour         3                    1 2 3
is           1                    3
almost       1                    3
possess      1                    4
merely       1                    4
that         1                    4
it           1                    4
should       1                    4
to           1                    4
this         1                    4
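As a reminder of how these postings come about, here is a minimal sketch that builds a standard inverted index from the four example documents. Numbering the documents 1 to 4 in the order shown earlier, and lowercasing every term (including "Laertes"), are assumptions of this example.

```python
docs = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_index(docs):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    index = {}
    for doc_id in sorted(docs):
        # set() keeps one posting per document even if the term repeats.
        for term in set(docs[doc_id].lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index


inverted_index = build_index(docs)
for term in ["come", "hour", "it"]:
    postings = inverted_index[term]
    # Print: term, document frequency, postings.
    print(term, len(postings), postings)
```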

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
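The intersection of two sorted postings lists can be sketched as a linear merge with one cursor per list. The function name `intersect` is an assumption.

```python
def intersect(a, b):
    """Intersect two sorted postings lists in O(len(a) + len(b)) comparisons."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the cursor that points at the smaller document ID
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```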

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3 10 12

How could we make this faster?

After 1 and 3 have been matched, the next element of List B is 10. A "magical pointer" that jumps directly to 10 in List A would skip the comparisons with 4 to 9.

In practice

List: 1 2 3 4 5 6 7 8 9 10 12, with skip pointers to 4, 7 and 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p evenly spaced skips, where p is the number of postings

Example: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
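A sketch of intersection with skip pointers placed every √p postings. Storing the skip pointers in a dictionary from list index to target index is an illustration choice, and the names `skip_targets` and `intersect_with_skips` are hypothetical. A skip is followed only when its target does not overshoot the current element of the other list, so no match can be missed.

```python
import math


def skip_targets(postings):
    """Map list index -> index reached by its skip pointer (spacing: floor of sqrt(p))."""
    p = len(postings)
    step = int(math.sqrt(p))
    if step < 2:
        return {}
    return {i: i + step for i in range(0, p - step, step)}


def intersect_with_skips(a, b):
    """Intersect two sorted lists, following a skip when it does not overshoot."""
    skips_a, skips_b = skip_targets(a), skip_targets(b)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            if i in skips_a and a[skips_a[i]] <= b[j]:
                i = skips_a[i]  # jump over postings that cannot match
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                j = skips_b[j]
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print(skip_targets(list_a))
print(intersect_with_skips(list_a, list_b))
```

For the 11 postings of List A the spacing is 3, so the pointers lead to the entries 4, 7 and 10, as in the example above.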

Phrase search

Phrase queries

With a posting list

Query: "ETH Zürich" (quotes = phrase search)

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

Phrase search approaches

Biword indices

Positional indices

Bi-word indices

Document: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (first entries): Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to, ...

Query: "ETH Zurich" (a single bi-word, looked up directly in the index)

Query: "Help ETH Zurich to flexibly react"

Rewritten query: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"



Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Tokenize

Document 1: You come most carefully upon your hour
Document 2: Take thy fair hour Laertes time be thine
Document 3: My hour is almost come
Document 4: Possess it merely That it should come to this

Token = grouped sequence of characters

Type = equivalence class of tokens (same sequence of characters)
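The token/type distinction can be sketched in a few lines of Python. This is only an illustration: whitespace splitting stands in for a real tokenizer, and the word order of the second document is an assumption (it does not affect the counts).

```python
# The four example documents from the slides, punctuation already removed.
documents = [
    "You come most carefully upon your hour",
    "Take thy fair hour Laertes time be thine",  # word order assumed
    "My hour is almost come",
    "Possess it merely That it should come to this",
]


def tokenize(text):
    """Split on whitespace; every resulting piece is one token."""
    return text.split()


# Tokens keep duplicates, types are the distinct character sequences.
tokens = [token for doc in documents for token in tokenize(doc)]
types = set(tokens)

print(len(tokens), "tokens,", len(types), "types")
```

No case folding is applied here, so "That" and "that" would count as two different types.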

Stop words

Reuters' list: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with

Trend: large list (200-300) -> no list at all
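A minimal sketch of stop-word removal with the list above. The function name and the choice to compare in lowercase are assumptions made for this example.

```python
# Stop list as shown on the slide (25 words).
REUTERS_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
    "has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
    "to", "was", "were", "will", "with",
}


def remove_stop_words(tokens):
    """Drop every token whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in REUTERS_STOP_WORDS]


kept = remove_stop_words("Possess it merely That it should come to this".split())
print(kept)
```

Using no stop list at all simply means skipping this filtering step and indexing every token.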

Equivalence classes of terms (types)

to-day, today
U.S.A., USA
window, windows, Windows
Zurich, Zürich, Zuerich, Züri

Normalization

U.S.A. -> USA
to-day -> today

Rules that remove characters are the easy part
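A character-removal rule of this kind can be sketched as follows. It assumes the slide's examples map U.S.A. to USA and to-day to today; the function name is hypothetical.

```python
def normalize(token):
    """Map a token to its equivalence class by deleting periods and hyphens."""
    return token.replace(".", "").replace("-", "")


# Tokens with the same normalized form fall into the same equivalence class.
print(normalize("U.S.A."), normalize("to-day"))
```

Rules that need to add or change characters (for example window/windows) do not fit this simple pattern, which is why removal rules are the easy part.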

Expansion (rather than equivalence classes)

Query term -> terms searched
Windows -> Windows
windows -> Windows, windows, window
window -> window, windows

Introduces asymmetry

When to expand

Upon indexing: each document is indexed under both Lift and Elevator, so the postings lists grow and the query Lift runs unchanged.

Upon querying: the index stays unchanged and the query Lift is expanded to Lift OR Elevator.

[Figure: postings lists of Lift and Elevator before and after expansion]
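The two expansion strategies can be compared with a small sketch. The document IDs below are hypothetical (the figure's postings are not recoverable from the text), and the synonym table and function names are assumptions for illustration.

```python
# Hypothetical synonym table and postings (document IDs are made up).
SYNONYMS = {"lift": {"elevator"}, "elevator": {"lift"}}
index = {"lift": {1, 5}, "elevator": {4, 6}}


def expand_index(index):
    """Expansion upon indexing: merge the postings of synonyms into each term."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for synonym in SYNONYMS.get(term, ()):
            merged |= index.get(synonym, set())
        expanded[term] = merged
    return expanded


def search_with_query_expansion(index, term):
    """Expansion upon querying: evaluate term OR synonym on the original index."""
    result = set(index.get(term, set()))
    for synonym in SYNONYMS.get(term, ()):
        result |= index.get(synonym, set())
    return result


print(expand_index(index)["lift"])
print(search_with_query_expansion(index, "lift"))
```

Both strategies return the same documents. Expanding the index costs space, while expanding the query costs time at query evaluation.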

Normalization: accents and diacritics

cliché -> cliche
Zürich -> Zuerich

Normalization: lowercasing

CAT -> cat
Schweiz -> schweiz
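A sketch of these normalizations using only the Python standard library. Stripping diacritics by Unicode decomposition turns ü into u, so the ü -> ue mapping from the slide needs a separate transliteration table; that table and the function names are assumptions for this example.

```python
import unicodedata

# Assumed transliteration table for German umlauts and sharp s.
GERMAN_TRANSLITERATION = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss",
}


def strip_diacritics(token):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def transliterate_german(token):
    """Replace umlauts with their two-letter spelling."""
    return "".join(GERMAN_TRANSLITERATION.get(ch, ch) for ch in token)


print(strip_diacritics("cliché"), transliterate_german("Zürich"), "CAT".lower())
```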

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple
Apples fall from trees -> apples

Machine learning inside

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the word

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE: valenci -> valence
(m>0) ANCI -> ANCE: hesitanci -> hesitance
(m>0) IZER -> IZE: digitizer -> digitize
(m>0) ABLI -> ABLE: conformabli -> conformable
(m>0) ALLI -> AL: radicalli -> radical
(m>0) ENTLI -> ENT: differentli -> different
(m>0) ELI -> E: vileli -> vile
(m>0) OUSLI -> OUS: analogousli -> analogous
(m>0) IZATION -> IZE: vietnamization -> vietnamize
(m>0) ATION -> ATE: predication -> predicate
(m>0) ATOR -> ATE: operator -> operate
(m>0) ALISM -> AL: feudalism -> feudal
(m>0) IVENESS -> IVE: decisiveness -> decisive
(m>0) FULNESS -> FUL: hopefulness -> hopeful
(m>0) OUSNESS -> OUS: callousness -> callous
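The rules listed above can be tried out with a small sketch. It covers only these fifteen rules, not the full Porter algorithm. The measure m counts the vowel-consonant sequences of the stem, and the longest matching suffix is tried first; the function names are hypothetical.

```python
VOWELS = "aeiou"

# (suffix, replacement) pairs, exactly the rules listed on the slide.
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u, or y after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of vowel-to-consonant transitions in the stem."""
    m = 0
    previous_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m


def apply_rules(word):
    """Apply the longest matching rule if the remaining stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


print(apply_rules("vietnamization"), apply_rules("differentli"))
```

Trying the longest suffix first matters: vietnamization must match IZATION rather than ATION, and differentli must match ENTLI rather than ELI.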

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

[Figure: the Vauquois triangle, with the levels Words, Syntactic Structure, Semantic Structure and Interlingua on both sides, and the labels Direct, Transfer, Syntactic and Semantic]

Full morphological analysis: computer, compute, computes, computed, computation, computing

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

Lemmatization can help with languages that have more morphology

Skip lists

Reminder: Postings (Standard Inverted Index)

Term (document frequency): postings

you (1): 1
come (3): 1 3 4
most (1): 1
carefully (1): 1
upon (1): 1
your (1): 1
hour (3): 1 2 3
thine (1): 2
be (1): 2
time (1): 2
Laertes (1): 2
thy (1): 2
fair (1): 2
take (1): 2
my (1): 3
is (1): 3
almost (1): 3
possess (1): 4
merely (1): 4
that (1): 4
it (1): 4
should (1): 4
to (1): 4
this (1): 4

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12
List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
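The intersection of two sorted postings lists can be sketched as a linear merge with one pointer per list. The function name is an assumption; the lists are the ones from the slide.

```python
def intersect(a, b):
    """Intersect two sorted postings lists in O(len(a) + len(b)) steps."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document ID in both lists: keep it, advance both pointers.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer on the smaller ID
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```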

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12
List B: 1 3 10 11 12 14

Intersection of A and B: 1 3 10 12

How could we make this faster?

After 3 is matched, the next posting in List B is 10. A "magical pointer" that moves List A straight to 10 would skip the comparisons with 4 through 9.

In practice

List: 1 2 3 4 5 6 7 8 9 10 12
Skip pointers to 4, 7 and 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p evenly spaced skips, where p = number of postings

Example list: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
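A sketch of intersection with skip pointers, assuming the heuristic meant here is about √p evenly spaced skips for a list of p postings. Storing the skips in a dictionary and the function names are choices made for this example only.

```python
import math


def build_skips(postings):
    """Map a position to the position it can skip to, with a step of about sqrt(p)."""
    p = len(postings)
    step = max(1, round(math.sqrt(p)))
    return {i: i + step for i in range(0, p - step, step)}


def intersect_with_skips(a, b):
    """Merge-intersect two sorted lists, following a skip when it does not overshoot."""
    skips_a, skips_b = build_skips(a), build_skips(b)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            if i in skips_a and a[skips_a[i]] <= b[j]:
                # Everything skipped over is smaller than b[j], so it is safe.
                while i in skips_a and a[skips_a[i]] <= b[j]:
                    i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                while j in skips_b and b[skips_b[j]] <= a[i]:
                    j = skips_b[j]
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print(build_skips(list_a))
print(intersect_with_skips(list_a, list_b))
```

For the 11 postings of List A the step is 3, so the skips lead to the postings 4, 7 and 10. After 3 is matched, the pointer on List A follows two skips and lands directly on 10.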

Phrase search

Phrase queries

"ETH Zürich"
Quotes = phrase search

With a posting list

ETH: 1 2 4 5 8 9 10
Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

Phrase search approaches

Biword indices
Positional indices

Bi-word indices

Document: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to, ...

Query: "ETH Zurich"
Looked up directly as the biword ETH Zurich

Query: "Help ETH Zurich to flexibly react"
Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react
Answered by intersecting the postings lists of these biwords
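A minimal sketch of a biword index over the example document. The document ID "C" and the function names are assumptions for illustration; a phrase query is answered by intersecting the postings of its biwords.

```python
from collections import defaultdict


def biwords(text):
    """Return every pair of consecutive words as one index term."""
    words = text.split()
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


def build_biword_index(docs):
    """Map each biword to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in biwords(text):
            index[term].add(doc_id)
    return index


def phrase_query(index, phrase):
    """Intersect the postings of all biwords of the phrase."""
    postings = [index.get(term, set()) for term in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_biword_index(docs)
print(phrase_query(index, "ETH Zurich"))
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```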

Inconvenient

Query: "Help ETH Zurich to flexibly react"
Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains all five biwords but not the phrase: false positive

Part-of-speech tagging

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NX*N

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
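Assuming the pattern on the slide is NX*N (two N words with any number of X words between them), the extended biwords can be sketched as below. The tags are hardcoded from the slide rather than produced by a real tagger, and the assignment of tags to words is an assumption.

```python
# (word, tag) pairs; tag assignment assumed from the slide.
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]


def extended_biwords(tagged_words):
    """Pair each N word with the next N word, ignoring the X words in between."""
    content = [word for word, tag in tagged_words if tag == "N"]
    return [f"{first} {second}" for first, second in zip(content, content[1:])]


print(extended_biwords(tagged))
```

Dropping the X words before pairing is equivalent to matching NX*N, because any two N words separated only by X words become neighbours.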

Bi-word indices

Query: "Help ETH Zurich to flexibly react"
Index terms: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to, ...
Intersect: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Phrase indices (3)

Index terms: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to, ...
Intersect: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Phrase indices (4)

Index terms: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to, ...
Intersect: Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"
Document: Help ETH Zurich to introduce techniques to flexibly react
With 3-word terms the document does not contain Zurich to flexibly: true negative
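The effect of longer index terms on the false positive can be checked with a short sketch. Here `k` is the number of words per index term, and the function names are hypothetical.

```python
def k_words(text, k):
    """Return all runs of k consecutive words as index terms."""
    words = text.split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def matches(document, phrase, k):
    """True if every k-word term of the phrase also occurs in the document."""
    document_terms = set(k_words(document, k))
    return all(term in document_terms for term in k_words(phrase, k))


document = "Help ETH Zurich to introduce techniques to flexibly react"
phrase = "Help ETH Zurich to flexibly react"

for k in (2, 3, 4):
    print(k, matches(document, phrase, k))
```

With k = 2 the document matches even though it does not contain the phrase. With k = 3 or k = 4 it is correctly rejected, at the price of a much larger vocabulary.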

Inconvenient

Size of the vocabulary increases exponentially with the phrase length n: (number of terms)^n

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting = document ID (as before), term frequency, positions

Index (term -> document ID, term frequency, positions):
Help -> C, 1, [1]
ETH -> C, 1, [2]
Zurich -> C, 1, [3]
to -> C, 3, [4, 7, 11]
flexibly -> C, 1, [5]
react -> C, 1, [6]
...

Query: "ETH Zurich"
ETH -> C, 1, [2]
Zurich -> C, 1, [3]
Match: position of Zurich = position of ETH + 1
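A minimal sketch of a positional index and of phrase matching by position offsets. Positions start at 1 as on the slide, the term frequency is simply the length of the position list, and the function names are assumptions.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {document ID -> sorted list of positions (starting at 1)}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index


def phrase_search(index, phrase):
    """Return the documents in which the words occur at consecutive positions."""
    words = phrase.split()
    if not words:
        return set()
    hits = set()
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # The word at offset n must appear at position start + n.
            if all(
                start + offset in index.get(word, {}).get(doc_id, [])
                for offset, word in enumerate(words[1:], start=1)
            ):
                hits.add(doc_id)
                break
    return hits


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)
print(index["to"]["C"])
print(phrase_search(index, "ETH Zurich"))
print(phrase_search(index, "Help ETH Zurich to flexibly react"))
```

Document "D" is a hypothetical second document, taken from the false positive example, to show that the positional index rejects it for the long phrase.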

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

83

Type

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

Type = equivalence class of tokens (same character sequence)

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters's list

a an and are as at be by for from has he in

is it its of on that the to was were will with
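A minimal sketch of how such a stop list could be applied at indexing time. The function name `remove_stop_words` and the case-insensitive comparison are assumptions made for illustration, not something the slides specify.

```python
# The 25-word stop list shown on the slide.
STOP_WORDS = set(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)


def remove_stop_words(tokens):
    """Drop every token whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


doc = "Possess it merely That it should come to this".split()
filtered = remove_stop_words(doc)
print(filtered)
```

With this list, "it", "That" and "to" are dropped from the fourth example document, while "this" survives because it is not in the list.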

86

Trend: large list (200-300) -> no list at all

87

Equivalence classes of terms (types)

to-day, today

U.S.A., USA

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. -> USA

to-day -> today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows -> Windows

windows -> windows, Windows, window

window -> window, windows

Introduces asymmetry

93

When to expand

Upon indexing

Query: Lift

Original index:
Lift -> 1, 5
Elevator -> 1, 4, 6

Expansion

Expanded index:
Lift -> 1, 4, 5, 6
Elevator -> 1, 4, 5, 6

Upon querying

Query: Lift

Expansion

Expanded query: Lift OR Elevator

Index (unchanged):
Lift -> 1, 5
Elevator -> 1, 4, 6
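The two options can be sketched as follows. The synonym table `SYNONYMS`, the function names and the concrete postings are illustrative assumptions. Both strategies return the same documents; they differ in where the cost is paid (a larger index versus a longer query).

```python
# Illustrative postings lists (document IDs); the numbers are an assumption.
INDEX = {"lift": [1, 5], "elevator": [1, 4, 6]}

# Hypothetical synonym table that drives the expansion.
SYNONYMS = {"lift": ["elevator"], "elevator": ["lift"]}


def union(a, b):
    """Union of two postings lists, kept sorted."""
    return sorted(set(a) | set(b))


def expand_index(index):
    """Expansion upon indexing: each term also receives its synonyms' postings."""
    expanded = {}
    for term, postings in index.items():
        merged = list(postings)
        for synonym in SYNONYMS.get(term, []):
            merged = union(merged, index.get(synonym, []))
        expanded[term] = merged
    return expanded


def query_with_expansion(index, term):
    """Expansion upon querying: evaluate 'term OR synonym OR ...' on the original index."""
    result = list(index.get(term, []))
    for synonym in SYNONYMS.get(term, []):
        result = union(result, index.get(synonym, []))
    return result


at_index_time = expand_index(INDEX)["lift"]
at_query_time = query_with_expansion(INDEX, "lift")
print(at_index_time, at_query_time)
```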

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich
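A minimal sketch of this normalization with Python's standard `unicodedata` module. The generic rule strips combining marks after Unicode decomposition. The `GERMAN` transliteration table (ü -> ue and so on) is an assumption added to reproduce the "Zuerich" variant, since plain mark stripping yields "Zurich" instead.

```python
import unicodedata

# Assumed German transliteration convention, applied before the generic rule.
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text, transliterate_german=False):
    """Remove accents; optionally transliterate German umlauts first."""
    text = unicodedata.normalize("NFC", text)  # make sure umlauts are single characters
    if transliterate_german:
        text = "".join(GERMAN.get(ch, ch) for ch in text)
    return strip_diacritics(text)


print(normalize("cliché"))
print(normalize("Zürich", transliterate_german=True))
print(normalize("Zürich"))
```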

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI    -> ENCE   valenci        -> valence
(m>0) ANCI    -> ANCE   hesitanci      -> hesitance
(m>0) IZER    -> IZE    digitizer      -> digitize
(m>0) ABLI    -> ABLE   conformabli    -> conformable
(m>0) ALLI    -> AL     radicalli      -> radical
(m>0) ENTLI   -> ENT    differentli    -> different
(m>0) ELI     -> E      vileli         -> vile
(m>0) OUSLI   -> OUS    analogousli    -> analogous
(m>0) IZATION -> IZE    vietnamization -> vietnamize
(m>0) ATION   -> ATE    predication    -> predicate
(m>0) ATOR    -> ATE    operator       -> operate
(m>0) ALISM   -> AL     feudalism      -> feudal
(m>0) IVENESS -> IVE    decisiveness   -> decisive
(m>0) FULNESS -> FUL    hopefulness    -> hopeful
(m>0) OUSNESS -> OUS    callousness    -> callous
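The rules above can be read as "replace the suffix if the remaining stem has measure m > 0". The sketch below applies only the fifteen rules listed on this slide, not the full Porter algorithm. The names `measure` and `apply_rules` are illustrative, and m is assumed to follow Porter's definition: the number of vowel-to-consonant transitions in the stem, with "y" counted as a vowel when it follows a consonant.

```python
VOWELS = "aeiou"

# (suffix, replacement) pairs copied from the slide.
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    """A letter is a consonant unless it is a vowel, or a 'y' that follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions: the m in [C](VC)^m[V]."""
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m


def apply_rules(word):
    """Apply the longest matching rule, provided the stem satisfies m > 0."""
    word = word.lower()
    candidates = [rule for rule in RULES if word.endswith(rule[0])]
    if not candidates:
        return word
    suffix, replacement = max(candidates, key=lambda rule: len(rule[0]))
    stem = word[: -len(suffix)]
    return stem + replacement if measure(stem) > 0 else word


for example in ["valenci", "vietnamization", "operator", "hopefulness"]:
    print(example, "->", apply_rules(example))
```

Choosing the longest matching suffix matters for words such as "vietnamization", which ends in both IZATION and ATION.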

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois Triangle

[Figure: the Vauquois triangle. On each side, from bottom to top: Words, Syntactic Structure, Semantic Structure, with Interlingua at the apex. Transfer levels between the two sides: Direct, Syntactic, Semantic.]

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term        Doc. freq.   Postings
you         1            1
come        3            1 3 4
most        1            1
carefully   1            1
upon        1            1
your        1            1
thine       1            2
be          1            2
time        1            2
Laertes     1            2
thy         1            2
fair        1            2
take        1            2
my          1            3
hour        3            1 2 3
is          1            3
almost      1            3
possess     1            4
merely      1            4
that        1            4
it          1            4
should      1            4
to          1            4
this        1            4

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Stop when the end of a list is reached

Intersection of A and B: 1 4 8 12
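As a reminder in code, here is a minimal sketch of the linear merge of two sorted postings lists. The function name `intersect` is an illustrative choice. It runs in time proportional to the sum of the two list lengths.

```python
def intersect(a, b):
    """Intersect two sorted postings lists with one pointer per list."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document ID in both lists: keep it and advance both pointers.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot occur in b any more
        else:
            j += 1  # b[j] cannot occur in a any more
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```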

122

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B, built up one match at a time: 1, then 1 3, then 1 3 10, then 1 3 10 12

123

Another example

How could we make this faster?

126

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Magical pointer: after the match on 3, List A jumps straight to 10 instead of stepping through 4 to 9

Intersection of A and B so far: 1 3 10

132

In practice

1 2 3 4 5 6 7 8 9 10 12, with skip pointers to 4, 7 and 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skips, where p is the number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
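A sketch of the intersection with skip pointers, under some stated assumptions: postings are stored in plain Python lists, a skip pointer leaves every position that is a multiple of the step, and the step is the integer square root of the list length. The names `advance` and `intersect_with_skips` are illustrative.

```python
import math


def skip_step(postings):
    """Skip length: roughly the square root of the number of postings."""
    return max(1, math.isqrt(len(postings)))


def advance(postings, pos, step, target):
    """Move past postings[pos], following skip pointers that do not overshoot target."""
    def can_skip(p):
        return p % step == 0 and p + step < len(postings) and postings[p + step] <= target

    if can_skip(pos):
        while can_skip(pos):
            pos += step
        return pos
    return pos + 1


def intersect_with_skips(a, b):
    """Intersect two sorted postings lists, using skips on whichever list is behind."""
    step_a, step_b = skip_step(a), skip_step(b)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, step_a, b[j])
        else:
            j = advance(b, j, step_b, a[i])
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

On these lists, after the match on 3 the pointer into List A skips from 4 to 7 and then to 10 rather than visiting every posting in between. A skip is taken only when its target does not exceed the value being searched for, so no match can be missed.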

133

Phrase search

134

Phrase queries

138

With a posting list

"ETH Zürich"

Quotes = phrase search

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

141

Phrase search approaches

Biword indices

Positional indices

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

152

Bi-word indices

Query: ETH Zurich

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

156

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Rewritten query: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect
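A minimal sketch of a bi-word index, assuming whitespace tokenization and sets of document IDs as postings. The second document "D" and all function names are made up for illustration.

```python
from collections import defaultdict


def biwords(text):
    """All pairs of consecutive words, each pair treated as one index term."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in biwords(text):
            index[term].add(doc_id)
    return index


def phrase_query(index, phrase):
    """Rewrite the phrase as an AND of its biwords and intersect the postings."""
    terms = biwords(phrase)
    if not terms:  # a single word has no biword
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "ETH is located in Zurich",  # hypothetical second document
}
index = build_biword_index(docs)
print(phrase_query(index, "ETH Zurich"))
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

Document "D" contains both words but not the biword "ETH Zurich", so only "C" is returned for the first query.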

165

Drawback

Query: Help ETH Zurich to flexibly react

Biword query: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive: the document contains every biword of the query, but not the phrase itself

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react
N    N   N      X  N         N          X  X  X  N        N

Pattern: NX*N

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
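The pattern can be sketched in a few lines, assuming the tags are already given as on the slide (no actual part-of-speech tagger is involved) and that only the two tags N and X occur. The function name `extended_biwords` is illustrative.

```python
def extended_biwords(words, tags):
    """Pair each N word with the next N word, skipping any X words in between (NX*N)."""
    content_words = [word for word, tag in zip(words, tags) if tag == "N"]
    return [" ".join(pair) for pair in zip(content_words, content_words[1:])]


words = "Help ETH Zurich to introduce techniques so as to flexibly react".split()
tags = "N N N X N N X X X N N".split()
print(extended_biwords(words, tags))
```

Because every non-N word is tagged X here, two N words that are neighbours after dropping the X words always form a match for NX*N.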

172

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Rewritten query: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich
ETH Zurich to
Zurich to flexibly
to flexibly react
flexibly react to

Rewritten query: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich to
ETH Zurich to flexibly
Zurich to flexibly react
to flexibly react to

Rewritten query: "Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

Rewritten query: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

True negative
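The effect of the term length can be checked on this very example. The sketch below compares the k-word terms of the query with those of the document. The helper names `k_words` and `matches` are assumptions, and a single document stands in for a full index.

```python
def k_words(text, k):
    """The set of all runs of k consecutive words in the text."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def matches(document, query, k):
    """True if every k-word term of the query also occurs in the document."""
    return k_words(query, k) <= k_words(document, k)


query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

biword_match = matches(document, query, 2)    # all five biwords are present
triword_match = matches(document, query, 3)   # "Zurich to flexibly" is missing
really_contains_phrase = query in document

print(biword_match, triword_match, really_contains_phrase)
```

With k = 2 the document is reported although it does not contain the phrase. With k = 3 it is correctly rejected.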

178

Drawback

Size of the vocabulary increases exponentially with the phrase length n: up to (#Terms)^n

189

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Positional posting: document ID (as before), term frequency, positions

Help: C, 1, [1]
ETH: C, 1, [2]
Zurich: C, 1, [3]
to: C, 3, [4, 7, 11]
flexibly: C, 1, [5]
react: C, 1, [6]

192

Positional index

Query: ETH Zurich

ETH: C, 1, [2]
Zurich: C, 1, [3]

Match: same document, and the position of Zurich is the position of ETH +1
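A sketch of a positional index and of the "+1" check for phrase queries, assuming whitespace tokenization, 1-based positions and a nested dictionary as the storage layout. The term frequency is simply the length of each position list. All names are illustrative.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {document ID -> sorted list of 1-based positions}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Return {document ID: start positions} where the words occur consecutively."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return {}
    result = {}
    for doc_id, starts in index[terms[0]].items():
        hits = [
            start for start in starts
            # The k-th following word must sit exactly k positions later.
            if all(start + k in index[term].get(doc_id, [])
                   for k, term in enumerate(terms[1:], start=1))
        ]
        if hits:
            result[doc_id] = hits
    return result


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_positional_index(docs)
print(index["to"])
print(phrase_query(index, "ETH Zurich"))
```

Unlike a bi-word index, the vocabulary stays the same size. The price is larger postings, since every occurrence of a term is recorded.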

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

84

Stop words

You come most carefully upon your hour

Take thy fair hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

85

Reuters's list

a an and are as at be by for from has he in

is it its of on that the to was were will with
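A minimal sketch of stop-word removal with the 25-word list above. Lowercasing and whitespace tokenization are simplifying assumptions, and the names `REUTERS_STOP_WORDS` and `remove_stop_words` are illustrative only.

```python
# The 25 stop words listed on the slide above.
REUTERS_STOP_WORDS = frozenset(
    "a an and are as at be by for from has he in "
    "is it its of on that the to was were will with".split()
)


def remove_stop_words(text, stop_words=REUTERS_STOP_WORDS):
    """Lowercase, split on whitespace, and drop every token found in stop_words."""
    return [token for token in text.lower().split() if token not in stop_words]


print(remove_stop_words("Possess it merely That it should come to this"))
# ['possess', 'merely', 'should', 'come', 'this']
```

Only "it", "that" and "to" are dropped here; "this" is not on the list.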

86

Trend

Large list (200-300) → No list at all

87

Equivalence classes of terms (types)

to-day, today

USA, U.S.A.

window, windows, Windows

Zurich, Zürich, Zuerich, Züri

88

Normalization

U.S.A. → USA

to-day → today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows → Windows

windows → windows, Windows, window

window → window, windows

Introduces asymmetry

90

When to expand

Upon indexing: a new document containing Lift → Expansion → the document is posted under both Lift and Elevator

Upon querying: the query Lift → Expansion → Lift OR Elevator

[Figure: postings lists of Lift and Elevator before and after expansion]
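The two options can be sketched as follows. The synonym table, the function names and the posting numbers are all illustrative assumptions, not values taken from the slide's diagram.

```python
# Hypothetical synonym table; a real system would take it from a thesaurus.
SYNONYMS = {"lift": {"elevator"}, "elevator": {"lift"}}


def expand(term):
    """Return the term together with its synonyms."""
    return {term} | SYNONYMS.get(term, set())


def search_with_query_expansion(index, term):
    """Query-time expansion: evaluate 'lift' as 'lift OR elevator'."""
    result = set()
    for t in expand(term):
        result |= set(index.get(t, []))
    return sorted(result)


def add_document_with_index_expansion(index, doc_id, terms):
    """Index-time expansion: post the document under each term and its synonyms.

    Assumes documents arrive in increasing doc_id order, so appending keeps
    every postings list sorted.
    """
    for term in terms:
        for t in expand(term):
            postings = index.setdefault(t, [])
            if doc_id not in postings:
                postings.append(doc_id)


index = {"lift": [1, 5], "elevator": [4]}
print(search_with_query_expansion(index, "lift"))  # [1, 4, 5]

add_document_with_index_expansion(index, 6, ["lift"])
print(index)  # {'lift': [1, 5, 6], 'elevator': [4, 6]}
```

Query-time expansion keeps the index small but makes each query more expensive; index-time expansion does the opposite.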

94

Normalization: accents and diacritics

cliché → cliche

Zürich → Zuerich

95

Normalization: lowercasing

CAT → cat

Schweiz → schweiz
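A possible sketch of these two normalization steps, using Python's standard `unicodedata` module. The transliteration table (ü → ue and so on) is an assumption chosen to reproduce the Zuerich example; plain Unicode decomposition alone would yield "Zurich".

```python
import unicodedata

# Assumed German-style transliteration, matching the Zuerich example above.
TRANSLITERATION = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss",
})


def strip_diacritics(token):
    """Transliterate umlauts, then remove any remaining combining marks."""
    # NFC first, so that 'u' + combining diaeresis is seen as the single letter 'ü'.
    token = unicodedata.normalize("NFC", token).translate(TRANSLITERATION)
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(token):
    """Strip diacritics, then lowercase."""
    return strip_diacritics(token).lower()


print(strip_diacritics("cliché"), strip_diacritics("Zürich"))  # cliche Zuerich
print(normalize("CAT"), normalize("Schweiz"))                  # cat schweiz
```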

96

Normalization: truecasing

The company Apple has launched a new iProduct → Apple

Apples fall from trees → apples

Machine learning inside

97

Stemming

analysis → analysi
feature → featur
are → ar
easily → easili
visible → visibl
variations → variat
individual → individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE          valenci -> valence
(m>0) ANCI -> ANCE          hesitanci -> hesitance
(m>0) IZER -> IZE           digitizer -> digitize
(m>0) ABLI -> ABLE          conformabli -> conformable
(m>0) ALLI -> AL            radicalli -> radical
(m>0) ENTLI -> ENT          differentli -> different
(m>0) ELI -> E              vileli -> vile
(m>0) OUSLI -> OUS          analogousli -> analogous
(m>0) IZATION -> IZE        vietnamization -> vietnamize
(m>0) ATION -> ATE          predication -> predicate
(m>0) ATOR -> ATE           operator -> operate
(m>0) ALISM -> AL           feudalism -> feudal
(m>0) IVENESS -> IVE        decisiveness -> decisive
(m>0) FULNESS -> FUL        hopefulness -> hopeful
(m>0) OUSNESS -> OUS        callousness -> callous
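The rules above can be tried out with a small sketch. It covers only the rules listed on this slide, not the full Porter algorithm, and it assumes lowercase input. The measure m counts vowel-to-consonant transitions in the stem, following Porter's definition of consonants (y counts as a vowel when it follows a consonant). The names `RULES`, `measure` and `apply_rules` are illustrative.

```python
# Rules from the slide: (suffix, replacement), applied only if the stem has m > 0.
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    """Porter's definition: not a, e, i, o, u, and not a 'y' preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number of vowel-run-then-consonant-run sequences (VC) in the stem."""
    m = 0
    previous_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m


def apply_rules(word):
    """Apply the rule with the longest matching suffix if the remaining stem has m > 0."""
    matching = [rule for rule in RULES if word.endswith(rule[0])]
    if not matching:
        return word
    suffix, replacement = max(matching, key=lambda rule: len(rule[0]))
    stem = word[: -len(suffix)]
    return stem + replacement if measure(stem) > 0 else word


for example in ["valenci", "vietnamization", "operator", "hopefulness"]:
    print(example, "->", apply_rules(example))
```

Choosing the longest matching suffix matters for a word like "vietnamization", which ends in both IZATION and ATION.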

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois Triangle

[Figure: the Vauquois triangle. Both sides rise from Words through Syntactic Structure and Semantic Structure to Interlingua at the apex; the links between the two sides are labeled Direct and Transfer (Syntactic, Semantic).]

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages

that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

Term (document frequency): postings

you (1): 1
come (3): 1 3 4
most (1): 1
carefully (1): 1
upon (1): 1
your (1): 1
thine (1): 2
be (1): 2
time (1): 2
Laertes (1): 2
thy (1): 2
fair (1): 2
take (1): 2
my (1): 3
hour (3): 1 2 3
is (1): 3
almost (1): 3
possess (1): 4
merely (1): 4
that (1): 4
it (1): 4
should (1): 4
to (1): 4
this (1): 4
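A minimal sketch of how such postings could be built. It assumes the four example lines from the stop-words slide as documents 1 to 4 (the word order of document 2 is my reading of the slide), with lowercasing and whitespace tokenization; `build_inverted_index` is an illustrative name.

```python
from collections import defaultdict

# Assumed document collection: the four example lines, numbered 1 to 4.
DOCUMENTS = {
    1: "You come most carefully upon your hour",
    2: "Take thy fair hour Laertes time be thine",
    3: "My hour is almost come",
    4: "Possess it merely That it should come to this",
}


def build_inverted_index(documents):
    """Map each lowercased term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


index = build_inverted_index(DOCUMENTS)
print(index["come"])  # [1, 3, 4]
print(index["hour"])  # [1, 2, 3]
```

The document frequency of a term is the length of its postings list; "it" occurs twice in document 4 but is posted there only once.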

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12
List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
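As a reminder in code form, here is a sketch of the linear merge on two sorted postings lists, using the lists of this slide. The function name `intersect` is illustrative.

```python
def intersect(a, b):
    """Intersect two sorted postings lists with one pointer per list."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document ID in both lists: keep it and advance both pointers.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot occur in b any more
        else:
            j += 1  # b[j] cannot occur in a any more
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))  # [1, 4, 8, 12]
```

Each pointer only moves forward, so the number of steps is at most the sum of the two list lengths.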

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14
List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: List A jumps directly to 10

Intersection of A and B: 1 3 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip pointers to 4, 7, 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skips, where p is the number of postings
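A sketch of intersection with skip pointers, under the assumption that a list of p postings carries a skip pointer every ⌊√p⌋ positions (stored implicitly by index here, rather than as real pointers). The helper names are illustrative.

```python
import math


def advance(postings, pos, step, target):
    """Move past postings[pos], following skip pointers while they do not overshoot target."""
    has_skip = pos % step == 0 and pos + step < len(postings)
    if has_skip and postings[pos + step] <= target:
        while pos % step == 0 and pos + step < len(postings) and postings[pos + step] <= target:
            pos += step
        return pos
    return pos + 1


def intersect_with_skips(a, b):
    """Intersect two sorted postings lists, using implicit skips of length ~sqrt(p)."""
    step_a = max(1, math.isqrt(len(a)))
    step_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, step_a, b[j])
        else:
            j = advance(b, j, step_b, a[i])
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))  # [1, 3, 10, 12]
```

With 12 postings in List A the skip length is 3, so the skips lead to 4, 7 and 10: after matching 3, the pointer on List A moves from 4 to 7 to 10 without visiting 5, 6, 8 or 9.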

133

Phrase search

134

Phrase queries

With a posting list

"ETH Zürich" (quotes = phrase search)

ETH: 1 2 4 5 8 9 10
Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

151

Bi-word indices

Query: ETH Zurich

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

153

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Drawback

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
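The biword approach and its false positive can be reproduced with a short sketch. Document IDs 1 and 2 and all function names are illustrative assumptions; tokenization is plain whitespace splitting.

```python
from collections import defaultdict


def biwords(text):
    """Return all pairs of consecutive words, e.g. 'Help ETH', 'ETH Zurich', ..."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def build_biword_index(documents):
    """Map each biword to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """AND together the postings of every biword of the phrase (needs at least two words)."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


documents = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(documents)
print(sorted(phrase_candidates(index, "Help ETH Zurich to flexibly react")))  # [1, 2]
```

Document 2 is returned although it does not contain the phrase: each of the five biwords occurs somewhere in it, just not contiguously.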

166

Part-of-speech tagging

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

NX*N

Help ETH
ETH Zurich
Zurich introduce
introduce techniques
techniques flexibly
flexibly react
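A sketch of how the extended biwords above can be derived once the tags are known. The N/X tags are hard-coded as read from the slide; a real system would obtain them from a part-of-speech tagger. The function name is illustrative.

```python
def extended_biwords(tagged_tokens):
    """Pair each N token with the next N token, skipping any X tokens in between (N X* N)."""
    content_words = [token for token, tag in tagged_tokens if tag == "N"]
    return [f"{first} {second}" for first, second in zip(content_words, content_words[1:])]


# Tags as shown on the slide.
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
print(extended_biwords(tagged))
```

Because only two tags are used, every pattern N X* N is simply a pair of neighbours in the list of N tokens.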

172

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich
ETH Zurich to
Zurich to flexibly
to flexibly react
flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich to
ETH Zurich to flexibly
Zurich to flexibly react
to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative
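The effect of the phrase length can be checked with a sketch that generalizes biwords to sequences of n words. The names `shingles` and `matches` are illustrative, and the check is done against a single document rather than a full index.

```python
def shingles(text, n):
    """Return all sequences of n consecutive words of the text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def matches(document, phrase, n):
    """True if every n-word sequence of the phrase occurs in the document.

    A phrase shorter than n words has no sequences and trivially matches.
    """
    document_shingles = set(shingles(document, n))
    return all(s in document_shingles for s in shingles(phrase, n))


phrase = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"
print(matches(document, phrase, 2))  # True  (false positive with biwords)
print(matches(document, phrase, 3))  # False (true negative with 3-word phrases)
```

With n = 3, the sequence "Zurich to flexibly" is missing from the document, so the conjunction fails as it should.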

176

Drawback

Size of the vocabulary increases exponentially with the phrase length n

(number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (term: document ID, term frequency, positions)

Help: C, 1, [1]
ETH: C, 1, [2]
Zurich: C, 1, [3]
to: C, 3, [4, 7, 11]
flexibly: C, 1, [5]
react: C, 1, [6]

Positional posting: document ID (as before), term frequency, positions

190

Positional index

Query: ETH Zurich

ETH: C, 1, [2]
Zurich: C, 1, [3]

Match: the position of Zurich is the position of ETH +1
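A sketch of a positional index and of phrase search on top of it. Positions are counted from 1 as on the slides; document D and all function names are assumptions added for illustration, and the position check is a simple list lookup rather than an optimized merge.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map term -> {doc_id: [positions]}; the term frequency is the length of the list."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, token in enumerate(text.split(), start=1):
            index[token].setdefault(doc_id, []).append(position)
    return index


def phrase_search(index, phrase):
    """Return the documents in which the terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return []
    hits = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # The k-th following term must sit exactly k positions after the first one.
            if all(start + offset in index[term].get(doc_id, [])
                   for offset, term in enumerate(terms[1:], start=1)):
                hits.append(doc_id)
                break
    return sorted(hits)


documents = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(documents)
print(index["to"]["C"])                                           # [4, 7, 11]
print(phrase_search(index, "ETH Zurich"))                         # ['C', 'D']
print(phrase_search(index, "Help ETH Zurich to flexibly react"))  # ['C']
```

Unlike the biword index, this returns no false positive for document D, and the vocabulary stays the same size as in a standard inverted index.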

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expand

Upon indexing

Query: Lift

The postings lists of Lift and Elevator are expanded in the index; the query stays as typed

Upon querying

Query: Lift

Expansion: Lift OR Elevator

The index stays as built; the query is expanded

[Figure: postings lists of Lift and Elevator before and after expansion]
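
The two options can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the synonym table and the document IDs are made up and are not the values drawn on the slide.

```python
# Hypothetical toy index: term -> sorted list of document IDs.
index = {"lift": [1, 5], "elevator": [4, 6]}
# Hypothetical synonym table used for expansion.
synonyms = {"lift": ["elevator"], "elevator": ["lift"]}


def expand_index(index, synonyms):
    """Index-time expansion: each term also receives its synonyms' postings."""
    expanded = {}
    for term, postings in index.items():
        merged = set(postings)
        for other in synonyms.get(term, []):
            merged.update(index.get(other, []))
        expanded[term] = sorted(merged)
    return expanded


def query_with_expansion(index, synonyms, term):
    """Query-time expansion: evaluate `term OR synonym ...` on the original index."""
    result = set(index.get(term, []))
    for other in synonyms.get(term, []):
        result.update(index.get(other, []))
    return sorted(result)


index_time = expand_index(index, synonyms)["lift"]
query_time = query_with_expansion(index, synonyms, "lift")
```

Both variants return the same documents for the query Lift. Expanding the index makes the postings lists longer, while expanding the query leaves the index untouched and does the extra work at query time.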

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz
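
The character-level rules (removing characters, dropping diacritics, lowercasing) could be combined roughly as below. This is a sketch under assumptions: the function names are illustrative, and diacritics are simply stripped, so Zürich becomes zurich. The transliteration Zürich -> Zuerich needs an extra language-specific rule that is not shown here.

```python
import unicodedata


def strip_diacritics(token):
    """Drop combining marks after canonical decomposition (e.g. an acute accent)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(token):
    """Remove periods and hyphens, strip diacritics, then lowercase."""
    token = token.replace(".", "").replace("-", "")
    return strip_diacritics(token).lower()


examples = {t: normalize(t) for t in ["U.S.A.", "to-day", "clich\u00e9", "Z\u00fcrich", "CAT"]}
```

Applying the same function to both documents and queries is what makes the resulting equivalence classes symmetric.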

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE        valenci -> valence
(m>0) ANCI -> ANCE        hesitanci -> hesitance
(m>0) IZER -> IZE         digitizer -> digitize
(m>0) ABLI -> ABLE        conformabli -> conformable
(m>0) ALLI -> AL          radicalli -> radical
(m>0) ENTLI -> ENT        differentli -> different
(m>0) ELI -> E            vileli -> vile
(m>0) OUSLI -> OUS        analogousli -> analogous
(m>0) IZATION -> IZE      vietnamization -> vietnamize
(m>0) ATION -> ATE        predication -> predicate
(m>0) ATOR -> ATE         operator -> operate
(m>0) ALISM -> AL         feudalism -> feudal
(m>0) IVENESS -> IVE      decisiveness -> decisive
(m>0) FULNESS -> FUL      hopefulness -> hopeful
(m>0) OUSNESS -> OUS      callousness -> callous
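
A minimal sketch of how this one group of rules could be applied. It covers only the rules listed on this slide, not the full Porter algorithm, and the function names are illustrative. The measure m counts vowel-to-consonant transitions in the stem that remains after removing the suffix; the treatment of "y" follows Porter's definition (a consonant unless preceded by a consonant).

```python
# Rules from the slide: (m>0) SUFFIX -> REPLACEMENT.
RULES = {
    "enci": "ence", "anci": "ance", "izer": "ize", "abli": "able",
    "alli": "al", "entli": "ent", "eli": "e", "ousli": "ous",
    "ization": "ize", "ation": "ate", "ator": "ate", "alism": "al",
    "iveness": "ive", "fulness": "ful", "ousness": "ous",
}


def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # "y" is a consonant at the start or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """m = number of transitions from a vowel run to a consonant run."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_vowel:
            m += 1
        prev_vowel = not consonant
    return m


def apply_rules(word):
    # Longest matching suffix wins, so "ization" is tried before "ation".
    for suffix in sorted(RULES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + RULES[suffix] if measure(stem) > 0 else word
    return word
```

For example, `apply_rules("vietnamization")` matches IZATION rather than ATION and yields `vietnamize`, while a word whose remaining stem has m = 0 is left unchanged.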

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

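The Vauquois Triangle

[Figure: the Vauquois triangle. Each side rises from Words through Syntactic Structure and Semantic Structure to Interlingua at the apex; the horizontal links between the two sides are labelled Direct, Syntactic Transfer and Semantic Transfer.]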

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages

that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

[Figure: inverted index. The terms (you, come, most, carefully, upon, your, thine, be, time, Laertes, thy, fair, take, my, hour, is, almost, possess, merely, that, it, should, to, this) are shown alongside their counts and the document IDs in their postings lists.]

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
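
As a reminder, the linear merge over two sorted postings lists can be sketched like this (a minimal version; variable names are illustrative):

```python
def intersect(a, b):
    """Intersect two sorted postings lists with one pass over each."""
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer sitting on the smaller document ID
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
common = intersect(list_a, list_b)
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.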

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: jump straight to 10 in List A

Intersection of A and B: 1 3 10

127

In practice

List: 1 2 3 4 5 6 7 8 9 10 12

Skip pointers to: 4 7 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skips, where p is the number of postings

List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
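
A sketch of intersection with skip pointers, assuming roughly √p evenly spaced skips per list. Representing the skips as a dictionary from index to target index is an illustrative choice, not something prescribed by the slides.

```python
import math


def build_skips(postings):
    """Map a position to the position its skip pointer leads to (about sqrt(p) skips)."""
    p = len(postings)
    step = int(math.sqrt(p))
    if step < 2:
        return {}
    return {i: i + step for i in range(0, p - step, step)}


def intersect_with_skips(a, b):
    skips_a, skips_b = build_skips(a), build_skips(b)
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Follow skips only while the target does not overshoot b[j].
            if i in skips_a and a[skips_a[i]] <= b[j]:
                while i in skips_a and a[skips_a[i]] <= b[j]:
                    i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                while j in skips_b and b[skips_b[j]] <= a[i]:
                    j = skips_b[j]
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
common = intersect_with_skips(list_a, list_b)
```

On these lists, after matching 3 the pointer in List A follows two skips (4 to 7, then 7 to 10) instead of stepping through 4 to 9 one posting at a time.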

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Query: "ETH Zurich"

Query: "Help ETH Zurich to flexibly react"

Rewritten as: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

157

Drawback

Query: "Help ETH Zurich to flexibly react"

Rewritten as: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
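
The false positive can be reproduced with a small biword index. This is a sketch under assumptions: tokenization is plain whitespace splitting, and the document ID "D" for the second sentence is made up for the example.

```python
from collections import defaultdict


def biwords(text):
    """All pairs of adjacent tokens, e.g. 'Help ETH', 'ETH Zurich', ..."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the sorted IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return {bw: sorted(ids) for bw, ids in index.items()}


def phrase_query(index, phrase):
    """AND together the postings of all biwords of the phrase."""
    result = None
    for bw in biwords(phrase):
        postings = set(index.get(bw, []))
        result = postings if result is None else result & postings
    return sorted(result or [])


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",  # hypothetical ID
}
index = build_biword_index(docs)
hits = phrase_query(index, "Help ETH Zurich to flexibly react")
```

Document D contains every biword of the query, so it is returned although the phrase itself never occurs in it. A post-filtering pass over the candidate documents would be needed to remove such matches.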

166

Part-of-speech tagging

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NX*N

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react
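
Given tagged tokens, the extended biwords can be derived as below. This is a sketch only: the N/X tags are assigned by hand as on the slide, whereas a real system would obtain them from a part-of-speech tagger.

```python
def extended_biwords(tagged):
    """Pair each N token with the next N token, ignoring X tokens in between (N X* N)."""
    content = [word for word, tag in tagged if tag == "N"]
    return [f"{a} {b}" for a, b in zip(content, content[1:])]


# Tags copied from the slide: N for content words, X for the words to skip.
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
pairs = extended_biwords(tagged)
```

Because every token is tagged either N or X, dropping the X tokens and pairing neighbours is equivalent to matching the pattern directly.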

172

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

"Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Drawback

Size of the vocabulary increases exponentially with the phrase length n

(number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting: document ID (as before), term frequency, positions

Index

Help -> C, 1: 1

ETH -> C, 1: 2

Zurich -> C, 1: 3

to -> C, 3: 4 7 11

flexibly -> C, 1: 5

react -> C, 1: 6

190

Positional index

Query: "ETH Zurich"

ETH -> C, 1: 2

Zurich -> C, 1: 3

The position of Zurich is the position of ETH +1: match
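
A sketch of a positional index and the +1 position check, assuming whitespace tokenization and positions counted from 1 as on the slides. The second document and its ID "D" are added only to show a non-match.

```python
from collections import defaultdict


def build_positional_index(docs):
    """term -> {doc_id: [positions]}; the term frequency is the length of the list."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """IDs of documents where the k-th phrase term occurs at start position + k."""
    terms = phrase.split()
    if not terms or any(t not in index for t in terms):
        return []
    hits = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            if all(start + k in index[t].get(doc_id, []) for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",  # hypothetical ID
}
index = build_positional_index(docs)
exact = phrase_query(index, "Help ETH Zurich to flexibly react")
```

Unlike the biword index, the positional index returns only document C for the long phrase, because in document D the word flexibly does not follow the first to.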

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists


88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

[Figure: a standard inverted index. Terms shown: you, come, most, carefully, upon, your, thine, be, time, Laertes, thy, fair, take, my, hour, is, almost, possess, merely, that, it, should, to, this. Each term points to its postings list of document IDs (1 to 4).]

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
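The reminder above can be sketched as a linear merge of two sorted postings lists. This is a minimal illustration in Python; the function name `intersect` is chosen for this example.

```python
def intersect(a, b):
    """Intersect two sorted postings lists with one pointer per list."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document ID in both lists: keep it, advance both.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot occur later in b
        else:
            j += 1  # b[j] cannot occur later in a
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each step advances at least one pointer, so the cost is linear in the combined length of the two lists.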

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: jump ahead in List A directly to 10

Intersection of A and B after the jump: 1 3 10

127

In practice

Postings list: 1 2 3 4 5 6 7 8 9 10 12

Skip pointers: 4 7 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: $\sqrt{p}$ skip pointers, where $p$ is the number of postings (the last slide extends the example list with 11 13 14 15 16)
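The trade-off above can be made concrete with a small sketch of intersection with skip pointers. It assumes, as a simplification, that a skip pointer sits at every index that is a multiple of floor(sqrt(p)) and points that many positions ahead; the pointers are simulated with index arithmetic on a Python list rather than stored explicitly. The names `intersect_with_skips` and `_advance` are made up for this illustration.

```python
import math


def _advance(postings, pos, step, target):
    """Move one pointer forward, taking a skip when it cannot miss a match."""
    has_skip = pos % step == 0 and pos + step < len(postings)
    if has_skip and postings[pos + step] <= target:
        return pos + step  # everything skipped is smaller than target
    return pos + 1


def intersect_with_skips(a, b):
    """Intersect two sorted postings lists using sqrt-spaced skips."""
    step_a = max(1, math.isqrt(len(a)))
    step_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = _advance(a, i, step_a, b[j])
        else:
            j = _advance(b, j, step_b, a[i])
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))
```

A skip is only taken when the skip target is still less than or equal to the value being searched for, so no common document ID can be jumped over.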

133

Phrase search

134

Phrase queries

"ETH Zürich"

Quotes = phrase search

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index:

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

151

Bi-word indices

Query: "ETH Zurich"

The query is itself a biword and is looked up directly in the index (Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to).

Query: "Help ETH Zurich to flexibly react"

Rewritten as: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect the postings lists of these biwords.

157

Drawback

Query: "Help ETH Zurich to flexibly react"

Rewritten as: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains all five biwords, but not the phrase itself.

False positive
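The false positive can be reproduced with a small biword index. This is a minimal sketch under simplifying assumptions: tokens are obtained by lowercasing and splitting on whitespace, the document IDs 1 and 2 are hypothetical, and the function names are made up for this illustration.

```python
from collections import defaultdict


def biwords(text):
    """Return the consecutive word pairs of a text."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def build_biword_index(docs):
    """Map each biword to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pair in biwords(text):
            index[pair].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """AND together the postings of all biwords of the phrase."""
    postings = [index.get(pair, set()) for pair in biwords(phrase)]
    if not postings:  # a one-word phrase has no biwords
        return set()
    return set.intersection(*postings)


docs = {
    1: "Help ETH Zurich to flexibly react to new challenges "
       "and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
# Document 2 is returned although it does not contain the phrase.
print(sorted(phrase_candidates(index, "Help ETH Zurich to flexibly react")))
```

The result is therefore only a candidate set; confirming a true phrase match would need an extra check against the documents or a positional index.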

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NX*N

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
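Once every token carries an N or X tag, the NX*N pattern reduces to pairing each N word with the next N word. The sketch below assumes the tags are already given (no real tagger is involved) and uses the tagging from the slide; the function name `extended_biwords` is made up for this illustration.

```python
def extended_biwords(tagged_tokens):
    """Pair consecutive N-tagged words, ignoring the X-tagged words between them."""
    content_words = [word for word, tag in tagged_tokens if tag == "N"]
    return list(zip(content_words, content_words[1:]))


tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
for pair in extended_biwords(tagged):
    print(" ".join(pair))
```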

172

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Rewritten as: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

Rewritten as: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

Rewritten as: "Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect

175

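Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

Rewritten as: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

The document does not contain "Zurich to flexibly", so it is no longer returned.

True negative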

176

Drawback

Size of the vocabulary increases exponentially with the phrase length $n$:

$(\#\text{Terms})^n$

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (term: document ID, term frequency: positions):

Help: C, 1: 1
ETH: C, 1: 2
Zurich: C, 1: 3
to: C, 3: 4 7 11
flexibly: C, 1: 5
react: C, 1: 6

Each entry holds the document ID (as before), the term frequency and the positions. Together they form a positional posting.

190

Positional index

Query: "ETH Zurich"

ETH: C, 1: 2
Zurich: C, 1: 3

Zurich occurs at the position of ETH + 1, so document C matches the phrase.
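The "+1" check generalizes to longer phrases: the k-th term of the phrase must occur k positions after the first term. The following is a minimal sketch assuming whitespace tokenization, lowercasing and positions counted from 1 as on the slides; document D and the function names are hypothetical additions for this illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {document ID: sorted list of positions (1-based)}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Return the documents in which the terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return set()
    # Documents containing every term, regardless of position.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            # Term number k must sit exactly k positions after the start.
            if all(start + k in index[term][doc_id]
                   for k, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)
print(index["to"]["C"])
print(sorted(phrase_query(index, "Help ETH Zurich to flexibly react")))
```

Unlike the biword approach, document D is not returned for the long query, because its terms do not occur at consecutive positions.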

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

89

Expansion (rather than equivalence classes)

Query term -> terms matched in documents

Windows -> Windows
windows -> windows, Windows, window
window -> window, windows

Introduces asymmetry

90

When to expand

Upon indexing: the postings lists of Lift and Elevator are expanded, so that the query Lift also returns documents containing Elevator

Upon querying: the index is left as it is, and the query Lift is expanded to Lift OR Elevator

[Figure: postings lists of Lift and Elevator before and after expansion]
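The two options can be compared in a short sketch. The postings below are hypothetical document IDs chosen for illustration, not the values from the figure, and the synonym table and function names are assumptions of this example.

```python
# Hypothetical synonym table and postings, for illustration only.
SYNONYMS = {"lift": {"elevator"}, "elevator": {"lift"}}
INDEX = {"lift": {1, 5}, "elevator": {4, 6}}


def expand_index(index, synonyms):
    """Expansion upon indexing: store each document under the synonyms too."""
    expanded = {term: set(postings) for term, postings in index.items()}
    for term, alternatives in synonyms.items():
        for alt in alternatives:
            expanded.setdefault(term, set()).update(index.get(alt, set()))
    return expanded


def query_with_expansion(index, term, synonyms):
    """Expansion upon querying: evaluate term OR synonym_1 OR ..."""
    result = set(index.get(term, set()))
    for alt in synonyms.get(term, set()):
        result |= index.get(alt, set())
    return result


# Both strategies return the same documents for the query "lift".
print(sorted(expand_index(INDEX, SYNONYMS)["lift"]))
print(sorted(query_with_expansion(INDEX, "lift", SYNONYMS)))
```

Expanding the index makes the postings lists longer but keeps queries simple; expanding the query keeps the index small but adds an OR at query time.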

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz
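Both normalization steps can be sketched with the Python standard library. The German transliteration table (ü to ue, and so on) is an assumption added here so that Zürich becomes Zuerich as on the slide; plain Unicode accent stripping alone would give Zurich.

```python
import unicodedata

# Assumed German-specific transliterations, applied before accent stripping.
GERMAN = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss",
})


def strip_accents(token):
    """Drop combining marks after canonical decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )


def normalize(token):
    """Transliterate umlauts, strip remaining accents, then lowercase."""
    return strip_accents(token.translate(GERMAN)).lower()


for token in ["cliché", "Zürich", "CAT", "Schweiz"]:
    print(token, "->", normalize(token))
```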

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the word


90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to
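As a rough sketch, a biword index can be built by pairing each token with its successor. The whitespace tokenizer, the document ID 1 and the names `biwords` and `build_biword_index` are assumptions made for this illustration.

```python
from collections import defaultdict


def biwords(text):
    """Return the list of consecutive word pairs of a text (naive whitespace tokenizer)."""
    tokens = text.split()
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


def build_biword_index(documents):
    """Map each biword to the sorted list of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return {biword: sorted(doc_ids) for biword, doc_ids in index.items()}


# Hypothetical collection with a single document (ID 1).
documents = {
    1: "Help ETH Zurich to flexibly react to new challenges "
       "and to set new accents in the future",
}
index = build_biword_index(documents)
print(biwords(documents[1])[:6])
```

Each biword then behaves like an ordinary term with its own postings list.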

151

Bi-word indices

Query: ETH Zurich

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Query: Help ETH Zurich to flexibly react

Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect the postings lists of these biwords

157

Drawback

Query: Help ETH Zurich to flexibly react

Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive: the document contains every biword of the query, but not the phrase itself
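The false positive can be reproduced in a few lines. This is a sketch under simple assumptions (whitespace tokenization, a single document checked directly rather than through postings lists); the helper names are made up for the illustration.

```python
def biwords(text):
    """Consecutive word pairs of a text (naive whitespace tokenizer)."""
    tokens = text.split()
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


def matches_all_biwords(query, document):
    """True if every biword of the query occurs somewhere in the document."""
    document_biwords = set(biwords(document))
    return all(biword in document_biwords for biword in biwords(query))


def contains_phrase(query, document):
    """True if the query tokens occur contiguously, in order, in the document."""
    q, d = query.split(), document.split()
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(" AND ".join(biwords(query)))
print(matches_all_biwords(query, document))  # biword test accepts the document
print(contains_phrase(query, document))      # but the phrase is not there
```

One common remedy, shown here only as `contains_phrase`, is to post-filter the candidate documents against the actual text.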

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NX*N

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
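A small sketch of how the extended biwords could be extracted once the tokens are tagged. The tags below are assigned by hand to mirror the slide (N for the words that are kept, X for the function words that are skipped); no real part-of-speech tagger is involved, and `extended_biwords` is a hypothetical helper.

```python
def extended_biwords(tagged_tokens):
    """Pair each N token with the next N token, skipping any X tokens in between."""
    result = []
    previous_n = None
    for token, tag in tagged_tokens:
        if tag != "N":
            continue  # X tokens never appear in an extended biword
        if previous_n is not None:
            result.append(f"{previous_n} {token}")
        previous_n = token
    return result


# Hand-assigned tags (assumption for this example).
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
print(extended_biwords(tagged))
```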

172

Bi-word indices

Query: Help ETH Zurich to flexibly react

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Rewritten as: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect the postings lists of these biwords

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

Rewritten as: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect the postings lists of these three-word phrases

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

Rewritten as: Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect the postings lists of these four-word phrases

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Rewritten as: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative
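The effect of the phrase length can be checked with a generalized version of the biword test. This is a sketch with a naive whitespace tokenizer; `kwords` and `matches_all_kwords` are names invented for the example.

```python
def kwords(text, k):
    """All runs of k consecutive words in a text (naive whitespace tokenizer)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def matches_all_kwords(query, document, k):
    """True if every k-word phrase of the query occurs in the document."""
    document_kwords = set(kwords(document, k))
    return all(phrase in document_kwords for phrase in kwords(query, k))


query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(matches_all_kwords(query, document, 2))  # biwords: false positive
print(matches_all_kwords(query, document, 3))  # three-word phrases: true negative
```

With k = 3, the phrase "Zurich to flexibly" is missing from the document, so the document is correctly rejected.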

176

Drawback

Size of the vocabulary increases exponentially: (number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Positional posting: document ID (as before), term frequency, positions

Help: document C, term frequency 1, positions 1

ETH: document C, term frequency 1, positions 2

Zurich: document C, term frequency 1, positions 3

to: document C, term frequency 3, positions 4 7 11

flexibly: document C, term frequency 1, positions 5

react: document C, term frequency 1, positions 6
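A positional index for this example could be built along the following lines. This is only a sketch: it assumes whitespace tokenization and 1-based positions, and the nested-dictionary layout and the name `build_positional_index` are choices made for the illustration.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map term -> {document ID -> list of 1-based positions of the term}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


documents = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_positional_index(documents)

for term in ["Help", "ETH", "Zurich", "to", "flexibly", "react"]:
    positions = index[term]["C"]
    # The term frequency is simply the number of recorded positions.
    print(term, "C", len(positions), positions)
```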

190

Positional index

Query: ETH Zurich

ETH: document C, term frequency 1, positions 2

Zurich: document C, term frequency 1, positions 3

Match: position of Zurich = position of ETH + 1
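The "+1" check generalizes to phrases of any length: the k-th term after the first must appear k positions later. Below is a minimal sketch that takes a small hand-written positional index as input; the layout (term -> document ID -> positions) and the name `phrase_match` are assumptions of this example.

```python
def phrase_match(index, phrase):
    """Return {document ID: start positions} where the phrase occurs contiguously."""
    terms = phrase.split()
    if any(term not in index for term in terms):
        return {}
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = {}
    for doc_id in candidates:
        starts = [
            start
            for start in index[terms[0]][doc_id]
            if all(start + offset in index[term][doc_id]
                   for offset, term in enumerate(terms[1:], start=1))
        ]
        if starts:
            hits[doc_id] = starts
    return hits


# Hand-written excerpt of the positional index for document C.
index = {
    "ETH": {"C": [2]},
    "Zurich": {"C": [3]},
    "to": {"C": [4, 7, 11]},
    "flexibly": {"C": [5]},
}
print(phrase_match(index, "ETH Zurich"))
print(phrase_match(index, "Zurich flexibly"))
```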

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

91

When to expand

Upon indexing: the postings lists of Lift and Elevator are expanded when the index is built, so the query Lift is answered directly from the expanded list

Upon querying: the index is left unchanged, and the query Lift is expanded to Lift OR Elevator
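The two options can be compared on a toy index. The postings below (Lift in documents 1 and 5, Elevator in documents 4 and 6) are made up for this sketch, as are the function names; both strategies should return the same documents for the query Lift.

```python
def expand_index(index, equivalence_class):
    """Index-time expansion: every term of the class gets the merged postings list."""
    merged = sorted(set().union(*(index.get(term, []) for term in equivalence_class)))
    expanded = dict(index)
    for term in equivalence_class:
        expanded[term] = merged
    return expanded


def query_with_expansion(index, term, equivalence_class):
    """Query-time expansion: rewrite the term as an OR over its equivalence class."""
    terms = equivalence_class if term in equivalence_class else [term]
    return sorted(set().union(*(index.get(t, []) for t in terms)))


# Hypothetical postings lists, for illustration only.
index = {"Lift": [1, 5], "Elevator": [4, 6]}
synonyms = ["Lift", "Elevator"]

print(expand_index(index, synonyms)["Lift"])          # larger index, plain lookup
print(query_with_expansion(index, "Lift", synonyms))  # unchanged index, Lift OR Elevator
```

Index-time expansion costs space but keeps queries simple; query-time expansion keeps the index small but does more work per query.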

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich
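One way to sketch this normalization in Python uses the standard `unicodedata` module. Plain diacritic stripping turns "Zürich" into "Zurich", so obtaining "Zuerich" needs an explicit transliteration table; the table below is a small assumption for German, not a complete mapping.

```python
import unicodedata

# Assumed transliteration table for German umlauts and sharp s (not exhaustive).
GERMAN_TRANSLITERATION = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text):
    """Transliterate German umlauts first, then strip any remaining diacritics."""
    text = unicodedata.normalize("NFC", text)  # make sure umlauts are single characters
    for source, target in GERMAN_TRANSLITERATION.items():
        text = text.replace(source, target)
    return strip_diacritics(text)


print(normalize("cliché"))
print(normalize("Zürich"))
print(strip_diacritics("Zürich"))
```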

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi

feature -> featur

are -> ar

easily -> easili

visible -> visibl

variations -> variat

individual -> individu

Chop letters of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE    valenci -> valence

(m>0) ANCI -> ANCE    hesitanci -> hesitance

(m>0) IZER -> IZE    digitizer -> digitize

(m>0) ABLI -> ABLE    conformabli -> conformable

(m>0) ALLI -> AL    radicalli -> radical

(m>0) ENTLI -> ENT    differentli -> different

(m>0) ELI -> E    vileli -> vile

(m>0) OUSLI -> OUS    analogousli -> analogous

(m>0) IZATION -> IZE    vietnamization -> vietnamize

(m>0) ATION -> ATE    predication -> predicate

(m>0) ATOR -> ATE    operator -> operate

(m>0) ALISM -> AL    feudalism -> feudal

(m>0) IVENESS -> IVE    decisiveness -> decisive

(m>0) FULNESS -> FUL    hopefulness -> hopeful

(m>0) OUSNESS -> OUS    callousness -> callous
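The rules listed above can be tried out with a short sketch. It covers only these fifteen suffix rules, not the full Porter stemmer, and it assumes the usual definition of the measure m as the number of vowel-consonant sequences in the stem, with y counted as a vowel when it follows a consonant. The function names are invented for this example.

```python
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u, or a y after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number of vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    previous_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m


def apply_rules(word):
    """Apply the rule with the longest matching suffix, if its stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: len(rule[0]), reverse=True):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for word in ["valenci", "vietnamization", "predication", "hopefulness"]:
    print(word, "->", apply_rules(word))
```

Trying the longest suffix first matters: "vietnamization" also ends in ATION, but the IZATION rule must win.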

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

[Figure: on each side of the triangle, the levels Words, Syntactic Structure and Semantic Structure, with Interlingua at the apex; the two sides are linked by Direct, Syntactic Transfer and Semantic Transfer]

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

[Figure: standard inverted index. The dictionary holds the terms you, come, most, carefully, upon, your, thine, be, time, Laertes, thy, fair, take, my, hour, is, almost, possess, merely, that, it, should, to, this; each term points to its postings list of document IDs]

107

Reminder Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
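As a reminder, the intersection of two sorted postings lists can be sketched as a linear merge with one cursor per list. The function name `intersect` and the use of Python lists are choices made for this illustration.

```python
def intersect(a, b):
    """Intersect two sorted lists of document IDs in O(len(a) + len(b)) steps."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])  # document appears in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1               # advance the cursor that lags behind
        else:
            j += 1
    return result                # one list is exhausted: end


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```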

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B, built step by step: 1, then 1 3


92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Drawback

Query: "Help ETH Zurich to flexibly react"

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
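The false positive can be reproduced with a minimal sketch of a biword index. It assumes whitespace tokenization and case-sensitive terms; the document IDs 1 and 2 and all function names are illustrative and not taken from the lecture.

```python
def biwords(text):
    """Return the list of adjacent word pairs of a text."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map every biword to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index.setdefault(biword, set()).add(doc_id)
    return index


def phrase_query(index, phrase):
    """Intersect the postings of all biwords of the phrase."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "Help ETH Zurich to flexibly react to new challenges "
       "and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
query = "Help ETH Zurich to flexibly react"
hits = phrase_query(index, query)

# Document 2 contains every biword of the query, but not the phrase itself.
false_positives = {d for d in hits if query not in docs[d]}
print(sorted(hits), sorted(false_positives))
```

Document 2 is returned although the phrase never occurs in it, because each biword of the query appears somewhere in the document.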

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Resulting biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
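As a rough sketch of how these biwords can be derived: assuming the tags are already given (here assigned by hand, as on the slide) and only the two tags N and X occur, an N X* N match simply pairs each N term with the next N term. The function name is hypothetical.

```python
def extended_biwords(tagged):
    """Pair each N term with the next N term, skipping X terms in between."""
    kept = [word for word, tag in tagged if tag == "N"]
    return list(zip(kept, kept[1:]))


# Tags assigned by hand, following the slide (N kept, X skipped).
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
pairs = extended_biwords(tagged)
for left, right in pairs:
    print(left, right)
```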

172

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

176

Drawback

Size of the vocabulary increases exponentially with the phrase length n: (#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting: document ID (as before), term frequency, positions

Index (term: document ID, term frequency: positions)

Help: C, 1: 1
ETH: C, 1: 2
Zurich: C, 1: 3
to: C, 3: 4, 7, 11
flexibly: C, 1: 5
react: C, 1: 6

190

Positional index

Query: "ETH Zurich"

Index (term: document ID, term frequency: positions)

ETH: C, 1: 2
Zurich: C, 1: 3

Match: position of Zurich = position of ETH +1
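The +1 check can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization, 1-based positions, and case-sensitive terms; the second document ID "D" and the function names are hypothetical. The term frequency is not stored separately here, since it equals the length of the position list.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {document ID: [positions]}, positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Return documents where the terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return set()
    # Candidate documents contain every term of the phrase.
    candidates = set.intersection(*(set(index[term]) for term in terms))
    hits = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            # The k-th term must appear at position start + k.
            if all(start + k in index[term][doc_id]
                   for k, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)
print(index["to"]["C"])
print(phrase_query(index, "ETH Zurich"))
print(phrase_query(index, "Help ETH Zurich to flexibly react"))
```

Unlike the biword index, the longer query no longer matches document D, because the positions of its terms there are not consecutive.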

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

93

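When to expand

Upon indexing: the postings lists of Lift and Elevator are expanded in the index; the query "Lift" is looked up as is.

Upon querying: the index keeps separate postings lists for Lift and Elevator; the query "Lift" is expanded to Lift OR Elevator.

(Figure: postings lists of Lift and Elevator with and without expansion.)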

94

Normalization: accents and diacritics

cliché -> cliche

Zürich -> Zuerich

95

Normalization: lowercasing

CAT -> cat

Schweiz -> schweiz
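Both normalization steps can be combined in a short sketch. It assumes a small hand-written transliteration table for German umlauts (to obtain "ue" rather than a bare "u") and strips all remaining combining marks; the table and the function name are illustrative, not part of the lecture.

```python
import unicodedata

# Hand-written table (an assumption): German umlauts and sharp s.
GERMAN_TRANSLITERATION = {
    "\u00e4": "ae",  # a with diaeresis
    "\u00f6": "oe",  # o with diaeresis
    "\u00fc": "ue",  # u with diaeresis
    "\u00c4": "Ae",
    "\u00d6": "Oe",
    "\u00dc": "Ue",
    "\u00df": "ss",  # sharp s
}


def normalize(token):
    """Transliterate umlauts, strip other diacritics, then lowercase."""
    token = unicodedata.normalize("NFC", token)
    for source, target in GERMAN_TRANSLITERATION.items():
        token = token.replace(source, target)
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.lower()


for word in ["clich\u00e9", "Z\u00fcrich", "CAT", "Schweiz"]:
    print(word, "->", normalize(word))
```

Lowercasing is applied last here, so the result for Zürich is "zuerich"; whether to lowercase at all is a separate design decision, as the truecasing slide suggests.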

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE: valenci -> valence
(m>0) ANCI -> ANCE: hesitanci -> hesitance
(m>0) IZER -> IZE: digitizer -> digitize
(m>0) ABLI -> ABLE: conformabli -> conformable
(m>0) ALLI -> AL: radicalli -> radical
(m>0) ENTLI -> ENT: differentli -> different
(m>0) ELI -> E: vileli -> vile
(m>0) OUSLI -> OUS: analogousli -> analogous
(m>0) IZATION -> IZE: vietnamization -> vietnamize
(m>0) ATION -> ATE: predication -> predicate
(m>0) ATOR -> ATE: operator -> operate
(m>0) ALISM -> AL: feudalism -> feudal
(m>0) IVENESS -> IVE: decisiveness -> decisive
(m>0) FULNESS -> FUL: hopefulness -> hopeful
(m>0) OUSNESS -> OUS: callousness -> callous
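The rules listed on this slide can be tried out with a small sketch. It covers only these fifteen suffix rules, not the full Porter stemmer, and assumes lowercase input; the measure m counts vowel-consonant sequences in the stem, with y treated as a vowel when it follows a consonant. The function names are illustrative.

```python
# Suffix rules from the slide, all guarded by the condition (m > 0).
# "ization" is listed before "ation" so that the longer suffix wins.
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"),
    ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"),
    ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    """A letter other than a, e, i, o, u, and other than y after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the vowel-consonant sequences, i.e. m in [C](VC)^m[V]."""
    m = 0
    previous_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_is_vowel:
            m += 1
        previous_is_vowel = not consonant
    return m


def apply_rules(word):
    """Apply the first matching rule if the remaining stem has m > 0."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 0:
                return stem + replacement
            return word
    return word


for word in ["valenci", "digitizer", "vietnamization", "callousness"]:
    print(word, "->", apply_rules(word))
```

The condition matters for short words: "ration" ends in ATION, but its stem "r" has m = 0, so the word is left unchanged.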

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Original: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

(Figure: on each side of the triangle, the levels Words, Syntactic Structure, and Semantic Structure lead up to the Interlingua at the top; the two sides are connected by Direct, Syntactic Transfer, and Semantic Transfer.)

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

(Figure: inverted index with the terms you, come, most, carefully, upon, your, thine, be, time, Laertes, thy, fair, take, my, hour, is, almost, possess, merely, that, it, should, to, this, each pointing to its postings list of document IDs.)

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
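As a reminder in code, a minimal sketch of the two-pointer intersection, assuming both postings lists are sorted and contain no duplicates. The function and variable names are illustrative.

```python
def intersect(a, b):
    """Intersect two sorted postings lists by advancing two pointers."""
    result = []
    i, j = 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document ID in both lists: keep it, advance both.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))  # [1, 4, 8, 12]
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.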

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: jump in List A directly to 10

Intersection of A and B: 1 3 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip pointers: 4, 7, 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skip pointers, where p is the number of postings

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
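A possible sketch of intersection with skip pointers, assuming sorted lists without duplicates and roughly √p evenly spaced skip pointers per list. Skip pointers are modelled here as a dictionary from list index to target index, which is a simplification chosen for illustration; all names are hypothetical.

```python
import math


def skip_targets(postings):
    """Place about sqrt(p) evenly spaced skip pointers (index -> index)."""
    p = len(postings)
    step = math.isqrt(p)
    if step < 2:
        return {}
    return {i: i + step for i in range(0, p - step, step)}


def advance(postings, skips, i, bound):
    """Move past postings[i], following skips that do not overshoot bound."""
    if i in skips and postings[skips[i]] <= bound:
        while i in skips and postings[skips[i]] <= bound:
            i = skips[i]
        return i
    return i + 1


def intersect_with_skips(a, b):
    """Two-pointer intersection that takes skip pointers when possible."""
    skips_a, skips_b = skip_targets(a), skip_targets(b)
    result = []
    i, j = 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, skips_a, i, b[j])
        else:
            j = advance(b, skips_b, j, a[i])
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))  # [1, 3, 10, 12]
```

A skip is only taken when its target is not larger than the current value in the other list, so no common document ID can be jumped over.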

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich" (quotes = phrase search)

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

151

Bi-word indices

Query: "ETH Zurich"

Index: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization: truecasing

The company Apple has launched a new iProduct -> Apple

Apples fall from trees -> apples

Machine learning inside

97

Stemming

analysis -> analysi
feature -> featur
are -> ar
easily -> easili
visible -> visibl
variations -> variat
individual -> individu

Chop letters off the end of the word

98

Porter Stemmer

https://tartarus.org/martin/PorterStemmer

(m>0) ENCI -> ENCE        valenci -> valence
(m>0) ANCI -> ANCE        hesitanci -> hesitance
(m>0) IZER -> IZE         digitizer -> digitize
(m>0) ABLI -> ABLE        conformabli -> conformable
(m>0) ALLI -> AL          radicalli -> radical
(m>0) ENTLI -> ENT        differentli -> different
(m>0) ELI -> E            vileli -> vile
(m>0) OUSLI -> OUS        analogousli -> analogous
(m>0) IZATION -> IZE      vietnamization -> vietnamize
(m>0) ATION -> ATE        predication -> predicate
(m>0) ATOR -> ATE         operator -> operate
(m>0) ALISM -> AL         feudalism -> feudal
(m>0) IVENESS -> IVE      decisiveness -> decisive
(m>0) FULNESS -> FUL      hopefulness -> hopeful
(m>0) OUSNESS -> OUS      callousness -> callous
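The rules above can be tried out with a small sketch. It covers only the suffix rules listed on this slide, not the full Porter stemmer, and the names `RULES`, `measure` and `apply_rules` are hypothetical. It assumes the usual Porter definition of the measure m (the number of vowel-run/consonant-run pairs in the stem) and that the longest matching suffix is the one tested.

```python
import re

# Suffix rewrite rules copied from the slide; each applies only when m > 0.
RULES = {
    "enci": "ence", "anci": "ance", "izer": "ize", "abli": "able",
    "alli": "al", "entli": "ent", "eli": "e", "ousli": "ous",
    "ization": "ize", "ation": "ate", "ator": "ate", "alism": "al",
    "iveness": "ive", "fulness": "ful", "ousness": "ous",
}


def measure(stem):
    """Return m, the number of vowel-run/consonant-run pairs in the stem."""
    kinds = ""
    for ch in stem:
        # 'y' counts as a vowel only when it follows a consonant.
        is_vowel = ch in "aeiou" or (ch == "y" and kinds.endswith("c"))
        kinds += "v" if is_vowel else "c"
    return len(re.findall(r"v+c+", kinds))


def apply_rules(word):
    """Rewrite the longest matching suffix if the remaining stem has m > 0."""
    matches = [suffix for suffix in RULES if word.endswith(suffix)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    stem = word[: -len(suffix)]
    return stem + RULES[suffix] if measure(stem) > 0 else word


print(apply_rules("valenci"))         # stem "val" has m = 1
print(apply_rules("vietnamization"))  # IZATION wins over ATION
print(apply_rules("ration"))          # stem "r" has m = 0, so no change
```

The condition m > 0 is what keeps a short word such as "ration" from being rewritten to "rate".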

99

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

101

The Vauquois Triangle

[Figure: the Vauquois triangle. Each side rises from Words through Syntactic Structure to Semantic Structure, with Interlingua at the apex. Labels on the crossing paths: Direct, Transfer, Syntactic, Semantic.]

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help, and can even degrade performance, for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

[Figure: standard inverted index. Terms shown: you, come, most, carefully, upon, your, thine, be, time, Laertes, thy, fair, take, my, hour, is, almost, possess, merely, that, it, should, to, this. Each term points to its postings list of document IDs (documents 1 to 4).]

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
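The intersection reminded here can be sketched as the usual two-pointer merge over sorted postings lists. This is a minimal illustration, assuming both lists are sorted and duplicate-free; the function name `intersect` is hypothetical.

```python
def intersect(a, b):
    """Intersect two sorted postings lists with one pointer per list."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document ID in both lists: keep it, advance both pointers.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear in b any more
        else:
            j += 1  # b[j] cannot appear in a any more
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))  # [1, 4, 8, 12], as on the slide
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.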

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer: after 3, the next entry in B is 10, so a pointer that jumps straight to 10 in A avoids stepping through 4 to 9

Intersection of A and B: 1 3 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skips: 4 7 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: about √p evenly spaced skip pointers, where p is the number of postings

Example list: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 (p = 16, so √p = 4)
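The skip-pointer idea can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: skip pointers are placed every ⌊√p⌋ entries, they are stored in a separate dictionary rather than inside the postings, and the names `skip_targets`, `advance` and `intersect_with_skips` are hypothetical.

```python
import math


def skip_targets(postings):
    """Map index -> index reached by its skip pointer, spaced about sqrt(p) apart."""
    step = math.isqrt(len(postings))
    if step < 2:
        return {}  # list too short for skips to help
    return {i: i + step for i in range(0, len(postings) - step, step)}


def advance(postings, skips, pos, target):
    """Move past postings[pos], following skips while they do not overshoot target."""
    if pos in skips and postings[skips[pos]] <= target:
        while pos in skips and postings[skips[pos]] <= target:
            pos = skips[pos]
        return pos
    return pos + 1


def intersect_with_skips(a, b):
    skips_a, skips_b = skip_targets(a), skip_targets(b)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, skips_a, i, b[j])
        else:
            j = advance(b, skips_b, j, a[i])
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))  # [1, 3, 10, 12]
```

A skip is taken only when the entry it lands on is still less than or equal to the value being searched for, so no match can be jumped over. On the example, after matching 3 the pointer in A moves from 4 to 7 to 10 instead of visiting every entry in between.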

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Query: "ETH Zurich"

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

The query is itself a biword, so it is looked up directly in the index

153

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Index entries: Help ETH, ETH Zurich, Zurich to, to flexibly, flexibly react, react to

Rewritten query: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect the postings lists of these biwords

157

Drawback

Query: "Help ETH Zurich to flexibly react"

Rewritten query: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive: the document contains all five biwords, but not the phrase
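The biword approach and its false-positive problem can be reproduced with a minimal sketch. The document IDs "C" and "D", the whitespace tokenizer, the lowercasing, and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def biwords(text):
    """Return the consecutive word pairs of a text as strings."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """AND together the postings of every biword in the phrase."""
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
query = "Help ETH Zurich to flexibly react"

print(sorted(phrase_candidates(index, query)))  # ['C', 'D']
print(query.lower() in docs["D"].lower())       # False: D is a false positive
```

Document D is returned because every biword of the query occurs somewhere in it, even though the words never appear as one contiguous phrase.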

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
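The extended biwords above can be derived with a short sketch. A real system would use a part-of-speech tagger; here a tiny hand-written set of function words (the hypothetical `FUNCTION_WORDS`) stands in for the X tag, and every other token is treated as N.

```python
# Toy stand-in for a part-of-speech tagger: tokens in this set are tagged X,
# everything else is tagged N. The word list is an assumption for this example.
FUNCTION_WORDS = {"to", "so", "as", "the", "a", "an", "of", "in", "and"}


def extended_biwords(text):
    """Pair each N token with the next N token, skipping any X tokens between."""
    content = [tok for tok in text.lower().split() if tok not in FUNCTION_WORDS]
    return list(zip(content, content[1:]))


sentence = "Help ETH Zurich to introduce techniques so as to flexibly react"
for pair in extended_biwords(sentence):
    print(" ".join(pair))
```

Dropping the X tokens and pairing neighbouring N tokens is equivalent to matching the pattern N X* N, which is why "techniques so as to flexibly" collapses to the extended biword "techniques flexibly".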

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Index entries: Help ETH Zurich, ETH Zurich to, Zurich to flexibly, to flexibly react, flexibly react to

Rewritten query: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect the postings lists of these three-word terms

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Index entries: Help ETH Zurich to, ETH Zurich to flexibly, Zurich to flexibly react, to flexibly react to

Rewritten query: "Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect the postings lists of these four-word terms

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

Rewritten query: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative: the document does not contain "Zurich to flexibly"
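The effect of longer index terms on this document can be checked with a small sketch. The helper names `kwords` and `may_contain`, the whitespace tokenizer, and the lowercasing are assumptions for illustration; the check compares the k-word terms of the query with those of a single document instead of going through a full index.

```python
def kwords(text, k):
    """Return the set of all runs of k consecutive words in the text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def may_contain(document, phrase, k):
    """True if every k-word term of the phrase also occurs in the document."""
    return kwords(phrase, k) <= kwords(document, k)


document = "Help ETH Zurich to introduce techniques to flexibly react"
query = "Help ETH Zurich to flexibly react"

print(may_contain(document, query, 2))  # True: false positive with biwords
print(may_contain(document, query, 3))  # False: "zurich to flexibly" is missing
```

With k = 3 the term "zurich to flexibly" has no counterpart in the document, so the document is correctly rejected.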

176

Drawback

Size of the vocabulary increases exponentially with the phrase length n: up to (#Terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting: document ID (as before), term frequency, positions

Index

Help: C, 1, positions 1

ETH: C, 1, positions 2

Zurich: C, 1, positions 3

to: C, 3, positions 4 7 11

flexibly: C, 1, positions 5

react: C, 1, positions 6

Query: "ETH Zurich"

ETH: C, 1, positions 2

Zurich: C, 1, positions 3

The position of Zurich (3) is the position of ETH (2) + 1, so document C matches the phrase
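The positional index and the "+1" check can be sketched as follows. This is a minimal illustration: positions start at 1 as on the slides, tokens are split on whitespace and lowercased, the term frequency is simply the length of the position list, and the second document "D" and all function names are assumptions.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, with positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split(), start=1):
            index[token].setdefault(doc_id, []).append(position)
    return index


def phrase_search(index, phrase):
    """Return documents where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, {}) for term in terms]
    # Documents that contain every term of the phrase.
    candidates = set(postings[0]).intersection(*postings[1:])
    hits = set()
    for doc_id in candidates:
        for start in postings[0][doc_id]:
            # The k-th term must sit exactly k positions after the first one.
            if all(start + k in postings[k][doc_id] for k in range(1, len(terms))):
                hits.add(doc_id)
                break
    return hits


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)

print(index["to"]["C"])                      # [4, 7, 11], term frequency 3
print(sorted(phrase_search(index, "ETH Zurich")))
print(sorted(phrase_search(index, "Help ETH Zurich to flexibly react")))
```

Unlike the biword index, the positional index rejects document D for the long query, because "flexibly" does not directly follow the first "to" there.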

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

How could we make this faster?

Intersection so far: 1 3

Magical pointer: with list B positioned on 10, list A jumps straight to 10 instead of stepping through every posting in between. 10 is then added to the intersection.

In practice

Postings list: 1 2 3 4 5 6 7 8 9 10 12

Skip pointers: 4 7 10

Too short skips: many comparisons, waste of space.

Too long skips: few comparisons, but not many real opportunities to skip.

In practice: √p skip pointers, where p is the number of postings.

Postings list: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
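A sketch of intersection with skip pointers may help here. It assumes the heuristic on the slide is "about √p evenly spaced skip pointers for a list of p postings"; the helper names `build_skips`, `advance` and `intersect_with_skips` are hypothetical, and storing skips in a dictionary is a simplification of pointers embedded in the list.

```python
import math


def build_skips(postings):
    """Map a position to its skip target, using a span of about sqrt(p)."""
    p = len(postings)
    step = math.isqrt(p)
    if step < 2:
        return {}  # Lists this short gain nothing from skipping.
    return {i: i + step for i in range(0, p - step, step)}


def advance(postings, skips, pos, target):
    """Move past postings[pos], following skips that do not overshoot target."""
    if pos in skips and postings[skips[pos]] <= target:
        while pos in skips and postings[skips[pos]] <= target:
            pos = skips[pos]
        return pos
    return pos + 1


def intersect_with_skips(a, b):
    skips_a, skips_b = build_skips(a), build_skips(b)
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, skips_a, i, b[j])
        else:
            j = advance(b, skips_b, j, a[i])
    return result


long_list = list(range(1, 17))      # 16 postings, skip span 4
short_list = [1, 3, 10, 11, 12]

print(build_skips(long_list))                       # {0: 4, 4: 8, 8: 12}
print(intersect_with_skips(long_list, short_list))  # [1, 3, 10, 11, 12]
```

A skip is taken only when its target does not exceed the value being searched for, so no match can be jumped over.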

Phrase search

Phrase queries

"ETH Zürich"

Quotes = phrase search

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms.

Phrase search approaches

Biword indices

Positional indices

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index:

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Bi-word indices

Index:

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Query: ETH Zurich

This is a single term of the index: ETH Zurich.

Query: Help ETH Zurich to flexibly react

Intersect: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Drawback

Query: Help ETH Zurich to flexibly react

Intersect: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains every biword of the query, but not the phrase itself.

False positive
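The false positive can be reproduced with a small sketch of a biword index. The document IDs 1 and 2, the function names, and the whitespace tokenization are assumptions made for illustration; the final substring check stands in for a real verification step and only works here because the texts are plain space-separated words.

```python
from collections import defaultdict


def biwords(text):
    """Return the consecutive word pairs of a text."""
    words = text.split()
    return [f"{w1} {w2}" for w1, w2 in zip(words, words[1:])]


def build_biword_index(docs):
    """Map each biword to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """Intersect the postings of all biwords of the phrase."""
    terms = biwords(phrase)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
query = "Help ETH Zurich to flexibly react"

candidates = phrase_candidates(index, query)
true_matches = {d for d in candidates if query in docs[d]}

print(sorted(candidates))    # [1, 2]  (document 2 is a false positive)
print(sorted(true_matches))  # [1]
```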

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help (N) ETH (N) Zurich (N) to (X) introduce (N) techniques (N) so (X) as (X) to (X) flexibly (N) react (N)

Pattern: N X* N

Extended biwords:

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react
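The extended biwords can be derived mechanically once every word carries a tag. The sketch below reads the slide's NXN pattern as N X* N (an N, any number of X, then an N), which is an assumption; the tags are assigned by hand to match the slide rather than produced by a real part-of-speech tagger, and `extended_biwords` is a hypothetical name.

```python
def extended_biwords(tagged):
    """Pair each N word with the next N word, skipping any X words in between.

    tagged: list of (word, tag) pairs, where tag is "N" or "X".
    """
    content = [word for word, tag in tagged if tag == "N"]
    return [f"{a} {b}" for a, b in zip(content, content[1:])]


# Hand-assigned tags (assumption): function words are X, everything else is N.
tagged_sentence = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]

for biword in extended_biwords(tagged_sentence):
    print(biword)
```

Because only two tags exist, two neighbours in the filtered list are always separated by X words only, so each pair is a match of N X* N.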

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index:

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Intersect: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index:

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Intersect: Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Intersect: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

Drawback

Size of the vocabulary increases exponentially: (number of terms)^n

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

A positional posting holds the document ID (as before), the term frequency, and the positions.

Index (term: document ID, term frequency, positions):

Help: C, 1, [1]

ETH: C, 1, [2]

Zurich: C, 1, [3]

to: C, 3, [4, 7, 11]

flexibly: C, 1, [5]

react: C, 1, [6]

Positional index

Query: ETH Zurich

ETH: C, 1, [2]

Zurich: C, 1, [3]

Both terms occur in document C, and the position of Zurich (3) is the position of ETH (2) +1, so the phrase matches.
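The +1 check generalizes to longer phrases: the k-th word of the query must sit k positions after the first one. Below is a minimal sketch under stated assumptions: positions are 1-based as on the slides, tokenization is a plain whitespace split, the second document "D" is hypothetical, and the term frequency is taken as the length of the position list instead of being stored separately.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {document ID: [1-based positions of the term]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.split(), start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Return the documents in which the words occur at consecutive positions."""
    words = phrase.split()
    if not words or any(w not in index for w in words):
        return set()
    # Candidate documents contain every word of the phrase.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    matches = set()
    for doc_id in candidates:
        for start in index[words[0]][doc_id]:
            # Word number k must be found at position start + k.
            if all(start + k in index[w][doc_id] for k, w in enumerate(words)):
                matches.add(doc_id)
                break
    return matches


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",  # hypothetical
}
index = build_positional_index(docs)

print(index["to"]["C"])                                                   # [4, 7, 11]
print(sorted(phrase_query(index, "ETH Zurich")))                          # ['C', 'D']
print(sorted(phrase_query(index, "Help ETH Zurich to flexibly react")))   # ['C']
```

Unlike the biword approach, document D is correctly rejected for the long query, and the vocabulary does not grow.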

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

Porter Stemmer

https://tartarus.org/martin/PorterStemmer

(m>0) ENCI -> ENCE          valenci -> valence
(m>0) ANCI -> ANCE          hesitanci -> hesitance
(m>0) IZER -> IZE           digitizer -> digitize
(m>0) ABLI -> ABLE          conformabli -> conformable
(m>0) ALLI -> AL            radicalli -> radical
(m>0) ENTLI -> ENT          differentli -> different
(m>0) ELI -> E              vileli -> vile
(m>0) OUSLI -> OUS          analogousli -> analogous
(m>0) IZATION -> IZE        vietnamization -> vietnamize
(m>0) ATION -> ATE          predication -> predicate
(m>0) ATOR -> ATE           operator -> operate
(m>0) ALISM -> AL           feudalism -> feudal
(m>0) IVENESS -> IVE        decisiveness -> decisive
(m>0) FULNESS -> FUL        hopefulness -> hopeful
(m>0) OUSNESS -> OUS        callousness -> callous
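To make the (m>0) condition concrete, here is a sketch that applies only the rules listed above. It is not a full Porter stemmer: the other steps of the algorithm are left out, and the names `RULES`, `measure` and `apply_rules` are hypothetical. The measure m counts the vowel-consonant sequences in the stem that remains after the suffix is removed, with y treated as a vowel when it follows a consonant.

```python
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]


def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u, or y after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of vowel-to-consonant transitions in the stem."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_vowel:
            m += 1
        prev_vowel = not consonant
    return m


def apply_rules(word):
    """Apply the longest matching suffix rule if the remaining stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


print(apply_rules("vietnamization"))  # vietnamize
print(apply_rules("operator"))        # operate
print(apply_rules("nation"))          # nation (stem "n" has m = 0)
```

Trying longer suffixes first keeps IZATION from being shadowed by ATION.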

Porter Stemmer

Source: the textbook (Introduction to Information Retrieval)

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

Lemmatization

Building equivalence classes or expanding with Natural Language Processing

The Vauquois Triangle

[Figure: the Vauquois triangle. From bottom to top on each side: Words, Syntactic Structure, Semantic Structure, with Interlingua at the apex. Transfer levels between the two sides: Direct, Syntactic, Semantic.]

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

Lemmatization or Stemming

Lemmatization does not help and can even degrade performance for English documents.

Lemmatization can help with languages that have more morphology.

Skip lists

Reminder: Postings (Standard Inverted Index)

[Figure: a standard inverted index in which each term points to its postings list. Terms shown: you, come, most, carefully, upon, your, thine, be, time, Laertes, thy, fair, take, my, hour, is, almost, possess, merely, that, it, should, to, this. The postings are document IDs between 1 and 4.]

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12 (End)



187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

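The Vauquois Triangle

[Figure: the Vauquois triangle. On each side, the levels rise from Words through Syntactic Structure and Semantic Structure to the Interlingua at the apex. The crossings between the two sides are labeled Direct, Syntactic Transfer, and Semantic Transfer.]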

102

Lemmatization

Full morphological analysis

computer, compute, computes, computed, computation, computing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages

that have more morphology

105

Skip lists

106

Reminder: Postings (Standard Inverted Index)

[Figure: a standard inverted index. Each term (you, come, most, carefully, upon, your, thine, be, time, Laertes, thy, fair, take, my, hour, is, almost, possess, merely, that, it, should, to, this) points to its postings list of document IDs.]

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12
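The walk through the two lists above can be sketched in Python as a two-pointer merge. This is a minimal illustration, not the lecture's own code; the function name `intersect` is an assumption, and both lists are assumed to be sorted by document ID.

```python
def intersect(a, b):
    """Intersect two sorted postings lists by advancing two pointers."""
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document ID in both lists: keep it and advance both pointers.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind, advance it
        else:
            j += 1  # b is behind, advance it
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect(list_a, list_b))  # [1, 3, 10, 12]
```

Each pointer only moves forward, so the number of comparisons is at most the sum of the two list lengths.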

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B so far: 1 3

Magical pointer to 10: List A jumps ahead to 10, the next entry in List B.

127

In practice

List: 1 2 3 4 5 6 7 8 9 10 12

Skip pointers to: 4 7 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p evenly spaced skip pointers, where p is the number of postings

List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
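A hedged sketch of intersection with skip pointers follows. It assumes roughly √p evenly spaced skips per list, stored as a dictionary from list index to target index; the names `build_skips` and `intersect_with_skips` are hypothetical, and a real index would store the pointers inside the postings list itself.

```python
import math


def build_skips(postings):
    """Place about sqrt(p) evenly spaced skip pointers (index -> target index)."""
    p = len(postings)
    step = max(1, math.isqrt(p))
    return {i: i + step for i in range(0, p - step, step)}


def intersect_with_skips(a, b):
    """Two-pointer intersection that follows a skip when it does not overshoot."""
    skips_a, skips_b = build_skips(a), build_skips(b)
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            if i in skips_a and a[skips_a[i]] <= b[j]:
                # Everything between i and the skip target is smaller than b[j].
                while i in skips_a and a[skips_a[i]] <= b[j]:
                    i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                while j in skips_b and b[skips_b[j]] <= a[i]:
                    j = skips_b[j]
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))  # [1, 3, 10, 12]
```

On this example, after matching 3 the pointer into List A follows two skips (from 4 to 7, then from 7 to 10) instead of stepping through every posting.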

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Query: ETH Zurich

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Inconvenient

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
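The false positive can be reproduced with a small bi-word index. This is a sketch under stated assumptions: the document IDs "C" and "D" and all function names are hypothetical, and tokenization is plain whitespace splitting.

```python
from collections import defaultdict


def biwords(text):
    """Return every pair of consecutive words as one bi-word term."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each bi-word term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in biwords(text):
            index[term].add(doc_id)
    return index


def phrase_query(index, phrase):
    """Intersect the postings of all bi-words of the phrase."""
    postings = [index.get(term, set()) for term in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)

print(sorted(phrase_query(index, "ETH Zurich")))
# ['C', 'D']
print(sorted(phrase_query(index, "Help ETH Zurich to flexibly react")))
# ['C', 'D']  (D is a false positive: it contains every bi-word, but not the phrase)
```

Document D contains all five bi-words of the query, just not contiguously, so the intersection cannot tell it apart from a real match.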

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

NXN

Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
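As a rough sketch of how these extended bi-words could be derived, the snippet below pairs each N word with the next N word and skips any X words in between. The tags are supplied by hand to mirror the slide (no tagger is run), and the function name `extended_biwords` is an assumption.

```python
def extended_biwords(tagged):
    """Pair each N-tagged word with the next N-tagged word, skipping X words."""
    content = [word for word, tag in tagged if tag == "N"]
    return [f"{left} {right}" for left, right in zip(content, content[1:])]


# Hand-assigned tags, matching the N / X labels on the slide.
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]

for term in extended_biwords(tagged):
    print(term)
# Help ETH
# ETH Zurich
# Zurich introduce
# introduce techniques
# techniques flexibly
# flexibly react
```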

172

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative
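The effect of longer phrase terms can be checked with a short sketch. The helper names `k_word_terms` and `matches` are hypothetical, and tokenization is plain whitespace splitting.

```python
def k_word_terms(text, k):
    """Return every run of k consecutive words in the text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def matches(document, phrase, k):
    """True if every k-word term of the phrase occurs somewhere in the document.

    Phrases shorter than k words yield no terms and are not handled here.
    """
    doc_terms = set(k_word_terms(document, k))
    return all(term in doc_terms for term in k_word_terms(phrase, k))


query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(matches(document, query, 2))  # True: the bi-word false positive
print(matches(document, query, 3))  # False: "Zurich to flexibly" is missing
```

With three-word terms, the document no longer contains "Zurich to flexibly", so it is correctly rejected.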

176

Inconvenient

Size of the vocabulary increases exponentially: (number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Term      Document ID   Term frequency   Positions
Help      C             1                1
ETH       C             1                2
Zurich    C             1                3
to        C             3                4 7 11
flexibly  C             1                5
react     C             1                6

Each entry (document ID as before, term frequency, positions) is a positional posting.
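A minimal sketch of building such a positional index is shown below. It assumes whitespace tokenization and 1-based positions as on the slide; the function name `build_positional_index` and the nested-dictionary layout are illustrative choices, not the lecture's own implementation.

```python
def build_positional_index(docs):
    """Map term -> {document ID: [positions]}, with positions starting at 1."""
    index = {}
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
}
index = build_positional_index(docs)

for term in ["Help", "ETH", "Zurich", "to", "flexibly", "react"]:
    positions = index[term]["C"]
    # Term frequency is simply the number of recorded positions.
    print(term, "C", len(positions), positions)
# Help C 1 [1]
# ETH C 1 [2]
# Zurich C 1 [3]
# to C 3 [4, 7, 11]
# flexibly C 1 [5]
# react C 1 [6]
```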

190

Positional index

Query: ETH Zurich

Term      Document ID   Term frequency   Positions
ETH       C             1                2
Zurich    C             1                3

+1: the position of Zurich is the position of ETH plus one.
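The "+1" check generalizes to longer phrases: the k-th word of the phrase must occur k positions after the first. The sketch below illustrates this under the same assumptions as before (whitespace tokens, 1-based positions); document "D" and both function names are hypothetical.

```python
def build_positional_index(docs):
    """Map term -> {document ID: [positions]}, with positions starting at 1."""
    index = {}
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index


def phrase_positions(index, phrase):
    """Return {document ID: [start positions]} where the phrase occurs contiguously."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return {}
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = {}
    for doc_id in sorted(candidates):
        starts = [
            start for start in index[terms[0]][doc_id]
            if all(start + offset in index[term][doc_id]
                   for offset, term in enumerate(terms[1:], start=1))
        ]
        if starts:
            hits[doc_id] = starts
    return hits


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)

print(phrase_positions(index, "ETH Zurich"))
# {'C': [2], 'D': [2]}
print(phrase_positions(index, "Help ETH Zurich to flexibly react"))
# {'C': [1]}
```

Unlike the bi-word index, the positional check rejects document D for the longer query, because "flexibly" does not directly follow "to" at position 4 there.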

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Drawback

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Help ETH Zurich to introduce techniques to flexibly react

False positive: every bi-word of the query occurs in the document, but the phrase itself does not
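The false positive can be reproduced with a few lines of code. This is a minimal sketch, assuming whitespace tokenization; the helper names `biwords` and `matches_biword_query` are hypothetical and only serve the illustration.

```python
def biwords(text):
    """Return the adjacent word pairs of a text, in order."""
    words = text.split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def matches_biword_query(query, document):
    """True if every bi-word of the query occurs somewhere in the document."""
    doc_pairs = set(biwords(document))
    return all(pair in doc_pairs for pair in biwords(query))


query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(biwords(query))
# ['Help ETH', 'ETH Zurich', 'Zurich to', 'to flexibly', 'flexibly react']
print(matches_biword_query(query, document))  # True: the bi-word query matches
print(query in document)                      # False: the exact phrase is absent
```

The AND of bi-words only checks that each pair occurs somewhere in the document, not that the pairs follow one another, which is why the longer document is returned.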

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

NXN

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react
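The extended bi-words on this slide can be derived mechanically once each word carries an N or X tag. The sketch below takes the tags from the slide as given input (it does not implement a tagger), and the function name `extended_biwords` is an assumption for illustration.

```python
def extended_biwords(tagged):
    """Pair each N word with the next N word, skipping X words in between."""
    content_words = [word for word, tag in tagged if tag == "N"]
    return [f"{a} {b}" for a, b in zip(content_words, content_words[1:])]


# Tags copied from the slide: 7 words tagged N, 4 words tagged X.
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]

for pair in extended_biwords(tagged):
    print(pair)
```

Dropping the X words first and then pairing neighbours gives the same result as matching an N, any number of X words, and another N.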

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react
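To see why longer index terms remove this particular false positive, the bi-word check can be generalized to terms of n consecutive words. This is a sketch under the same whitespace-tokenization assumption; `nwords` and `matches_phrase_index` are hypothetical names.

```python
def nwords(text, n):
    """Return all runs of n consecutive words of a text, in order."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def matches_phrase_index(query, document, n):
    """True if every n-word term of the query occurs in the document."""
    doc_terms = set(nwords(document, n))
    return all(term in doc_terms for term in nwords(query, n))


query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(matches_phrase_index(query, document, 2))  # True: false positive with bi-words
print(matches_phrase_index(query, document, 3))  # False: "Zurich to flexibly" is missing
```

With three-word terms the query requires "Zurich to flexibly", which the document does not contain, so it is correctly rejected.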

176

Drawback

Size of the vocabulary increases exponentially

(#Terms)^n for index terms made of n words

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Positional posting: document ID (as before), term frequency, positions

Help: C, 1, [1]

ETH: C, 1, [2]

Zurich: C, 1, [3]

to: C, 3, [4, 7, 11]

flexibly: C, 1, [5]

react: C, 1, [6]
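A positional index like the one built up on these slides can be sketched as a nested dictionary. The layout `term -> {document ID -> positions}`, the function name `build_positional_index`, and the 1-based word positions are assumptions chosen to match the numbers on the slide; the term frequency is simply the length of the position list.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}, counting positions from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future"
}
index = build_positional_index(docs)

print(index["Help"])          # {'C': [1]}
print(index["to"])            # {'C': [4, 7, 11]}
print(len(index["to"]["C"]))  # 3, the term frequency of "to" in document C
```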

190

Positional index

Help: C, 1, [1]

ETH: C, 1, [2]

Zurich: C, 1, [3]

to: C, 3, [4, 7, 11]

flexibly: C, 1, [5]

react: C, 1, [6]

ETH Zurich|

191

Positional index

ETH: C, 1, [2]

Zurich: C, 1, [3]

ETH Zurich|

+1: in document C, Zurich (position 3) directly follows ETH (position 2), so the phrase matches
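The "+1" check can be written out as a small phrase query over a positional index. This is only a sketch: the index layout `term -> {document ID -> positions}` and the names `build_positional_index` and `phrase_query` are assumptions for the example, and the positions are scanned naively rather than merged.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}, counting positions from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


def phrase_query(index, phrase):
    """Return (doc_id, start position) for every exact occurrence of the phrase."""
    terms = phrase.split()
    first, rest = terms[0], terms[1:]
    hits = []
    for doc_id, starts in index.get(first, {}).items():
        for start in starts:
            # The k-th following term must sit exactly k positions further on.
            if all(start + offset in index.get(term, {}).get(doc_id, [])
                   for offset, term in enumerate(rest, start=1)):
                hits.append((doc_id, start))
    return hits


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future"
}
index = build_positional_index(docs)

print(phrase_query(index, "ETH Zurich"))  # [('C', 2)]
print(phrase_query(index, "ETH to"))      # []
print(phrase_query(index, "to set new"))  # [('C', 11)]
```

Unlike the bi-word approach, this check cannot produce false positives, because the positions themselves are compared.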

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with languages that have richer morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

[Figure: dictionary of terms (you, come, most, carefully, upon, your, thine, be, time, Laertes, thy, fair, take, my, hour, is, almost, possess, merely, that, it, should, to, this), each pointing to its postings list]

107

Reminder Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
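As a reminder of how this intersection is computed, here is a minimal sketch of the linear merge over two sorted postings lists. The function name `intersect` is an assumption for illustration, and both lists are assumed to be sorted in increasing order without duplicates.

```python
def intersect(a, b):
    """Intersect two sorted postings lists with one pointer per list."""
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document ID in both lists: keep it and advance both pointers.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))  # [1, 4, 8, 12]
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.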

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3 10 12

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Intersection of A and B: 1 3

Magical pointer to 10 (skipping 4 to 9 in List A)

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip pointers to 4, 7, 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Too short skips: many comparisons, waste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Too long skips: few comparisons, not many real opportunities to skip

132

In practice

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

√p skip pointers, where p is the number of postings
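The effect of skip pointers on the intersection can be sketched as follows. Several choices here are assumptions for illustration only: skip pointers are simulated by index arithmetic on Python lists rather than stored in the postings, they are placed every ⌊√p⌋ postings, and the name `intersect_with_skips` is hypothetical.

```python
import math


def intersect_with_skips(a, b):
    """Intersect two sorted postings lists, following skip pointers when safe."""
    skip_a = max(1, math.isqrt(len(a)))  # skip length of about sqrt(p)
    skip_b = max(1, math.isqrt(len(b)))
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # A skip pointer exists only at every skip_a-th posting. Follow it
            # only if its target does not overshoot the current posting of b.
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))  # [1, 3, 10, 12]
```

On this example the pointer in List A jumps from 4 to 7 and from 7 to 10 instead of visiting every posting in between, and the result is the same as with the plain merge.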

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Query: ETH Zurich

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

157

Drawback

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
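The false positive can be reproduced with a small bi-word index. This is a minimal sketch, assuming whitespace tokenization without any normalization; the document IDs 1 and 2 and the function names are hypothetical.

```python
from collections import defaultdict


def biwords(text):
    """Return all pairs of consecutive words in the text."""
    words = text.split()
    return [f"{w1} {w2}" for w1, w2 in zip(words, words[1:])]


def build_biword_index(docs):
    """Map each bi-word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index


def phrase_query(index, phrase):
    """AND together the postings of all bi-words of the phrase."""
    postings = [index.get(bw, set()) for bw in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
hits = phrase_query(index, "Help ETH Zurich to flexibly react")
print(sorted(hits))  # [1, 2]
```

Document 2 contains every bi-word of the query but not the phrase itself, so it is returned as a false positive.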

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Extended biwords: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
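The extraction of these bi-words can be sketched as follows. The tags are hard-coded as read from the slide (N for the kept words, X for to, so, as, to); no real part-of-speech tagger is involved, and the function name `extended_biwords` is our own.

```python
def extended_biwords(tagged):
    """Pair each N word with the next N word, skipping any X words in between."""
    kept = [word for word, tag in tagged if tag == "N"]
    return [f"{w1} {w2}" for w1, w2 in zip(kept, kept[1:])]


tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
for biword in extended_biwords(tagged):
    print(biword)
```

Dropping the X words and pairing neighbouring N words yields exactly the sequences of the form N X* N.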

172

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative
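The effect of longer phrases can be checked directly. This is a minimal sketch, assuming whitespace tokenization; the helper names `kwords` and `matches` are hypothetical, and the check compares one document with one query instead of building a full index.

```python
def kwords(text, k):
    """Return all sequences of k consecutive words in the text."""
    words = text.split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def matches(doc, query, k):
    """True if every k-word phrase of the query also occurs in the document."""
    doc_phrases = set(kwords(doc, k))
    return all(phrase in doc_phrases for phrase in kwords(query, k))


doc = "Help ETH Zurich to introduce techniques to flexibly react"
query = "Help ETH Zurich to flexibly react"
print(matches(doc, query, 2))  # True
print(matches(doc, query, 3))  # False
```

With k = 2 the document is still matched (false positive); with k = 3 the phrase "Zurich to flexibly" is missing from the document, so it is correctly rejected.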

176

Drawback

Size of the vocabulary increases exponentially: (number of terms)^n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Positional posting: document ID (as before), term frequency, positions

Help: C, 1, [1]

ETH: C, 1, [2]

Zurich: C, 1, [3]

to: C, 3, [4, 7, 11]

flexibly: C, 1, [5]

react: C, 1, [6]
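Building such an index can be sketched in a few lines. This is a minimal sketch, assuming whitespace tokenization and 1-based positions as on the slide; the nested-dictionary layout and the function name are our own choices.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {document ID: list of 1-based positions}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
}
index = build_positional_index(docs)
for term in ["Help", "ETH", "Zurich", "to", "flexibly", "react"]:
    positions = index[term]["C"]
    # Print the term, document ID, term frequency and positions.
    print(term, "C", len(positions), positions)
```

The term frequency does not need to be stored separately here, since it is the length of the position list.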

190

Positional index

Query: ETH Zurich

ETH: C, 1, [2]

Zurich: C, 1, [3]

Position of Zurich = position of ETH + 1
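The "+1" check generalizes to longer phrases: the k-th following term must appear k positions after the first one. This is a minimal sketch, assuming whitespace tokenization and 1-based positions; document D and the function names are hypothetical additions for illustration.

```python
def build_positional_index(docs):
    """Map each term to {document ID: list of 1-based positions}."""
    index = {}
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index


def phrase_search(index, phrase):
    """Return {document ID: start positions} where the terms occur consecutively."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return {}
    hits = {}
    for doc_id, starts in index[terms[0]].items():
        found = [
            start for start in starts
            if all(start + offset in index[term].get(doc_id, [])
                   for offset, term in enumerate(terms[1:], start=1))
        ]
        if found:
            hits[doc_id] = found
    return hits


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)
print(phrase_search(index, "ETH Zurich"))  # {'C': [2], 'D': [2]}
print(phrase_search(index, "Help ETH Zurich to flexibly react"))  # {'C': [1]}
```

Unlike the bi-word index, the positional index does not return document D for the long query, because "flexibly" does not directly follow the first "to" there.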

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

106

Reminder: Postings (Standard Inverted Index)

[Figure: standard inverted index; each term (you, come, most, carefully, upon, your, thine, be, time, Laertes, thy, fair, take, my, hour, is, almost, possess, merely, that, it, should, to, this) points to its postings list]

107

Reminder: Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

End

Intersection of A and B: 1 4 8 12

108

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B:

109

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1

111

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3



108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

Postings list: 1 2 3 4 5 6 7 8 9 10 12

Skip pointers lead to: 4 7 10

129

In practice

Too short skips: many comparisons, waste of space

130

In practice

Too long skips: few comparisons, not many real opportunities to skip

132

In practice

Postings list: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Use √p skip pointers, where p is the number of postings
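A possible sketch of intersection with skip pointers, assuming roughly √p evenly spaced skips per list as suggested above. The helper names `build_skips` and `intersect_with_skips` are hypothetical, and storing skips in a dictionary from index to index is only one of several reasonable layouts.

```python
import math


def build_skips(postings):
    """Return {index: target index} with a step of about sqrt(p)."""
    p = len(postings)
    step = max(1, math.isqrt(p))
    return {i: i + step for i in range(0, p - step, step)}


def intersect_with_skips(a, b):
    """Merge-intersect two sorted lists, following skips that do not overshoot."""
    skips_a, skips_b = build_skips(a), build_skips(b)
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Take the skip only if its target is still <= the other list's entry.
            if i in skips_a and a[skips_a[i]] <= b[j]:
                i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                j = skips_b[j]
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]  # position of 14 is an assumption
print([list_a[t] for t in build_skips(list_a).values()])  # [4, 7, 10]
print(intersect_with_skips(list_a, list_b))  # [1, 3, 10, 12]
```

A skip is safe whenever its target does not exceed the current entry of the other list, because every posting jumped over is strictly smaller than that entry and cannot match.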

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

138

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

141

Phrase search approaches

Biword indices

Positional indices

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

152

Bi-word indices

Query: ETH Zurich

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

156

Bi-word indices

Query: Help ETH Zurich to flexibly react

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Rewritten query: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect
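The biword approach above can be sketched as follows. This is a minimal illustration under simplifying assumptions: tokens are split on whitespace, the document ID "C" and all function names are chosen here for the example, and postings are kept as Python sets rather than sorted lists.

```python
from collections import defaultdict


def biwords(text):
    """Return the consecutive word pairs of a text, in order."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index


def phrase_query(index, phrase):
    """AND together the postings of every biword in the phrase."""
    postings = [index.get(bw, set()) for bw in biwords(phrase)]
    # A one-word phrase has no biwords; this sketch returns no results for it.
    return set.intersection(*postings) if postings else set()


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_biword_index(docs)
print(biwords(docs["C"])[:3])  # ['Help ETH', 'ETH Zurich', 'Zurich to']
print(phrase_query(index, "Help ETH Zurich to flexibly react"))  # {'C'}
```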

165

Drawback

Query: Help ETH Zurich to flexibly react

Rewritten query: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains all five biwords, but not the phrase itself.

False positive

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help N, ETH N, Zurich N, to X, introduce N, techniques N, so X, as X, to X, flexibly N, react N

Pattern: NX*N

Resulting biwords:

Help ETH
ETH Zurich
Zurich introduce
introduce techniques
techniques flexibly
flexibly react
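A small sketch of how the NX*N pattern could be applied once tokens carry N or X tags. The tags below are copied by hand from the slide; no real part-of-speech tagger is involved, and `extended_biwords` is a name invented for this example.

```python
def extended_biwords(tagged):
    """Pair each N token with the next N token, skipping any X tokens between.

    Dropping the X tokens and pairing neighbours is equivalent to matching
    the pattern N X* N over the tagged sequence.
    """
    kept = [token for token, tag in tagged if tag == "N"]
    return [f"{left} {right}" for left, right in zip(kept, kept[1:])]


# Tags as shown on the slide (hand-assigned).
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
for biword in extended_biwords(tagged):
    print(biword)
```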

172

Bi-word indices

Query: Help ETH Zurich to flexibly react

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Rewritten query: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index

Help ETH Zurich
ETH Zurich to
Zurich to flexibly
to flexibly react
flexibly react to

Rewritten query: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index

Help ETH Zurich to
ETH Zurich to flexibly
Zurich to flexibly react
to flexibly react to

Rewritten query: "Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Rewritten query: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative
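To make the difference between biwords and longer phrase indices concrete, the check can be sketched for a single document. The function names `ngrams` and `matches` are illustrative, and whitespace tokenization is an assumption.

```python
def ngrams(text, n):
    """Return the set of n-word sequences occurring in the text."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def matches(document, phrase, n):
    """True if every n-gram of the phrase also occurs in the document."""
    return ngrams(phrase, n) <= ngrams(document, n)


document = "Help ETH Zurich to introduce techniques to flexibly react"
query = "Help ETH Zurich to flexibly react"

print(matches(document, query, 2))  # True: false positive with biwords
print(matches(document, query, 3))  # False: "Zurich to flexibly" is missing
```

With n = 3 the document is correctly rejected, but a longer gap between the repeated words could still fool any fixed n.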

178

Drawback

Size of the vocabulary increases exponentially with the phrase length n: (number of terms)^n

189

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Positional posting: document ID (as before), term frequency, positions

Index

Help: document C, term frequency 1, positions 1
ETH: document C, term frequency 1, positions 2
Zurich: document C, term frequency 1, positions 3
to: document C, term frequency 3, positions 4 7 11
flexibly: document C, term frequency 1, positions 5
react: document C, term frequency 1, positions 6
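One way to build such a positional index is sketched below. The nested dictionary layout (term to document ID to list of positions) and the name `build_positional_index` are assumptions made for illustration; positions are counted from 1 as on the slides, and the term frequency is simply the length of the position list.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_positional_index(docs)

positions = index["to"]["C"]
print("to", "C", len(positions), positions)  # to C 3 [4, 7, 11]
```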

192

Positional index

Query: ETH Zurich

ETH: document C, term frequency 1, positions 2
Zurich: document C, term frequency 1, positions 3

The position of Zurich is the position of ETH +1, so document C matches the phrase.
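The +1 check can be sketched as a phrase query over positional postings. The index literal below copies the postings shown on the slides; the function name `phrase_docs` and the dictionary layout are assumptions for this example, and a production system would merge sorted position lists instead of using membership tests.

```python
def phrase_docs(index, phrase):
    """Return IDs of documents where the phrase terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return set()
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    result = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            # The k-th term must sit exactly k positions after the first one.
            if all(start + k in index[term][doc_id]
                   for k, term in enumerate(terms)):
                result.add(doc_id)
                break
    return result


# Postings as shown on the slides: term -> {document ID: positions}.
index = {
    "Help": {"C": [1]},
    "ETH": {"C": [2]},
    "Zurich": {"C": [3]},
    "to": {"C": [4, 7, 11]},
    "flexibly": {"C": [5]},
    "react": {"C": [6]},
}
print(phrase_docs(index, "ETH Zurich"))  # {'C'}
print(phrase_docs(index, "Zurich ETH"))  # set()
```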

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists



111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

138

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

141

Phrase search approaches

Biword indices

Positional indices

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

Query: "ETH Zurich"

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

156

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

165

Drawback

Query: "Help ETH Zurich to flexibly react"

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive: the document contains all five bi-words but not the phrase itself

171

Part-of-speech tagging

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N (words tagged X between two N words are skipped)

Extended bi-words: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
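The extraction of extended bi-words from a tagged sentence can be sketched as below. The tags are written by hand, as on the slide; no real part-of-speech tagger is involved, and the function name `extended_biwords` is an assumption.

```python
def extended_biwords(tagged):
    """Drop the words tagged X, then pair each remaining word with its successor."""
    kept = [word for word, tag in tagged if tag == "N"]
    return [f"{first} {second}" for first, second in zip(kept, kept[1:])]


# Hand-written tags, following the N/X labels shown on the slide.
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
print(extended_biwords(tagged))
```

Because "so as to" is dropped, "techniques" and "flexibly" become neighbours and form a bi-word even though they are three words apart in the text.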

172

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

"Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"
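A small sketch of a k-word phrase index shows the effect. It assumes whitespace tokenization; the document IDs `D1` and `D2`, the shortened text of `D1`, and all function names are hypothetical.

```python
from collections import defaultdict


def k_words(text, k):
    """All runs of k consecutive words in the text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def build_index(docs, k):
    """Map every k-word phrase to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for phrase in k_words(text, k):
            index[phrase].add(doc_id)
    return index


def phrase_candidates(index, query, k):
    """AND together the postings of all k-word phrases of the query."""
    parts = k_words(query, k)
    if not parts:
        return set()
    result = set(index.get(parts[0], set()))
    for part in parts[1:]:
        result &= index.get(part, set())
    return result


docs = {
    "D1": "Help ETH Zurich to flexibly react to new challenges",
    "D2": "Help ETH Zurich to introduce techniques to flexibly react",
}
query = "Help ETH Zurich to flexibly react"

print(phrase_candidates(build_index(docs, 2), query, 2))  # D1 and D2
print(phrase_candidates(build_index(docs, 3), query, 3))  # only D1
```

With bi-words (k = 2), `D2` is returned although it does not contain the phrase. With k = 3 it is filtered out, because "Zurich to flexibly" never occurs in it.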

178

Drawback

Size of the vocabulary increases exponentially with the phrase length n: (number of terms)^n

189

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Positional posting: document ID (as before), term frequency, positions

Help: document C, term frequency 1, position 1

ETH: document C, term frequency 1, position 2

Zurich: document C, term frequency 1, position 3

to: document C, term frequency 3, positions 4 7 11

flexibly: document C, term frequency 1, position 5

react: document C, term frequency 1, position 6

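190

Positional index

Help: document C, term frequency 1, position 1

ETH: document C, term frequency 1, position 2

Zurich: document C, term frequency 1, position 3

to: document C, term frequency 3, positions 4 7 11

flexibly: document C, term frequency 1, position 5

react: document C, term frequency 1, position 6

Query: "ETH Zurich"

192

Positional index

ETH: document C, term frequency 1, position 2

Zurich: document C, term frequency 1, position 3

Query: "ETH Zurich"

+1: in document C, Zurich occurs one position after ETH, so the phrase matches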
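The positional lookup can be sketched as follows, assuming whitespace tokenization and positions counted from 1 as on the slides. The function names and the dictionary layout are assumptions made for this illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """term -> {document ID: [positions]}; the term frequency is the list length."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Documents in which the query terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return set()
    matches = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # The term at offset k must sit at position start + k in the same document.
            if all(
                start + offset in index[term].get(doc_id, [])
                for offset, term in enumerate(terms)
            ):
                matches.add(doc_id)
                break
    return matches


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_positional_index(docs)
print(index["to"]["C"])                   # [4, 7, 11]
print(phrase_query(index, "ETH Zurich"))  # {'C'}
print(phrase_query(index, "Zurich ETH"))  # set()
```

Unlike a bi-word index, the vocabulary stays the same; only the postings grow, since every occurrence of a term is recorded.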

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to
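The construction shown on these slides can be sketched as follows, assuming whitespace tokenization and no normalization. The names `biwords` and `build_biword_index` and the document ID 1 are hypothetical.

```python
from collections import defaultdict

def biwords(text):
    """Return every pair of adjacent words in the text, in order."""
    tokens = text.split()
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]

def build_biword_index(docs):
    """Map each biword to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index

sentence = ("Help ETH Zurich to flexibly react to new challenges "
            "and to set new accents in the future")
index = build_biword_index({1: sentence})
print(biwords(sentence)[:6])
# ['Help ETH', 'ETH Zurich', 'Zurich to', 'to flexibly', 'flexibly react', 'react to']
print(sorted(index["ETH Zurich"]))  # [1]
```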

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react|

Intersect

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive
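The false positive can be reproduced with a small sketch. It assumes whitespace tokenization; `biword_match` and `exact_phrase_match` are hypothetical helper names.

```python
def biwords(text):
    """Return the set of adjacent word pairs in the text."""
    tokens = text.split()
    return {f"{first} {second}" for first, second in zip(tokens, tokens[1:])}

def biword_match(query, document):
    """True if every biword of the query occurs somewhere in the document."""
    return biwords(query) <= biwords(document)

def exact_phrase_match(query, document):
    """True only if the query words occur contiguously and in order."""
    q, d = query.split(), document.split()
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(biword_match(query, document))        # True
print(exact_phrase_match(query, document))  # False
```

Every biword of the query appears in the document, so the AND of biwords matches, yet the phrase itself does not occur.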

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

N N N X N N X X X N N

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

N N N X N N X X X N N

NX*N

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

N N N X N N X X X N N

NX*N

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react
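A minimal sketch of how these extended biwords could be derived, assuming the tags are already given (N for content words, X for function words, as on the slide). No real part-of-speech tagger is used, and `extended_biwords` is a hypothetical name.

```python
def extended_biwords(tagged_tokens):
    """Pair each N word with the next N word, ignoring any X words in between."""
    content_words = [token for token, tag in tagged_tokens if tag == "N"]
    return [f"{first} {second}"
            for first, second in zip(content_words, content_words[1:])]

tokens = "Help ETH Zurich to introduce techniques so as to flexibly react".split()
tags = "N N N X N N X X X N N".split()  # hand-assigned, as on the slide

pairs = extended_biwords(list(zip(tokens, tags)))
print(pairs)
# ['Help ETH', 'ETH Zurich', 'Zurich introduce', 'introduce techniques',
#  'techniques flexibly', 'flexibly react']
```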

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react|

Intersect

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react
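The effect of indexing longer word sequences can be checked with a short sketch, assuming whitespace tokenization. `ngrams` and `ngram_match` are hypothetical names.

```python
def ngrams(text, n):
    """Return the set of all runs of n adjacent words in the text."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_match(query, document, n):
    """True if every n-word sequence of the query occurs in the document."""
    return ngrams(query, n) <= ngrams(document, n)

query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(ngram_match(query, document, 2))  # True: false positive with biwords
print(ngram_match(query, document, 3))  # False: "Zurich to flexibly" is missing
```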

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

(number of terms)^n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

(Document C)

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help → C, 1: ⟨1⟩

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help → C, 1: ⟨1⟩

C: document ID (as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help → C, 1: ⟨1⟩

Term frequency: 1

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help → C, 1: ⟨1⟩

Positions: ⟨1⟩

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help → C, 1: ⟨1⟩

Positional posting: C, 1: ⟨1⟩

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help → C, 1: ⟨1⟩

ETH → C, 1: ⟨2⟩

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help → C, 1: ⟨1⟩

ETH → C, 1: ⟨2⟩

Zurich → C, 1: ⟨3⟩

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help → C, 1: ⟨1⟩

ETH → C, 1: ⟨2⟩

Zurich → C, 1: ⟨3⟩

to → C, 3: ⟨4, 7, 11⟩

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help → C, 1: ⟨1⟩

ETH → C, 1: ⟨2⟩

Zurich → C, 1: ⟨3⟩

to → C, 3: ⟨4, 7, 11⟩

flexibly → C, 1: ⟨5⟩

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help → C, 1: ⟨1⟩

ETH → C, 1: ⟨2⟩

Zurich → C, 1: ⟨3⟩

to → C, 3: ⟨4, 7, 11⟩

flexibly → C, 1: ⟨5⟩

react → C, 1: ⟨6⟩
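The positional postings above can be built with a short sketch, assuming whitespace tokenization, case-sensitive terms and positions counted from 1. `build_positional_index` is a hypothetical name.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {document ID: list of positions of the term}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index

docs = {"C": ("Help ETH Zurich to flexibly react to new challenges "
              "and to set new accents in the future")}
index = build_positional_index(docs)

print(index["Help"])          # {'C': [1]}
print(index["to"])            # {'C': [4, 7, 11]}
print(len(index["to"]["C"]))  # 3, the term frequency of "to" in document C
```

The term frequency does not need to be stored separately in this sketch, since it is the length of the position list.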

190

Positional index

Help → C, 1: ⟨1⟩

ETH → C, 1: ⟨2⟩

Zurich → C, 1: ⟨3⟩

to → C, 3: ⟨4, 7, 11⟩

flexibly → C, 1: ⟨5⟩

react → C, 1: ⟨6⟩

ETH Zurich|

191

Positional index

ETH → C, 1: ⟨2⟩

Zurich → C, 1: ⟨3⟩

ETH Zurich|

192

Positional index

ETH → C, 1: ⟨2⟩

Zurich → C, 1: ⟨3⟩

ETH Zurich|

+1 (the position of Zurich is the position of ETH plus one)
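A minimal sketch of answering a phrase query from positional postings: a document matches when each query term occurs at the position of the first term plus its offset in the phrase. The index literal reuses the postings from the slides; `phrase_search` is a hypothetical name.

```python
def phrase_search(index, phrase):
    """Return the IDs of documents in which the phrase occurs contiguously."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return []
    # Candidate documents contain every term of the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            # Term number k of the phrase must sit at position start + k.
            if all(start + offset in index[term][doc_id]
                   for offset, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits

index = {
    "ETH": {"C": [2]},
    "Zurich": {"C": [3]},
    "to": {"C": [4, 7, 11]},
    "flexibly": {"C": [5]},
}
print(phrase_search(index, "ETH Zurich"))          # ['C']
print(phrase_search(index, "Zurich ETH"))          # []
print(phrase_search(index, "Zurich to flexibly"))  # ['C']
```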

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

115

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3

118

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3 10

120

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3 10 12
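The walk through both lists can be sketched as a linear merge, assuming both postings lists are sorted by document ID. The function name `intersect` is an assumption.

```python
def intersect(a, b):
    """Intersect two sorted postings lists by advancing two pointers."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])  # present in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1               # a[i] cannot appear later in b
        else:
            j += 1               # b[j] cannot appear later in a
    return result

list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print(intersect(list_a, list_b))  # [1, 3, 10, 12]
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.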

123

Another example

How could we make this faster?

124

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3

125

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3

10 Magical pointer

126

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10



117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

Drawback

Query: Help ETH Zurich to flexibly react

Rewritten query: Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive: the document contains all five bi-words of the query, but not the phrase itself
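The false positive can be reproduced with a minimal sketch of a bi-word index. The two-document collection and its IDs 1 and 2 are made up for illustration, and the function names are hypothetical; only the query and the second document come from the slides.

```python
from collections import defaultdict


def biwords(text):
    """Return the list of consecutive word pairs in text."""
    words = text.split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def build_biword_index(docs):
    """Map each bi-word to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index


def phrase_candidates(index, query):
    """AND together the postings of every bi-word of the query."""
    postings = [index.get(bw, set()) for bw in biwords(query)]
    return set.intersection(*postings) if postings else set()


# Hypothetical collection: document IDs 1 and 2 are assumptions.
docs = {
    1: "Help ETH Zurich to flexibly react to new challenges",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
query = "Help ETH Zurich to flexibly react"
candidates = phrase_candidates(index, query)
# Naive substring check, good enough for this toy example.
false_positives = {d for d in candidates if query not in docs[d]}
print(sorted(candidates), sorted(false_positives))
```

Both documents survive the intersection, so a bi-word index alone cannot tell them apart; the final check against the document text is what exposes document 2 as a false positive.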

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags (N: word that is kept, X: function word that is skipped):

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern NXN: two N words, with any X words in between, form one bi-word

Help ETH
ETH Zurich
Zurich introduce
introduce techniques
techniques flexibly
flexibly react
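As a rough sketch of the NXN idea, the X words can be dropped before pairing the remaining N words. The tags below are written by hand to mirror the slide; a real system would obtain them from a part-of-speech tagger, and the function name is hypothetical.

```python
def extended_biwords(tagged):
    """Pair each N word with the next N word, ignoring X words in between."""
    kept = [word for word, tag in tagged if tag == "N"]
    return [f"{a} {b}" for a, b in zip(kept, kept[1:])]


# Hand-written tags (an assumption standing in for a real tagger).
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
for bw in extended_biwords(tagged):
    print(bw)
```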

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index:

Help ETH Zurich
ETH Zurich to
Zurich to flexibly
to flexibly react
flexibly react to

Rewritten query: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index:

Help ETH Zurich to
ETH Zurich to flexibly
Zurich to flexibly react
to flexibly react to

Rewritten query: Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

Rewritten query: Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

True negative: the document does not contain "Zurich to flexibly", so it is correctly rejected
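The effect of the phrase length can be checked with a small sketch that tests whether a document contains every k-word term of the query, for k = 2, 3 and 4. The helper names `kgrams` and `matches` are hypothetical; the query and document are the ones from the slides.

```python
def kgrams(text, k):
    """Return all runs of k consecutive words in text."""
    words = text.split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def matches(doc, query, k):
    """True if doc contains every k-word term of the query (the AND above)."""
    doc_terms = set(kgrams(doc, k))
    return all(term in doc_terms for term in kgrams(query, k))


doc = "Help ETH Zurich to introduce techniques to flexibly react"
query = "Help ETH Zurich to flexibly react"
results = {k: matches(doc, query, k) for k in (2, 3, 4)}
print(results)
```

With k = 2 the document is still accepted, while k = 3 and k = 4 reject it because "Zurich to flexibly" never occurs in it.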

Drawback

Size of the vocabulary increases exponentially with the phrase length n: (number of terms)^n
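A rough way to see the growth, assuming a vocabulary of $|V|$ distinct terms (the symbol $|V|$ and the figure of $10^5$ terms are illustrative assumptions, not values from the slides): an index over n-word phrases can hold at most

```latex
\[
  |V|^{n} \ \text{distinct $n$-word terms, e.g. } |V| = 10^{5},\ n = 3
  \ \Rightarrow\ |V|^{n} = 10^{15}.
\]
```

This is only an upper bound, since real text uses a small fraction of all possible word sequences, but it shows why long phrase indices quickly become impractical.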

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (a positional posting holds the document ID as before, the term frequency, and the positions):

Term      Document ID  Term frequency  Positions
Help      C            1               1
ETH       C            1               2
Zurich    C            1               3
to        C            3               4 7 11
flexibly  C            1               5
react     C            1               6
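The postings above can be reproduced with a minimal sketch that counts word positions from 1. The nested-dictionary layout and the function name are assumptions chosen for readability, not the storage format of a real engine.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, with positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_positional_index(docs)
for term in ["Help", "ETH", "Zurich", "to", "flexibly", "react"]:
    positions = index[term]["C"]
    # Prints: term, document ID, term frequency, positions.
    print(term, "C", len(positions), positions)
```

The term frequency is simply the length of the position list, which is why it does not need to be stored separately in this sketch.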

Positional index

Query: ETH Zurich

Term      Document ID  Term frequency  Positions
ETH       C            1               2
Zurich    C            1               3

Match: both terms occur in document C, and the position of Zurich (3) is the position of ETH (2) + 1
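The +1 check can be sketched as follows for a two-word phrase. The index literal copies the postings shown on the slides in an assumed `term -> {document ID: positions}` layout, and `phrase_match` is a hypothetical helper name.

```python
def phrase_match(index, first, second):
    """Return IDs of documents where `second` occurs right after `first`."""
    hits = set()
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    # Only documents containing both terms can match.
    for doc_id in first_postings.keys() & second_postings.keys():
        second_positions = set(second_postings[doc_id])
        if any(p + 1 in second_positions for p in first_postings[doc_id]):
            hits.add(doc_id)
    return hits


# Postings from the slides: term -> {document ID: positions}.
index = {
    "ETH": {"C": [2]},
    "Zurich": {"C": [3]},
    "to": {"C": [4, 7, 11]},
}
print(phrase_match(index, "ETH", "Zurich"))
print(phrase_match(index, "Zurich", "ETH"))
```

Unlike a bi-word index, the same structure also rejects the reversed phrase, because the position test is directional.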

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3 10 12
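The intersection above can be computed with a linear merge of the two sorted lists. This is a minimal sketch; the function name is an assumption, and the lists are the ones from the slide.

```python
def intersect(a, b):
    """Intersect two sorted postings lists by advancing two cursors."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] is too small to match anything left in b
        else:
            j += 1  # b[j] is too small to match anything left in a
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print(intersect(list_a, list_b))
```

Each step advances at least one cursor, so the number of comparisons is at most the sum of the two list lengths.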

Another example

How could we make this faster?

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B so far: 1 3

Magical pointer: jump directly to 10 in List A

Intersection of A and B: 1 3 10

In practice

List: 1 2 3 4 5 6 7 8 9 10 12

Skip pointers lead to: 4 7 10

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

In practice: √p skip pointers, where p is the number of postings

List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
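A minimal sketch of intersection with skip pointers, assuming a skip pointer on every k-th posting with k equal to the integer square root of the list length. Storing skips implicitly as index arithmetic is a simplification made for this example; the function name is hypothetical.

```python
import math


def intersect_with_skips(a, b):
    """Intersect two sorted lists, following skips of about sqrt(length)."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            target = i + skip_a
            # Follow the skip only if it does not jump past b[j].
            if i % skip_a == 0 and target < len(a) and a[target] <= b[j]:
                i = target
            else:
                i += 1
        else:
            target = j + skip_b
            if j % skip_b == 0 and target < len(b) and b[target] <= a[i]:
                j = target
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print(intersect_with_skips(list_a, list_b))
```

A skip is safe because the lists are sorted: every posting jumped over is smaller than the skip target, which is itself no larger than the value being searched for.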

Phrase search

Phrase queries

"ETH Zürich" (quotes = phrase search)

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms
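A short illustration of the limitation, using the two postings lists from the slide: the intersection yields the documents that contain both terms, but nothing in the postings says whether the terms are adjacent.

```python
# Postings lists from the slide: document IDs only, no positions.
eth = [1, 2, 4, 5, 8, 9, 10]
zurich = [1, 3, 4, 6, 7, 8, 11]

# Documents containing both terms, in any order and at any distance.
both = sorted(set(eth) & set(zurich))
print(both)
```

Documents 1, 4 and 8 are only candidates for the phrase; confirming it would require either reading the documents or storing more information in the index.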

Phrase search approaches

Bi-word indices

Positional indices

Bi-word indices

Document: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index:

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Bi-word indices

Query: ETH Zurich

A two-word phrase is a single lookup of the bi-word ETH Zurich in the index:

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Positional posting: document ID (as before), term frequency, positions

Help: document C, term frequency 1, position 1

ETH: document C, term frequency 1, position 2

Zurich: document C, term frequency 1, position 3

to: document C, term frequency 3, positions 4 7 11

flexibly: document C, term frequency 1, position 5

react: document C, term frequency 1, position 6

190

Positional index

ETH Zurich|

ETH: document C, term frequency 1, position 2

Zurich: document C, term frequency 1, position 3

Phrase match: same document, and position of Zurich = position of ETH + 1

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

121

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Intersection of A and B: 1 3 10 12
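
The intersection on this slide can be reproduced with a linear merge over the two sorted lists. The following is a minimal sketch, not the lecture's own code; the function name `intersect` is chosen here for illustration.

```python
def intersect(a, b):
    """Intersect two sorted postings lists with a linear merge."""
    i, j = 0, 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document ID in both lists: keep it, advance both pointers.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # List A is behind, advance it.
        else:
            j += 1  # List B is behind, advance it.
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print(intersect(list_a, list_b))
```

Each pointer only moves forward, so the number of comparisons is at most the sum of the two list lengths.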

123

Another example

How could we make this faster?

126

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12

List B: 1 3 10 11 12 14

Magical pointer: skip ahead in List A to 10

Intersection of A and B so far: 1 3 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

Skip pointers to: 4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Too short skips: many comparisons, waste of space

Too long skips: few comparisons, not many real opportunities to skip

132

In practice

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

In practice: √p skip pointers, where p is the number of postings
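
The skip-pointer idea can be sketched as follows. This is an illustrative sketch under stated assumptions, not the lecture's implementation: the helper names `skip_targets` and `intersect_with_skips` are hypothetical, and skips are placed every ⌊√p⌋ postings, which is one common reading of the √p heuristic.

```python
import math


def skip_targets(postings):
    """Map a position to the position its skip pointer jumps to.

    Assumption: evenly spaced skips of length floor(sqrt(p)).
    """
    p = len(postings)
    step = max(1, math.isqrt(p))
    return {i: i + step for i in range(0, p - step, step)}


def intersect_with_skips(a, b):
    """Linear merge that follows a skip pointer whenever it does not overshoot."""
    skips_a, skips_b = skip_targets(a), skip_targets(b)
    i, j = 0, 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            if i in skips_a and a[skips_a[i]] <= b[j]:
                # Keep skipping while the target is still not past b[j].
                while i in skips_a and a[skips_a[i]] <= b[j]:
                    i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                while j in skips_b and b[skips_b[j]] <= a[i]:
                    j = skips_b[j]
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
list_b = [1, 3, 10, 11, 12, 14]
print([list_a[t] for t in skip_targets(list_a).values()])
print(intersect_with_skips(list_a, list_b))
```

With the 11 postings of List A the skip length is 3, so the skip pointers lead to 4, 7 and 10; after matching 3, the merge jumps from 4 over 7 directly to 10 instead of stepping through every posting.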

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

137

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms

139

Phrase search approaches

Biword indices

Positional indices

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to
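
Building such an index can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and no further preprocessing; `biwords` and `build_biword_index` are names chosen here, and the document ID `1` is hypothetical.

```python
from collections import defaultdict


def biwords(text):
    """Return every pair of consecutive words as one biword term."""
    tokens = text.split()
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index


docs = {
    1: "Help ETH Zurich to flexibly react to new challenges "
       "and to set new accents in the future",
}
index = build_biword_index(docs)
print(biwords(docs[1])[:6])
print(index["ETH Zurich"])
```

The first six biwords printed are the six index entries shown on the slide.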

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react|

Intersect

165

Drawback

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive
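
The false positive can be checked directly. The sketch below assumes whitespace tokenization and treats a phrase query as the AND of its biwords; the function names are illustrative, not from the lecture.

```python
def biwords(text):
    """Return every pair of consecutive words as one biword term."""
    tokens = text.split()
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


def matches_biword_query(phrase, document):
    """True if every biword of the phrase occurs somewhere in the document."""
    document_biwords = set(biwords(document))
    return all(biword in document_biwords for biword in biwords(phrase))


phrase = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(matches_biword_query(phrase, document))  # the biword index reports a match
print(phrase in document)                      # the phrase itself does not occur
```

All five biwords of the query occur in the document, but not next to each other, so the biword index returns a document that does not contain the phrase.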

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

N N N X N N X X X N N

N X* N

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react
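
The extended biwords on this slide can be derived mechanically once the tags are known. The sketch below takes the N/X tags as given input (hand-assigned here, following the slide); it does not perform part-of-speech tagging itself, and the name `extended_biwords` is hypothetical.

```python
def extended_biwords(tokens, tags):
    """Pair each N word with the next N word, ignoring the X words in between."""
    kept = [token for token, tag in zip(tokens, tags) if tag == "N"]
    return [f"{first} {second}" for first, second in zip(kept, kept[1:])]


tokens = "Help ETH Zurich to introduce techniques so as to flexibly react".split()
# Tags assumed as on the slide: X marks "to", "so", "as", "to".
tags = ["N", "N", "N", "X", "N", "N", "X", "X", "X", "N", "N"]

for biword in extended_biwords(tokens, tags):
    print(biword)
```

Because only the two tags N and X are used, dropping the X words and pairing neighbouring N words is equivalent to matching the pattern N X* N.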

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react|

Intersect

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react
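
The effect of longer phrase terms can be tried out with a small generalization of the biword check. This is a sketch assuming whitespace tokenization; `k_words` and `matches_phrase_index` are illustrative names, and `k` is the number of words per index term.

```python
def k_words(text, k):
    """Return every run of k consecutive words as one index term."""
    tokens = text.split()
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def matches_phrase_index(phrase, document, k):
    """True if every k-word term of the phrase occurs in the document."""
    document_terms = set(k_words(document, k))
    return all(term in document_terms for term in k_words(phrase, k))


phrase = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(matches_phrase_index(phrase, document, 2))  # biwords: false positive
print(matches_phrase_index(phrase, document, 3))  # three-word terms: true negative
```

With three-word terms the query needs "Zurich to flexibly", which the document does not contain, so the document is correctly rejected.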

176

Drawback

Size of the vocabulary increases exponentially: (#Terms)^n
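
As a rough worked example of this growth, assume a purely hypothetical vocabulary of 10^5 single-word terms. The bound below is an upper limit on the number of distinct n-word terms; a real collection contains far fewer, since most word combinations never occur.

```latex
\[
  \#\text{entries} \;\le\; (\#\text{Terms})^{n},
  \qquad
  \#\text{Terms} = 10^{5}:\quad
  n = 2 \Rightarrow 10^{10},\quad
  n = 3 \Rightarrow 10^{15},\quad
  n = 4 \Rightarrow 10^{20}.
\]
```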

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Positional posting: document ID (as before), term frequency, positions

Help: document C, term frequency 1, position 1

ETH: document C, term frequency 1, position 2

Zurich: document C, term frequency 1, position 3

to: document C, term frequency 3, positions 4 7 11

flexibly: document C, term frequency 1, position 5

react: document C, term frequency 1, position 6

190

Positional index

ETH Zurich|

ETH: document C, term frequency 1, position 2

Zurich: document C, term frequency 1, position 3

Phrase match: same document, and position of Zurich = position of ETH + 1
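
A positional index and the "+1" check can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and 1-based positions as on the slides; the function names and the document ID `"C"` are assumptions made for the example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {document ID: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Return {document ID: [start positions]} where the words are adjacent."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return {}
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = {}
    for doc_id in sorted(candidates):
        starts = [
            start
            for start in index[terms[0]][doc_id]
            # The k-th following word must sit exactly k positions later.
            if all(start + k in index[term][doc_id]
                   for k, term in enumerate(terms[1:], start=1))
        ]
        if starts:
            hits[doc_id] = starts
    return hits


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_positional_index(docs)
print(index["to"]["C"], len(index["to"]["C"]))  # positions and term frequency
print(phrase_query(index, "ETH Zurich"))
print(phrase_query(index, "Zurich ETH"))
```

The term frequency is simply the length of the position list, so it does not need to be stored separately in this sketch. Unlike the biword index, the vocabulary stays the same size; the postings grow instead.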

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists



124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

165

Drawback

Query: "Help ETH Zurich to flexibly react"

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
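
The false positive can be reproduced with a small sketch. The tokenizer (lowercase, split on whitespace), the function names and the document IDs 1 and 2 are assumptions made for illustration; they do not come from the slides.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def biwords(tokens):
    """Return the consecutive word pairs of a token list."""
    return list(zip(tokens, tokens[1:]))


def build_biword_index(docs):
    """Map each biword to the set of IDs of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for pair in biwords(tokenize(text)):
            index.setdefault(pair, set()).add(doc_id)
    return index


def biword_phrase_query(index, phrase):
    """AND together the postings of every biword of the phrase."""
    pairs = biwords(tokenize(phrase))
    if not pairs:
        return set()
    result = set(index.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= index.get(pair, set())
    return result


# Hypothetical two-document collection built from the slide sentences.
docs = {
    1: "Help ETH Zurich to flexibly react to new challenges",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_biword_index(docs)
matches = biword_phrase_query(index, "Help ETH Zurich to flexibly react")
print(sorted(matches))  # [1, 2]: document 2 matches without containing the phrase
```

Document 2 contains every biword of the query, just not next to each other, so the intersection cannot tell it apart from a real match.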

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

N X* N

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react
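
A minimal sketch of how these extended biwords could be derived once the words are tagged. The tags are assigned by hand here (no real part-of-speech tagger is used), and the function name is an assumption made for illustration.

```python
def extended_biwords(tagged_tokens):
    """Pair each N word with the next N word, ignoring the X words between."""
    content_words = [word for word, tag in tagged_tokens if tag == "N"]
    return list(zip(content_words, content_words[1:]))


# Hand-tagged sentence from the slide: N marks words to keep, X words to skip.
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]

for first, second in extended_biwords(tagged):
    print(first, second)
```

Because there are only two tags, dropping the X words and pairing neighbouring N words gives the same result as matching the pattern N X* N.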

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative
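
The effect of the term length can be checked with a short sketch. The parameter `k` (number of words per index term), the naive whitespace tokenizer and the function names are assumptions made for illustration.

```python
def k_word_terms(tokens, k):
    """Return every run of k consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def matches_phrase(doc_text, phrase, k):
    """True if every k-word term of the phrase also occurs in the document."""
    doc_terms = set(k_word_terms(doc_text.lower().split(), k))
    query_terms = k_word_terms(phrase.lower().split(), k)
    return bool(query_terms) and all(term in doc_terms for term in query_terms)


doc = "Help ETH Zurich to introduce techniques to flexibly react"
phrase = "Help ETH Zurich to flexibly react"

print(matches_phrase(doc, phrase, 2))  # True: false positive with biwords
print(matches_phrase(doc, phrase, 3))  # False: "Zurich to flexibly" is missing
```

Longer terms only reduce false positives. A document that repeats the right three-word runs in different places would still slip through.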

178

Drawback

Size of the vocabulary increases exponentially: (number of terms)^n
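
A rough upper bound makes the growth concrete. The symbols $V$ (single-word vocabulary) and $V_n$ (vocabulary of $n$-word terms), as well as the figure of $10^5$ terms, are introduced here for illustration only.

```latex
|V_n| \le |V|^n, \qquad \text{e.g. } |V| = 10^5,\ n = 2 \;\Rightarrow\; |V_2| \le 10^{10}.
```

The bound is loose: a collection cannot contain more distinct $n$-word terms than it has word positions. The count still grows quickly with $n$.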

189

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (term: document ID, term frequency, positions)

Help: C, 1, [1]

ETH: C, 1, [2]

Zurich: C, 1, [3]

to: C, 3, [4, 7, 11]

flexibly: C, 1, [5]

react: C, 1, [6]

Positional posting: document ID (as before), term frequency, positions
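
A minimal sketch of how such an index could be built, assuming whitespace tokenization and positions counted from 1 as on the slide. The nested-dictionary layout and the function name are illustrative choices, not the course's implementation.

```python
def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, with positions starting at 1."""
    index = {}
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_positional_index(docs)

positions = index["to"]["C"]
# The term frequency is simply the number of stored positions.
print(len(positions), positions)  # prints: 3 [4, 7, 11]
```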

192

Positional index

Query: "ETH Zurich"

ETH: C, 1, [2]

Zurich: C, 1, [3]

+1: Zurich occurs at the position of ETH plus one, so the phrase matches in document C
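
The "+1" check extends to longer phrases: the i-th word of the phrase must occur i positions after the first. The sketch below assumes the index is stored as term -> {document ID: positions}, a layout chosen for illustration; the function name is hypothetical.

```python
def phrase_query(index, phrase):
    """Return IDs of documents in which the phrase terms occur consecutively."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return set()
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    result = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            # The i-th term must appear exactly i positions after the first.
            if all(start + i in index[term][doc_id]
                   for i, term in enumerate(terms)):
                result.add(doc_id)
                break
    return result


# Part of the positional index from the slides (document C only).
index = {
    "ETH": {"C": [2]},
    "Zurich": {"C": [3]},
    "to": {"C": [4, 7, 11]},
    "flexibly": {"C": [5]},
}

print(phrase_query(index, "ETH Zurich"))   # {'C'}
print(phrase_query(index, "Zurich ETH"))   # set()
```

Unlike the biword approach, this check looks at actual positions, so it produces no false positives.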

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

126

Another example

List A: 1 2 3 4 5 6 7 8 9 10 12 14

List B: 1 3 10 11 12

Magical pointer

Intersection of A and B: 1 3 10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Too short skips: many comparisons, waste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Too long skips: few comparisons, not many real opportunities to skip

132

In practice

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

In practice: √p skips, where p is the number of postings
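
A minimal sketch of intersection with skips, assuming skip pointers are placed every ⌊√p⌋ postings. The pointers are simulated with index arithmetic on Python lists rather than stored explicitly; the function name and the two example lists are illustrative.

```python
import math


def intersect_with_skips(a, b):
    """Intersect two sorted postings lists, skipping about sqrt(len) entries."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Skip only from a skip position, and only if the target
            # does not overshoot the current posting of the other list.
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result


list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14]
list_b = [1, 3, 10, 11, 12]
print(intersect_with_skips(list_a, list_b))  # [1, 3, 10, 12]
```

With these lists, list A is traversed from 4 to 7 to 10 in two skips instead of six single steps.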

133

Phrase search

134

Phrase queries

136

With a posting list

"ETH Zürich"

Quotes = phrase search

138

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms
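
A short sketch of what the plain postings lists can and cannot answer. Which of the two lists belongs to which term is read off the slide layout and is an assumption; the function name is illustrative.

```python
def intersect(a, b):
    """Merge-intersect two sorted lists of document IDs."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


eth = [1, 2, 4, 5, 8, 9, 10]
zurich = [1, 3, 4, 6, 7, 8, 11]
print(intersect(eth, zurich))  # [1, 4, 8]
```

Documents 1, 4 and 8 contain both terms, but nothing in the lists says whether the two words are adjacent in any of them.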

141

Phrase search approaches

Biword indices

Positional indices

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices

Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to
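The index construction shown on these slides can be sketched as follows. This is an illustration under simple assumptions (whitespace tokenization, no normalization); the helper names `biwords` and `build_biword_index` are invented for the example.

```python
from collections import defaultdict


def biwords(text):
    """Return every pair of consecutive tokens as a single biword term."""
    tokens = text.split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in biwords(text):
            index[term].add(doc_id)
    return index


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_biword_index(docs)
print(biwords(docs["C"])[:6])
print(index["ETH Zurich"])
```

The first six biwords printed are the ones listed on the slide, from "Help ETH" to "react to".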

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

157

Drawback

158

Drawback

Help ETH Zurich to flexibly react|

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

159

Drawback

Help ETH Zurich to flexibly react|

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Help ETH Zurich to introduce techniques to flexibly react

165

Drawback

Help ETH Zurich to flexibly react|

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Help ETH Zurich to introduce techniques to flexibly react

False positive
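The false positive can be reproduced with a small sketch. The two-document collection, the document IDs 1 and 2, and the helper names are assumptions made for illustration; the query is split into its biwords and their postings are intersected, as on the slides.

```python
def biwords(text):
    """Return every pair of consecutive tokens as a single biword term."""
    tokens = text.split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the set of IDs of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in biwords(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def phrase_query(index, phrase):
    """AND together the postings of all biwords of the phrase."""
    postings = [index.get(term, set()) for term in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "Help ETH Zurich to flexibly react to new challenges",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
query = "Help ETH Zurich to flexibly react"
hits = phrase_query(build_biword_index(docs), query)
print(hits)                                   # {1, 2}
print([d for d in hits if query not in docs[d]])  # [2]
```

Document 2 contains every biword of the query, yet never contains the query as a contiguous phrase, so the biword index reports it incorrectly.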

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

N    N   N      X  N         N          X  X  X  N        N

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

N    N   N      X  N         N          X  X  X  N        N

NX*N

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

N    N   N      X  N         N          X  X  X  N        N

NX*N

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react
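The extended biwords on this slide can be derived mechanically once each token carries a tag. The sketch below assumes the tags are given by hand (N for content words, X for function words, as on the slide); a real system would obtain them from a part-of-speech tagger. The function name `extended_biwords` is invented for this example.

```python
def extended_biwords(tagged_tokens):
    """Pair each N token with the next N token, skipping any X tokens.

    With only the two tags N and X, this yields exactly the token
    sequences of the form N X* N.
    """
    result = []
    previous_n = None
    for token, tag in tagged_tokens:
        if tag == "N":
            if previous_n is not None:
                result.append(f"{previous_n} {token}")
            previous_n = token
    return result


sentence = "Help ETH Zurich to introduce techniques so as to flexibly react"
tags = "N N N X N N X X X N N"
tagged = list(zip(sentence.split(), tags.split()))
for term in extended_biwords(tagged):
    print(term)
```

The function words "to", "so" and "as" never appear in the resulting terms, which is why "Zurich introduce" and "techniques flexibly" become index entries.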

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

"Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"
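A short sketch of why longer phrase terms help here. The helpers `k_words` and `matches` are hypothetical names, and the check is done against a single document rather than a real index: the same document that matched with 2-word terms is rejected with 3-word terms.

```python
def k_words(text, k):
    """Return every run of k consecutive tokens as one phrase term."""
    tokens = text.split()
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def matches(doc, phrase, k):
    """True if every k-word term of the phrase also occurs in the document."""
    doc_terms = set(k_words(doc, k))
    return all(term in doc_terms for term in k_words(phrase, k))


doc = "Help ETH Zurich to introduce techniques to flexibly react"
query = "Help ETH Zurich to flexibly react"
print(matches(doc, query, 2))  # True: false positive with biwords
print(matches(doc, query, 3))  # False: "Zurich to flexibly" is missing
```

Longer terms only reduce false positives; a query longer than k can still match a document that contains all its k-word terms in separate places.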

176

Drawback

177

Drawback

Size of the vocabulary increases

178

Drawback

Size of the vocabulary increases

exponentially

(number of terms)^n
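As a rough worked bound (a sketch: the symbols and the figure of 10^5 terms are chosen purely for illustration and do not come from the slides), if the single-word vocabulary V has |V| terms, then the vocabulary V_n of an index over n-word phrases can hold at most |V|^n distinct entries:

```latex
\[
  |V_n| \le |V|^n,
  \qquad \text{e.g. } |V| = 10^5,\ n = 2 \ \Rightarrow\ |V_2| \le 10^{10}
\]
```

Only phrases that actually occur in the collection are indexed, so the real size stays well below this bound, but it still grows quickly with n.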

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

184

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help → (C, 1, [1])

Positional posting: (document ID as before, term frequency, [positions])

189

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help → (C, 1, [1])

ETH → (C, 1, [2])

Zurich → (C, 1, [3])

to → (C, 3, [4, 7, 11])

flexibly → (C, 1, [5])

react → (C, 1, [6])

190

Positional index

Help → (C, 1, [1])

ETH → (C, 1, [2])

Zurich → (C, 1, [3])

to → (C, 3, [4, 7, 11])

flexibly → (C, 1, [5])

react → (C, 1, [6])

ETH Zurich|

192

Positional index

ETH → (C, 1, [2])

Zurich → (C, 1, [3])

ETH Zurich|

+1: position of Zurich (3) = position of ETH (2) + 1
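The "+1" check can be sketched in code. This is an illustration under simple assumptions (whitespace tokens, 1-based positions, term frequency taken as the length of the position list); the second document with ID "D" and the function names are invented for the example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {document ID: [positions]}, positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_query(index, phrase):
    """Return documents where the terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return set()
    # Candidates are the documents that contain every term of the phrase.
    candidates = set.intersection(*(set(index[term]) for term in terms))
    hits = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            # The k-th term must sit exactly k positions after the first.
            if all(start + k in index[term][doc_id]
                   for k, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "Help ETH Zurich to introduce techniques to flexibly react",
}
index = build_positional_index(docs)
print(index["to"]["C"])                                      # [4, 7, 11]
print(phrase_query(index, "ETH Zurich"))
print(phrase_query(index, "Help ETH Zurich to flexibly react"))  # {'C'}
```

Unlike the biword index, the positional index rejects document "D" for the long query, because "flexibly" is at position 8 there rather than directly after the first "to".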

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Positional posting: document ID (as before), term frequency, positions

Help: C, 1, [1]
ETH: C, 1, [2]
Zurich: C, 1, [3]
to: C, 3, [4, 7, 11]
flexibly: C, 1, [5]
react: C, 1, [6]

190

Positional index

Query: "ETH Zurich"

Postings (document ID, term frequency, positions)

ETH: C, 1, [2]
Zurich: C, 1, [3]

Match: Zurich occurs at the position of ETH + 1 (3 = 2 + 1)
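The lookup above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lecture's implementation: the function names are hypothetical, tokens are split on whitespace without any normalization, and positions are counted from 1 as on the slides.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, with positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Return the documents in which the phrase terms occur consecutively."""
    terms = phrase.split()
    if not terms or any(t not in index for t in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # The k-th term of the phrase must sit at position start + k.
            if all(start + k in index[t].get(doc_id, [])
                   for k, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_positional_index(docs)
to_positions = index["to"]["C"]              # positions of "to" in C
result = phrase_query(index, "ETH Zurich")   # ETH at 2, Zurich at 2 + 1
reversed_result = phrase_query(index, "Zurich ETH")
print(to_positions, result, reversed_result)
```

The term frequency is not stored separately in this sketch; it is the length of each position list.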

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

132

In practice

[Figure: labels "p" and "Number of postings"; axis ticks 1 to 16]

133

Phrase search

134

Phrase queries

136

With a posting list

Query: "ETH Zürich" (quotes = phrase search)

ETH: 1 2 4 5 8 9 10
Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms
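To make the limitation concrete, here is a minimal sketch of intersecting the two postings lists shown on the slide. The function name `intersect` is an assumption; the merge assumes both lists are sorted by document ID.

```python
def intersect(p1, p2):
    """Merge-intersect two postings lists sorted by document ID."""
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


eth = [1, 2, 4, 5, 8, 9, 10]
zurich = [1, 3, 4, 6, 7, 8, 11]
candidates = intersect(eth, zurich)
print(candidates)  # [1, 4, 8]
```

Documents 1, 4 and 8 contain both terms, but nothing in these lists says whether "ETH" is immediately followed by "Zurich" in any of them.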

139

Phrase search approaches

Biword indices
Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

151

Bi-word indices

Query: "ETH Zurich"

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

153

Bi-word indices

Query: "Help ETH Zurich to flexibly react"

Index

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

157

Drawback

Query: "Help ETH Zurich to flexibly react"

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive
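The false positive can be reproduced with a small biword index. The sketch below is illustrative only: the function names and the document IDs 1 and 2 are assumptions, and text is split on whitespace with no further preprocessing.

```python
from collections import defaultdict


def biwords(text):
    """Return the consecutive word pairs of a whitespace-tokenized text."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def build_biword_index(docs):
    """Map each biword to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index


def phrase_query(index, phrase):
    """Intersect the postings of every biword of the phrase."""
    postings = [index.get(bw, set()) for bw in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "Help ETH Zurich to flexibly react to new challenges "
       "and to set new accents in the future",
    2: "Help ETH Zurich to introduce techniques to flexibly react",
}
phrase = "Help ETH Zurich to flexibly react"
index = build_biword_index(docs)
hits = phrase_query(index, phrase)
# Check each hit against the raw text to separate out false positives.
false_positives = {d for d in hits if phrase not in docs[d]}
print(hits, false_positives)
```

Document 2 contains all five biwords of the query, so the intersection returns it, yet the phrase itself never occurs in it.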

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: NXN

Resulting biwords

Help ETH
ETH Zurich
Zurich introduce
introduce techniques
techniques flexibly
flexibly react
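A minimal sketch of how these biwords can be derived from the tagged sentence follows. The tags are taken from the slide; reading N as "kept" and X as "skipped", and the function name `extended_biwords`, are assumptions made for illustration. No real part-of-speech tagger is involved: the tags are supplied by hand.

```python
def extended_biwords(tagged):
    """Pair each N word with the next N word, skipping any X words between."""
    content = [word for word, tag in tagged if tag == "N"]
    return [f"{a} {b}" for a, b in zip(content, content[1:])]


tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
pairs = extended_biwords(tagged)
print(pairs)
```

Because the X words are dropped before pairing, "Zurich to introduce" and "techniques so as to flexibly" each collapse to a single biword.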

136

With a posting list

"ETH Zürich"

Quotes = phrase search

138

With a posting list

ETH: 1 2 4 5 8 9 10

Zurich: 1 3 4 6 7 8 11

We have no information on the proximity of the terms
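The limitation can be illustrated with a minimal Python sketch. It assumes the first postings list on the slide belongs to ETH and the second to Zurich; the function name `intersect` is illustrative. Intersecting the two lists yields the documents that contain both terms, but nothing in the result says whether the terms are adjacent.

```python
def intersect(p1, p2):
    """Merge-intersect two sorted postings lists of document IDs."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


# Postings lists from the slide (assignment of lists to terms is assumed).
eth = [1, 2, 4, 5, 8, 9, 10]
zurich = [1, 3, 4, 6, 7, 8, 11]

# Documents containing both terms; adjacency cannot be checked from doc IDs alone.
candidates = intersect(eth, zurich)
print(candidates)
```

Each candidate would still have to be scanned to confirm that the phrase occurs, which is what the approaches on the following slides avoid.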

141

Phrase search approaches

Biword indices

Positional indices

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to
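A minimal sketch of how such an index could be built, assuming whitespace tokenization and a single document with the hypothetical ID 1. The helper names `biwords` and `build_biword_index` are illustrative, not from the lecture.

```python
from collections import defaultdict


def biwords(text):
    """Return every pair of consecutive tokens as a single 'biword' term."""
    tokens = text.split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index


# Hypothetical document ID 1; the sentence is the one used on the slide.
docs = {
    1: "Help ETH Zurich to flexibly react to new challenges "
       "and to set new accents in the future",
}
index = build_biword_index(docs)
print(biwords(docs[1])[:6])
```

The first six biwords printed are the ones listed on the slide; the remaining ten are indexed the same way.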

151

Bi-word indices

Query: ETH Zurich

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

156

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect
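The decomposition of a longer phrase query into a conjunction of biwords can be sketched as follows. This is a minimal sketch under stated assumptions: the two documents and their IDs are hypothetical, tokenization is plain whitespace splitting, and `phrase_query` is an illustrative name.

```python
from collections import defaultdict


def biwords(text):
    """Return every pair of consecutive tokens as a single 'biword' term."""
    tokens = text.split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def phrase_query(index, phrase):
    """AND together the postings of all biwords in the phrase."""
    postings = [index.get(bw, set()) for bw in biwords(phrase)]
    if not postings:  # phrase with fewer than two words
        return set()
    return set.intersection(*postings)


# Two hypothetical documents, used only for illustration.
docs = {
    1: "Help ETH Zurich to flexibly react to new challenges",
    2: "ETH Zurich is in Switzerland",
}
index = defaultdict(set)
for doc_id, text in docs.items():
    for bw in biwords(text):
        index[bw].add(doc_id)

print(phrase_query(index, "Help ETH Zurich to flexibly react"))
print(phrase_query(index, "ETH Zurich"))
```

A two-word query is a single lookup; longer queries intersect one postings list per biword.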

165

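Drawback

Query: Help ETH Zurich to flexibly react

"Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive: the document contains every biword of the query, but not the phrase itself.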

171

Part-of-speech tagging

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

N X* N

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react
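The extended biwords above can be reproduced with a short sketch. It does not use a real part-of-speech tagger: as an assumption made for this example only, the words "to", "so" and "as" are tagged X and every other word is tagged N, which matches the tags shown on the slide. Pairing consecutive N words while skipping the X words between them implements the N X* N pattern.

```python
# Hypothetical stand-in for a part-of-speech tagger: these words get tag X,
# everything else gets tag N (an assumption for this example only).
FUNCTION_WORDS = {"to", "so", "as"}


def tag(tokens):
    """Assign 'X' to function words and 'N' to all other tokens."""
    return ["X" if t.lower() in FUNCTION_WORDS else "N" for t in tokens]


def extended_biwords(text):
    """Pair each N token with the next N token, skipping any X tokens between."""
    tokens = text.split()
    content = [t for t, g in zip(tokens, tag(tokens)) if g == "N"]
    return [f"{a} {b}" for a, b in zip(content, content[1:])]


sentence = "Help ETH Zurich to introduce techniques so as to flexibly react"
print(" ".join(tag(sentence.split())))
print(extended_biwords(sentence))
```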

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

"Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative: the document does not contain "Zurich to flexibly".
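The effect of the phrase length can be checked with a small sketch that compares the query and the document from the slide for 2-word and 3-word phrases. The function names `kgrams` and `matches` are illustrative, and whitespace tokenization is assumed.

```python
def kgrams(text, k):
    """Return all sequences of k consecutive tokens, joined by spaces."""
    tokens = text.split()
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def matches(query, document, k):
    """True if every k-word phrase of the query also occurs in the document.

    Note: a query shorter than k words has no k-grams and matches trivially.
    """
    doc_grams = set(kgrams(document, k))
    return all(g in doc_grams for g in kgrams(query, k))


query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(matches(query, document, 2))  # biwords: all five are in the document
print(matches(query, document, 3))  # 3-word phrases: "Zurich to flexibly" is missing
print(query in document)            # ground truth: the phrase does not occur
```

With biwords the document is returned although the phrase does not occur; with 3-word phrases it is correctly rejected.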

178

Drawback

The size of the vocabulary increases exponentially with the phrase length n:

(number of terms)^n
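As a rough illustration, assume a hypothetical vocabulary of 10^5 distinct terms (a figure chosen for this example, not taken from the lecture). The number of possible phrases of n terms is then bounded as follows:

```latex
\[
|V| = 10^{5}:\qquad |V|^{2} = 10^{10},\qquad |V|^{3} = 10^{15},\qquad |V|^{4} = 10^{20}.
\]
```

Only a fraction of these phrases occurs in a real collection, so this is an upper bound on the vocabulary size rather than an estimate.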

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help: (C, 1, [1])

Positional posting = (document ID (as before), term frequency, positions)

189

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help: (C, 1, [1])

ETH: (C, 1, [2])

Zurich: (C, 1, [3])

to: (C, 3, [4, 7, 11])

flexibly: (C, 1, [5])

react: (C, 1, [6])
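A minimal sketch of how these positional postings could be computed, assuming whitespace tokenization and positions counted from 1 as on the slide. The function name `build_positional_index` and the nested-dictionary layout are choices made for this example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {document ID: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_positional_index(docs)

# Print term, document ID, term frequency and positions.
for term in ["Help", "ETH", "Zurich", "to", "flexibly", "react"]:
    positions = index[term]["C"]
    print(term, "C", len(positions), positions)
```

The term frequency is not stored separately here; it is the length of the position list.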

190

Positional index

Help: (C, 1, [1])

ETH: (C, 1, [2])

Zurich: (C, 1, [3])

to: (C, 3, [4, 7, 11])

flexibly: (C, 1, [5])

react: (C, 1, [6])

Query: ETH Zurich

192

Positional index

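ETH: (C, 1, [2])

Zurich: (C, 1, [3])

Query: ETH Zurich

Match: both terms occur in document C, and the position of Zurich (3) is the position of ETH (2) + 1.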
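The "+1" check can be sketched in Python as follows. This is a minimal sketch, not the lecture's algorithm: the index literal repeats the postings shown on the earlier slides, the nested-dictionary layout and the name `phrase_docs` are assumptions, and positions are tested with simple list membership rather than a merge of sorted lists.

```python
def phrase_docs(index, phrase):
    """Return the documents in which the terms of the phrase occur consecutively."""
    terms = phrase.split()
    if not terms or any(t not in index for t in terms):
        return set()
    # Step 1: documents containing every term (as in plain Boolean retrieval).
    common = set.intersection(*(set(index[t]) for t in terms))
    # Step 2: keep documents where term k sits at position start + k.
    result = set()
    for doc in common:
        for start in index[terms[0]][doc]:
            if all(start + k in index[t][doc] for k, t in enumerate(terms)):
                result.add(doc)
                break
    return result


# Positional postings from the slides: term -> {document ID: positions}.
index = {
    "Help": {"C": [1]},
    "ETH": {"C": [2]},
    "Zurich": {"C": [3]},
    "to": {"C": [4, 7, 11]},
    "flexibly": {"C": [5]},
    "react": {"C": [6]},
}

print(phrase_docs(index, "ETH Zurich"))
print(phrase_docs(index, "Zurich ETH"))
```

Unlike the biword approach, the same index answers phrase queries of any length without enlarging the vocabulary.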

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices

Positional indices

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

Query: ETH Zurich

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect
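The bi-word lookup above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: whitespace tokenization, no linguistic preprocessing, postings held as in-memory sets, and a second document "D" that is made up for the example. The function names are hypothetical.

```python
from collections import defaultdict

def biwords(text):
    """Return the list of consecutive word pairs in text."""
    words = text.split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def build_biword_index(docs):
    """Map each bi-word to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index

def phrase_candidates(index, query):
    """Intersect the postings of all bi-words of the query."""
    postings = [index.get(bw, set()) for bw in biwords(query)]
    return set.intersection(*postings) if postings else set()

docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
    "D": "ETH Zurich is in Switzerland",  # hypothetical second document
}
index = build_biword_index(docs)
print(phrase_candidates(index, "Help ETH Zurich to flexibly react"))
print(phrase_candidates(index, "ETH Zurich"))
```

The first query is rewritten as the AND of its five bi-words, so only document C survives the intersection; the two-word query is a single bi-word lookup.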

165

Drawback

Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

The document contains every bi-word of the query, but not the phrase itself.

False positive
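The false positive can be reproduced with a small check. This is a sketch assuming whitespace tokenization; `biword_match` and `contains_phrase` are hypothetical helper names, not part of the lecture.

```python
def biwords(text):
    """Return the set of consecutive word pairs in text."""
    words = text.split()
    return {f"{a} {b}" for a, b in zip(words, words[1:])}

def biword_match(query, document):
    """True if every bi-word of the query occurs somewhere in the document."""
    return biwords(query) <= biwords(document)

def contains_phrase(query, document):
    """True if the query words occur consecutively in the document."""
    q, d = query.split(), document.split()
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(biword_match(query, document))     # the bi-word index reports a match
print(contains_phrase(query, document))  # the phrase is not actually there
```

The bi-words "Zurich to" and "to flexibly" both occur in the document, but at different places, which is why the intersection alone cannot rule it out.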

171

Part-of-speech tagging

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Extended bi-words: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
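Given the tags, the N X* N pattern can be applied mechanically. The sketch below assumes the tags are supplied by hand exactly as on the slide (no real part-of-speech tagger is called) and that only the two tags N and X occur; `extended_biwords` is a hypothetical name.

```python
def extended_biwords(tagged):
    """Pair each N word with the next N word, skipping any X words in between.

    With only the tags N and X, matching N X* N is the same as taking
    consecutive pairs in the sequence of N words.
    """
    content = [word for word, tag in tagged if tag == "N"]
    return [f"{a} {b}" for a, b in zip(content, content[1:])]

tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
for bw in extended_biwords(tagged):
    print(bw)
```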

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react
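The effect of longer index phrases can be checked with a generic k-word version of the same test. A minimal sketch, assuming whitespace tokenization; `kgrams` and `kgram_match` are hypothetical names.

```python
def kgrams(text, k):
    """Return the set of all runs of k consecutive words in text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def kgram_match(query, document, k):
    """True if every k-word phrase of the query occurs in the document."""
    return kgrams(query, k) <= kgrams(document, k)

query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(kgram_match(query, document, 2))  # bi-words: the document still matches
print(kgram_match(query, document, 3))  # 3-word phrases: "Zurich to flexibly" is missing
```

With k = 3 the document is correctly rejected, because "Zurich to flexibly" never occurs in it.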

176

Drawback

Size of the vocabulary increases exponentially

$(\text{number of terms})^n$ possible phrases of $n$ words
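To get a feeling for the growth, the bound can be evaluated for a few phrase lengths. The vocabulary size of 10,000 terms below is an arbitrary assumption for illustration, and the figure is an upper bound: real collections contain far fewer distinct phrases than all possible combinations.

```python
def max_phrase_vocabulary(num_terms, n):
    """Upper bound on the number of distinct n-word phrases over num_terms terms."""
    return num_terms ** n

# Hypothetical vocabulary of 10,000 terms.
for n in (1, 2, 3, 4):
    print(n, max_phrase_vocabulary(10_000, n))
```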

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (each positional posting: document ID as before, term frequency, positions)

Term      Doc ID  Frequency  Positions
Help      C       1          1
ETH       C       1          2
Zurich    C       1          3
to        C       3          4 7 11
flexibly  C       1          5
react     C       1          6
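A positional index for this single document can be built as follows. This is a sketch under simple assumptions: whitespace tokenization, positions counted from 1 as on the slide, and a nested dictionary as the data structure; `build_positional_index` is a hypothetical name.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index

docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_positional_index(docs)

# Print term, document ID, term frequency, then the positions.
for term in ("Help", "ETH", "Zurich", "to", "flexibly", "react"):
    for doc_id, positions in index[term].items():
        print(term, doc_id, len(positions), *positions)
```

The term frequency does not need to be stored separately here, since it is the length of the position list.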

190

Positional index

Query: ETH Zurich

ETH     C  1  2
Zurich  C  1  3

Same document C, and position of Zurich = position of ETH + 1: the phrase matches.
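The "+1" check generalizes to longer phrases: the i-th query word must occur i positions after the first one, in the same document. The sketch below assumes an index shaped as term -> {document ID: positions}, with the postings copied from the slide; `phrase_query` is a hypothetical name and the nested loops are kept simple rather than efficient.

```python
def phrase_query(index, phrase):
    """Return IDs of documents in which the phrase words occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(t not in index for t in terms):
        return set()
    # Candidate documents contain every term of the phrase.
    candidates = set.intersection(*(set(index[t]) for t in terms))
    hits = set()
    for doc_id in candidates:
        # A start position p matches if term i occurs at position p + i.
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
               for p in index[terms[0]][doc_id]):
            hits.add(doc_id)
    return hits

index = {  # postings copied from the slide: term -> {doc ID: positions}
    "Help": {"C": [1]}, "ETH": {"C": [2]}, "Zurich": {"C": [3]},
    "to": {"C": [4, 7, 11]}, "flexibly": {"C": [5]}, "react": {"C": [6]},
}
print(phrase_query(index, "ETH Zurich"))
print(phrase_query(index, "Zurich ETH"))
```

Unlike the bi-word index, this check cannot produce the false positive seen earlier, since positions are compared explicitly.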

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Query: ETH Zurich

Term      Document ID  Term frequency  Positions
ETH       C            1               2
Zurich    C            1               3

Match: in document C, "Zurich" occurs at the position of "ETH" + 1 (3 = 2 + 1).
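The "+1" check on this slide extends to longer phrases: the k-th word of the query must occur k positions after the first word. The sketch below is one way to write it, assuming an index of the form term -> {document ID: positions}; the function name `phrase_query` and the small index literal are assumptions for illustration.

```python
def phrase_query(index, phrase):
    """Return the IDs of documents in which the words of `phrase` are consecutive."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return []
    # Candidate documents must contain every query term.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        # Try every occurrence of the first term as the start of the phrase.
        for start in index[terms[0]][doc_id]:
            if all(start + k in index[term][doc_id] for k, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits

# Positional postings for document C, copied from the slide.
index = {
    "Help": {"C": [1]},
    "ETH": {"C": [2]},
    "Zurich": {"C": [3]},
    "to": {"C": [4, 7, 11]},
    "flexibly": {"C": [5]},
    "react": {"C": [6]},
}

print(phrase_query(index, "ETH Zurich"))
print(phrase_query(index, "Zurich ETH"))
```

Unlike a bi-word index, this needs no extra vocabulary: any phrase length is answered from the same single-term postings.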

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to
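The bi-word index built up on these slides can be sketched in a few lines. This is a minimal sketch, assuming whitespace tokenization and case-sensitive terms; the names `biwords` and `build_biword_index` are hypothetical.

```python
from collections import defaultdict

def biwords(text):
    """Return the consecutive word pairs (bi-words) of a text, in order."""
    words = text.split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def build_biword_index(docs):
    """Map each bi-word to the sorted list of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return {biword: sorted(ids) for biword, ids in index.items()}

docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_biword_index(docs)

# The first six bi-words are the ones listed on the slide.
first_six = biwords(docs["C"])[:6]
print(first_six)
```

A two-word query such as "ETH Zurich" is then a single dictionary lookup, `index["ETH Zurich"]`.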

151

Bi-word indices

Query: ETH Zurich

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

156

Bi-word indices

Query: Help ETH Zurich to flexibly react

Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Intersect

165

Inconvenient

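Query: Help ETH Zurich to flexibly react

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive: every bi-word of the query occurs in the document, but the phrase itself does not.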
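The false positive can be reproduced directly. The sketch below compares the bi-word test (every bi-word of the query occurs in the document) with a ground-truth check for the contiguous phrase. Whitespace tokenization and the function names are assumptions for illustration.

```python
def biwords(text):
    """Return the consecutive word pairs (bi-words) of a text, in order."""
    words = text.split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def biword_match(query, document):
    """True if every bi-word of the query occurs somewhere in the document."""
    doc_biwords = set(biwords(document))
    return all(biword in doc_biwords for biword in biwords(query))

def contains_phrase(query, document):
    """Ground truth: does the query occur as a contiguous word sequence?"""
    q, d = query.split(), document.split()
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(biword_match(query, document))     # the bi-word index reports a match
print(contains_phrase(query, document))  # but the phrase is not in the document
```

The bi-words "Zurich to" and "to flexibly" both occur in the document, but at different places, which is why the intersection is not empty.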

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Tags: N N N X N N X X X N N

NXN: two N words with only X words between them are paired into one extended bi-word.

Help ETH
ETH Zurich
Zurich introduce
introduce techniques
techniques flexibly
flexibly react
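The pairing on this slide can be sketched as follows, assuming the tags are already given. The tag assignment below (X for "to", "so", "as", N for everything else) is read off the slide by hand, not produced by a real tagger, and the function name `extended_biwords` is hypothetical.

```python
def extended_biwords(words, tags):
    """Pair each N word with the next N word, skipping any X words in between."""
    n_words = [word for word, tag in zip(words, tags) if tag == "N"]
    return [f"{a} {b}" for a, b in zip(n_words, n_words[1:])]

words = "Help ETH Zurich to introduce techniques so as to flexibly react".split()
# Hand-assigned tags (an assumption): X marks the words to be skipped.
tags = "N N N X N N X X X N N".split()

pairs = extended_biwords(words, tags)
for pair in pairs:
    print(pair)
```

Dropping the X words before pairing means "techniques so as to flexibly" and "techniques to flexibly" yield the same term, "techniques flexibly".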

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich
ETH Zurich to
Zurich to flexibly
to flexibly react
flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Help ETH Zurich to
ETH Zurich to flexibly
Zurich to flexibly react
to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Query: Help ETH Zurich to flexibly react

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

True negative: the document does not contain "Zurich to flexibly", so the intersection is empty.
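The effect of longer index phrases can be checked on the same query and document. The sketch below generalizes the bi-word test to phrases of k words. Whitespace tokenization and the names `kwords` and `kword_match` are assumptions for illustration.

```python
def kwords(text, k):
    """Return all sequences of k consecutive words of a text, in order."""
    words = text.split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

def kword_match(query, document, k):
    """True if every k-word phrase of the query occurs somewhere in the document."""
    doc_terms = set(kwords(document, k))
    return all(term in doc_terms for term in kwords(query, k))

query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

print(kword_match(query, document, 2))  # bi-words: false positive
print(kword_match(query, document, 3))  # three-word phrases: true negative
```

Longer phrases improve on the false positive issue but do not remove it in general: a query longer than k words can still match a document in which its k-word pieces occur separately.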





148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

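Drawback

Query: Help ETH Zurich to flexibly react

Rewritten query: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

False positive: the document contains every bi-word of the query, but not the phrase itself.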

166
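The false positive can be checked mechanically. The sketch below assumes whitespace tokenization; `biword_set` and `contains_phrase` are hypothetical helper names chosen for this illustration.

```python
def biword_set(text):
    """Return the set of consecutive word pairs of a text."""
    words = text.split()
    return {(a, b) for a, b in zip(words, words[1:])}


def contains_phrase(document, phrase):
    """True if the words of phrase occur contiguously, in order, in document."""
    doc_words, phrase_words = document.split(), phrase.split()
    n = len(phrase_words)
    return any(doc_words[i:i + n] == phrase_words
               for i in range(len(doc_words) - n + 1))


query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

# What the bi-word index can test: every bi-word of the query is present.
biword_match = biword_set(query) <= biword_set(document)
# What the user actually asked for: the exact phrase.
exact_match = contains_phrase(document, query)

print(biword_match, exact_match)
```

The bi-word test succeeds while the exact phrase test fails, which is precisely the false positive shown on the slide.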

Part-of-speech tagging

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Extended bi-words:

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react

172
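As a rough sketch of how the extended bi-words could be derived, the code below pairs every N-tagged word with the next N-tagged word and skips any X-tagged words in between. The tags are supplied by hand (assuming the four X tags fall on "to", "so", "as", "to"); no real part-of-speech tagger is involved, and `extended_biwords` is a hypothetical name.

```python
def extended_biwords(tagged):
    """Pair each N word with the next N word, skipping X words in between.

    tagged is a list of (word, tag) pairs with tag "N" or "X".
    """
    kept = [word for word, tag in tagged if tag == "N"]
    return [f"{a} {b}" for a, b in zip(kept, kept[1:])]


# Hand-assigned tags (an assumption of this sketch, not tagger output).
tagged_sentence = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]

for bw in extended_biwords(tagged_sentence):
    print(bw)
```

Dropping the X words before pairing is equivalent to matching the pattern N X* N, because any run of X words between two N words is skipped as a whole.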

Bi-word indices

Query: Help ETH Zurich to flexibly react

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Rewritten query: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect

173

Phrase indices (3)

Query: Help ETH Zurich to flexibly react

Index

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Rewritten query: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect

174

Phrase indices (4)

Query: Help ETH Zurich to flexibly react

Index

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Rewritten query: "Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect

175

Improves on the false positive issue

Query: Help ETH Zurich to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

Rewritten query: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

True negative: the document does not contain "Zurich to flexibly".

176
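The effect of the phrase length can be sketched by generalizing bi-words to k consecutive words. The code assumes whitespace tokenization; `k_grams` and `index_match` are hypothetical names, and the set comparison stands in for the intersection of postings.

```python
def k_grams(text, k):
    """Return the set of all runs of k consecutive words in a text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def index_match(query, document, k):
    """True if every k-word phrase of the query occurs in the document.

    Note: a query shorter than k words has no k-grams and matches vacuously.
    """
    return k_grams(query, k) <= k_grams(document, k)


query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

match_with_2 = index_match(query, document, 2)  # bi-word index
match_with_3 = index_match(query, document, 3)  # phrase index (3)
print(match_with_2, match_with_3)
```

With k = 2 the document is still returned, whereas with k = 3 the missing phrase "Zurich to flexibly" rules it out.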

Drawback

The size of the vocabulary increases exponentially with the phrase length n: up to (number of terms)^n

179
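To get a feeling for this growth, the bound can simply be evaluated. The vocabulary size below is a made-up figure for illustration, and `max_phrase_vocabulary` is a hypothetical name; the result is an upper bound, since only phrases that actually occur in the collection need to be stored.

```python
def max_phrase_vocabulary(num_terms, n):
    """Upper bound on the number of distinct n-word phrases."""
    return num_terms ** n


num_terms = 100_000  # hypothetical single-word vocabulary size
for n in (1, 2, 3, 4):
    print(n, max_phrase_vocabulary(num_terms, n))
```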

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help: C, 1, [1]

Positional posting = document ID (as before), term frequency, positions

185

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index (term: document ID, term frequency, [positions])

Help: C, 1, [1]

ETH: C, 1, [2]

Zurich: C, 1, [3]

to: C, 3, [4, 7, 11]

flexibly: C, 1, [5]

react: C, 1, [6]

190
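A positional index of this shape can be built in one pass over each document. The sketch assumes whitespace tokenization and 1-based positions as on the slide; `build_positional_index` is a hypothetical name, and the term frequency is not stored separately because it equals the length of the position list.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {document ID: list of 1-based positions}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_positional_index(docs)

for term in ["Help", "ETH", "Zurich", "to", "flexibly", "react"]:
    positions = index[term]["C"]
    # term, document ID, term frequency, positions
    print(term, "C", len(positions), positions)
```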

Positional index

Query: ETH Zurich

ETH: C, 1, [2]

Zurich: C, 1, [3]

Match: Zurich occurs at the position of ETH + 1 (2 + 1 = 3)

193
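The "+1" check can be sketched for a two-word phrase as follows. The index literal mirrors the postings shown on the slides (term to document ID to positions); `phrase_match` is a hypothetical name, and longer phrases would need the same check chained over consecutive words.

```python
def phrase_match(index, first, second):
    """Return IDs of documents where second occurs right after first."""
    hits = set()
    postings_first = index.get(first, {})
    postings_second = index.get(second, {})
    # Only documents containing both terms can match.
    for doc_id in postings_first.keys() & postings_second.keys():
        second_positions = set(postings_second[doc_id])
        # The phrase occurs if some position of first, plus 1,
        # is a position of second.
        if any(p + 1 in second_positions for p in postings_first[doc_id]):
            hits.add(doc_id)
    return hits


index = {
    "ETH": {"C": [2]},
    "Zurich": {"C": [3]},
    "to": {"C": [4, 7, 11]},
}
print(phrase_match(index, "ETH", "Zurich"))
```

Unlike the bi-word and phrase indices, the vocabulary stays at single terms; the extra cost is moved into the postings, which now carry positions.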

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

171

Part-of-speech tagging

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Extended bi-words: Help ETH, ETH Zurich, Zurich introduce, introduce techniques, techniques flexibly, flexibly react
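As a minimal sketch of the N X* N pattern, the snippet below pairs each N-tagged word with the next N-tagged word and skips any X-tagged words in between. The tags are written by hand to mirror the slide; no real part-of-speech tagger is involved, and the function name is an assumption.

```python
def extended_biwords(tagged):
    """Pair consecutive N-tagged words, skipping X-tagged words between them."""
    content = [word for word, tag in tagged if tag == "N"]
    return list(zip(content, content[1:]))

# Hand-tagged sentence from the slide: N = content word, X = function word.
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]

pairs = extended_biwords(tagged)
```

With only two tag classes, every N X* N match is exactly a pair of neighbouring N words once the X words are dropped, which is why a simple filter followed by `zip` is enough here.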

173

Phrase indices (3)

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich
ETH Zurich to
Zurich to flexibly
to flexibly react
flexibly react to

Intersect: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

174

Phrase indices (4)

Query: "Help ETH Zurich to flexibly react"

Help ETH Zurich to
ETH Zurich to flexibly
Zurich to flexibly react
to flexibly react to

Intersect: "Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

175

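Improves on the false-positive issue

Query: "Help ETH Zurich to flexibly react"

Document: Help ETH Zurich to introduce techniques to flexibly react

"Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

True negative: the document does not contain "Zurich to flexibly", so it is no longer returned.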
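The effect of the phrase length can be checked with a short sketch that generalises bi-words to k-word terms. The helper names `kwords` and `matches` are assumptions made for illustration.

```python
def kwords(text, k):
    """Return all terms made of k consecutive words in text."""
    words = text.split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

def matches(doc, phrase, k):
    """True if every k-word term of the phrase also occurs in the document."""
    doc_terms = set(kwords(doc, k))
    return all(term in doc_terms for term in kwords(phrase, k))

doc = "Help ETH Zurich to introduce techniques to flexibly react"
phrase = "Help ETH Zurich to flexibly react"

match_k2 = matches(doc, phrase, 2)   # bi-words: the document is (wrongly) accepted
match_k3 = matches(doc, phrase, 3)   # 3-word terms: the document is rejected
```

Longer terms remove this particular false positive, but they do not remove the problem in general: a query longer than k words can still match a document that contains all k-word pieces without containing the whole phrase.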

178

Drawback

The size of the vocabulary increases exponentially with the phrase length n: up to (number of terms)^n
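The bound can be written out as follows, under the assumption that $V$ denotes the single-word vocabulary and $V_n$ the vocabulary of $n$-word terms (both symbols are introduced here for illustration only):

```latex
\[
  |V_n| \;\le\; |V|^{\,n}
\]
% Illustrative numbers (assumed, not from the slides):
% |V| = 10^5 and n = 2 give at most 10^{10} possible bi-words,
% and n = 3 gives at most 10^{15} possible 3-word terms.
```

This is a worst-case bound. The number of n-word terms that actually occur can never exceed the number of n-word windows in the collection, but it still grows quickly with n.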

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Term      Document ID (as before)  Term frequency  Positions
Help      C                        1               1
ETH       C                        1               2
Zurich    C                        1               3
to        C                        3               4, 7, 11
flexibly  C                        1               5
react     C                        1               6

Each row (document ID, term frequency, positions) is a positional posting.
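A minimal sketch of how such an index could be built, assuming whitespace tokenization and 1-based positions as on the slide. The nested-dictionary layout and the function name are assumptions, not the lecture's implementation.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {document ID: [1-based positions of the term]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_positional_index(docs)

to_positions = index["to"]["C"]      # positions of "to" in document C
to_frequency = len(to_positions)     # term frequency = number of positions
```

The term frequency does not need to be stored separately in this layout, because it equals the length of the position list.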

192

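Positional index

Query: "ETH Zurich"

ETH: document C, term frequency 1, position 2
Zurich: document C, term frequency 1, position 3

Position of Zurich = position of ETH + 1, so document C contains the phrase.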
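The "+1" check can be sketched for phrases of any length: the i-th query term must occur at the start position plus i. The index literal below copies the positions from the slides; the dictionary layout and the function name are assumptions.

```python
def phrase_match(index, phrase):
    """Return the documents where the phrase terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(t not in index for t in terms):
        return set()
    # Only documents containing every term can match.
    docs = set.intersection(*(set(index[t]) for t in terms))
    hits = set()
    for d in docs:
        for start in index[terms[0]][d]:
            # The i-th term must appear exactly i positions after the first one.
            if all(start + i in index[t][d] for i, t in enumerate(terms)):
                hits.add(d)
                break
    return hits

# term -> {document ID: positions}, values taken from the slides.
index = {
    "Help": {"C": [1]}, "ETH": {"C": [2]}, "Zurich": {"C": [3]},
    "to": {"C": [4, 7, 11]}, "flexibly": {"C": [5]}, "react": {"C": [6]},
}

hit = phrase_match(index, "ETH Zurich")
miss = phrase_match(index, "Zurich ETH")
```

Unlike bi-word or phrase indices, this check yields no false positives, at the cost of storing and comparing position lists.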

193

This week's reading

Chapter 2: The Term Vocabulary and Postings Lists

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive: the document contains every biword of the query, but not the phrase itself
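
The false positive can be reproduced in a few lines. This is a minimal sketch, assuming plain whitespace tokenization; the helper name `biwords` is hypothetical and not taken from the lecture.

```python
def biwords(text):
    """Return the list of adjacent word pairs (biwords) in text."""
    tokens = text.split()
    return [(a, b) for a, b in zip(tokens, tokens[1:])]


query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

# A biword index only knows which biwords occur in a document.
document_biwords = set(biwords(document))

# The phrase query is rewritten as an AND over its biwords.
matches_biword_query = all(b in document_biwords for b in biwords(query))

# Ground truth: does the phrase occur as a contiguous run of words?
# Padding with spaces avoids matching inside a longer word.
contains_phrase = f" {query} " in f" {document} "

print(matches_biword_query)  # True: every biword of the query is present
print(contains_phrase)       # False: the phrase itself is not in the document
```

The biword query matches although the phrase is absent, which is exactly the false positive shown on the slide.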

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

N N N X N N X X X N N

NX*N

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react
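
The pairing step can be sketched as follows. This is only an illustration: the N/X tags are copied by hand from the slide rather than produced by a real part-of-speech tagger, and the function name `extended_biwords` is hypothetical.

```python
def extended_biwords(tagged):
    """Yield (left, right) pairs for every tag sequence of the form N X* N."""
    previous = None  # most recent word tagged N
    for word, tag in tagged:
        if tag == "N":
            if previous is not None:
                yield (previous, word)
            previous = word
        # Words tagged X are skipped, so they never break a pair.


words = "Help ETH Zurich to introduce techniques so as to flexibly react".split()
tags = "N N N X N N X X X N N".split()  # assumption: tags as shown on the slide

pairs = list(extended_biwords(zip(words, tags)))
for left, right in pairs:
    print(left, right)
```

Each N word is paired with the next N word, however many X words sit in between, so "techniques so as to flexibly" contributes the pair "techniques flexibly".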

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND Zurich to AND to flexibly AND flexibly react|

Intersect

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

Intersect

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Intersect

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react
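
A rough sketch of a k-word phrase index makes the difference between 2-word and 3-word phrases concrete. The document IDs `A` and `B`, the function names, and the text of document `A` are assumptions made for this illustration; postings lists are intersected as sets for brevity.

```python
def kwords(text, k):
    """Return every run of k consecutive words in text."""
    tokens = text.split()
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def build_index(documents, k):
    """Map each k-word phrase to the sorted list of document IDs containing it."""
    index = {}
    for doc_id, text in documents.items():
        for phrase in set(kwords(text, k)):
            index.setdefault(phrase, []).append(doc_id)
    return {phrase: sorted(ids) for phrase, ids in index.items()}


def search(index, query, k):
    """Intersect the postings lists of all k-word phrases of the query."""
    result = None
    for phrase in kwords(query, k):
        postings = set(index.get(phrase, []))
        result = postings if result is None else result & postings
    return sorted(result or [])


# Hypothetical two-document collection.
documents = {
    "A": "Help ETH Zurich to flexibly react to new challenges",
    "B": "Help ETH Zurich to introduce techniques to flexibly react",
}
query = "Help ETH Zurich to flexibly react"

print(search(build_index(documents, 2), query, 2))  # ['A', 'B']: B is a false positive
print(search(build_index(documents, 3), query, 3))  # ['A']: B is now a true negative
```

With k = 3, document B lacks the phrase "Zurich to flexibly", so it drops out of the intersection.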

178

Inconvenient

Size of the vocabulary increases exponentially with the phrase length n

(number of terms)^n
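
To get a feeling for the growth, the sketch below compares the worst-case bound (number of terms)^n with the phrases that actually occur in one example sentence. The function name `distinct_phrases` is hypothetical, and a single sentence is of course far smaller than a real collection.

```python
def distinct_phrases(tokens, n):
    """Count the distinct n-word phrases that actually occur in tokens."""
    return len({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})


text = ("Help ETH Zurich to flexibly react to new challenges "
        "and to set new accents in the future")
tokens = text.split()
terms = len(set(tokens))  # size of the single-word vocabulary

for n in range(1, 5):
    upper_bound = terms ** n  # every possible combination of n terms
    print(n, distinct_phrases(tokens, n), upper_bound)
```

The bound is multiplied by the number of terms each time n grows by one, which is why long phrase indices quickly become impractical.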

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

(document C)

Index

Help C1 1

Positional posting: document ID (as before), term frequency, positions

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C
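
The construction above can be sketched in code. This is a minimal illustration, assuming whitespace tokenization and positions counted from 1 as on the slide; the function name and the nested-dictionary layout are choices made for this example, not the lecture's implementation.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map term -> {doc_id: [positions]}, with positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


documents = {
    "C": "Help ETH Zurich to flexibly react to new challenges "
         "and to set new accents in the future",
}
index = build_positional_index(documents)

for term in ["Help", "ETH", "Zurich", "to", "flexibly", "react"]:
    for doc_id, positions in index[term].items():
        # Printed as: term, document ID, term frequency, positions
        print(term, doc_id, len(positions), *positions)
```

The term frequency is simply the length of the position list, so it does not need to be stored separately in this sketch.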

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1: Zurich must occur at the position directly after ETH (2 + 1 = 3)
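
A two-word phrase query over a positional index could look like the sketch below. It assumes postings stored as term -> {document ID: positions}; the function name `phrase_match` and this layout are assumptions for illustration.

```python
def phrase_match(index, first, second):
    """Return doc IDs in which `second` occurs right after `first`."""
    hits = []
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    # Step 1: intersect on document ID, as in a non-positional index.
    for doc_id in sorted(first_postings.keys() & second_postings.keys()):
        second_positions = set(second_postings[doc_id])
        # Step 2: keep the document if some position p of `first`
        # has `second` at position p + 1.
        if any(p + 1 in second_positions for p in first_postings[doc_id]):
            hits.append(doc_id)
    return hits


# Postings copied from the slide: term -> {document ID: positions}.
index = {
    "ETH": {"C": [2]},
    "Zurich": {"C": [3]},
    "to": {"C": [4, 7, 11]},
}

print(phrase_match(index, "ETH", "Zurich"))  # ['C'], since 2 + 1 == 3
print(phrase_match(index, "Zurich", "ETH"))  # [], the words are in the wrong order
```

Unlike the biword approach, this check needs no extra vocabulary: word order is recovered from the stored positions.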

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

Help/N ETH/N Zurich/N to/X introduce/N techniques/N so/X as/X to/X flexibly/N react/N

Pattern: N X* N

Help ETH

ETH Zurich

Zurich introduce

introduce techniques

techniques flexibly

flexibly react
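The pairing on the tagging slide can be sketched in a few lines of Python. This is only an illustration: the tags are typed in by hand rather than produced by a real tagger, and the function name `extended_biwords` is made up for this example. It assumes every token is tagged either N or X.

```python
def extended_biwords(tagged_tokens):
    """Pair each N token with the next N token, skipping the X tokens in between."""
    # With only N and X tags, neighbours in the filtered list are exactly
    # the pairs of N tokens separated by zero or more X tokens.
    kept = [word for word, tag in tagged_tokens if tag == "N"]
    return [f"{a} {b}" for a, b in zip(kept, kept[1:])]


# Hand-written tags for the example sentence (assumption: no tagger is run).
tagged = [
    ("Help", "N"), ("ETH", "N"), ("Zurich", "N"), ("to", "X"),
    ("introduce", "N"), ("techniques", "N"), ("so", "X"), ("as", "X"),
    ("to", "X"), ("flexibly", "N"), ("react", "N"),
]
biwords = extended_biwords(tagged)
print(biwords)
```

The X tokens never appear in the output, so "Zurich to introduce" and "techniques so as to flexibly" each collapse to a single pair.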

172

Bi-word indices

Phrase query: Help ETH Zurich to flexibly react

Bi-word terms:

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Rewritten query: "Help ETH" AND "ETH Zurich" AND "Zurich to" AND "to flexibly" AND "flexibly react"

Intersect
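As a rough sketch of the idea on this slide, the following Python builds a bi-word index and answers a phrase query by intersecting the postings of its bi-words. The function names and the second document "D" are hypothetical and only serve the illustration; tokenization is a plain whitespace split.

```python
from collections import defaultdict


def biwords(text):
    """Return the list of adjacent word pairs of a text."""
    words = text.split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def build_biword_index(documents):
    """Map each bi-word to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in biwords(text):
            index[term].add(doc_id)
    return index


def phrase_query(index, phrase):
    """AND the bi-words of the phrase together by intersecting their postings."""
    postings = [index.get(term, set()) for term in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


documents = {
    "C": ("Help ETH Zurich to flexibly react to new challenges "
          "and to set new accents in the future"),
    "D": "ETH Zurich is in Zurich",  # hypothetical second document
}
index = build_biword_index(documents)
result = phrase_query(index, "Help ETH Zurich to flexibly react")
print(result)
```

Only document C contains all five bi-words of the query, so it is the only ID left after the intersection.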

173

Phrase indices (3)

Phrase query: Help ETH Zurich to flexibly react

Three-word terms:

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Rewritten query: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

Intersect

174

Phrase indices (4)

Phrase query: Help ETH Zurich to flexibly react

Four-word terms:

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Rewritten query: "Help ETH Zurich to" AND "ETH Zurich to flexibly" AND "Zurich to flexibly react"

Intersect

175

Improves on the false-positive issue

Phrase query: Help ETH Zurich to flexibly react

Document: Help ETH Zurich to introduce techniques to flexibly react

Rewritten query: "Help ETH Zurich" AND "ETH Zurich to" AND "Zurich to flexibly" AND "to flexibly react"

True negative: the document does not contain "Zurich to flexibly"
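The effect can be checked with a small Python sketch that compares bi-words with three-word terms on the query and document from this slide. The helper names `kword_terms` and `matches` are made up for the illustration, and a single document stands in for a whole index.

```python
def kword_terms(text, k):
    """Return the set of all runs of k adjacent words in a text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def matches(document, phrase, k):
    """Report a match when every k-word term of the phrase occurs in the document."""
    return kword_terms(phrase, k) <= kword_terms(document, k)


query = "Help ETH Zurich to flexibly react"
document = "Help ETH Zurich to introduce techniques to flexibly react"

biword_match = matches(document, query, 2)
threeword_match = matches(document, query, 3)
print(biword_match, threeword_match)
```

With k = 2 every bi-word of the query occurs somewhere in the document, so the document is reported even though it does not contain the phrase (a false positive). With k = 3 the term "Zurich to flexibly" is missing, so the document is correctly rejected.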

176

Drawback

Size of the vocabulary increases exponentially with the phrase length n:

(number of terms)^n
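To get a feeling for this growth, here is a small Python sketch that computes the bound for the vocabulary of the example sentence used in these slides. The function name `max_phrase_terms` is made up, and the figure is only an upper bound: a real collection contains far fewer distinct phrases than every possible combination of words.

```python
def max_phrase_terms(vocabulary_size, n):
    """Upper bound on distinct n-word terms: one free choice of word per slot."""
    return vocabulary_size ** n


sentence = ("Help ETH Zurich to flexibly react to new challenges "
            "and to set new accents in the future")
vocabulary = set(sentence.split())

for n in (1, 2, 3, 4):
    print(n, max_phrase_terms(len(vocabulary), n))
```

Even for this tiny vocabulary the bound grows by a factor equal to the vocabulary size with every additional word in the phrase.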

179

Positional index

Document C: Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Term      Document ID  Term frequency  Positions
Help      C            1               1
ETH       C            1               2
Zurich    C            1               3
to        C            3               4 7 11
flexibly  C            1               5
react     C            1               6

Positional posting: document ID (as before), term frequency, positions
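A minimal sketch of how such an index could be built in Python, assuming whitespace tokenization and 1-based positions as on the slides. The function name `build_positional_index` and the nested-dictionary layout are choices made for this example, not something prescribed by the course.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map term -> {document ID -> list of 1-based positions}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


documents = {
    "C": ("Help ETH Zurich to flexibly react to new challenges "
          "and to set new accents in the future"),
}
index = build_positional_index(documents)

for term in ("Help", "ETH", "Zurich", "to", "flexibly", "react"):
    for doc_id, positions in index[term].items():
        # One positional posting: document ID, term frequency, positions.
        print(term, doc_id, len(positions), *positions)
```

The term frequency does not need to be stored separately in this layout, since it is the length of the position list.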

190

Positional index

Phrase query: ETH Zurich

Term      Document ID  Term frequency  Positions
ETH       C            1               2
Zurich    C            1               3

Match: both terms occur in document C, and the position of Zurich is the position of ETH +1
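The "+1" check can be sketched in Python as follows. The index literal copies the postings from the slides (term, document ID, positions); the function name `phrase_match` is hypothetical, and the sketch only handles two-word phrases.

```python
def phrase_match(index, first, second):
    """Return IDs of documents where `second` occurs right after `first`."""
    hits = set()
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    # Only documents containing both terms can match.
    for doc_id in first_postings.keys() & second_postings.keys():
        following = set(second_postings[doc_id])
        # The phrase matches if some position of `first`, plus one,
        # is a position of `second`.
        if any(p + 1 in following for p in first_postings[doc_id]):
            hits.add(doc_id)
    return hits


# Postings as on the slides: term -> {document ID -> positions}.
index = {
    "ETH": {"C": [2]},
    "Zurich": {"C": [3]},
    "to": {"C": [4, 7, 11]},
}

print(phrase_match(index, "ETH", "Zurich"))
print(phrase_match(index, "Zurich", "ETH"))
```

Unlike the bi-word approach, the word order matters here without any extra vocabulary: reversing the two terms no longer matches.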

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

192

Positional index

Query: ETH Zurich

Term      Document ID   Term frequency   Positions
ETH       C             1                2
Zurich    C             1                3

Both terms occur in document C, and the position of Zurich is the position of ETH +1, so the phrase matches.
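The +1 check can be sketched as a small phrase-query function. The name `phrase_query`, the hard-coded index excerpt, and the simple list membership test are assumptions made to keep the example short; a real system would more likely merge sorted position lists.

```python
def phrase_query(index, phrase):
    """Return {document ID: [start positions]} where the words are consecutive."""
    words = phrase.split()
    if not words or any(word not in index for word in words):
        return {}
    matches = {}
    for doc_id, starts in index[words[0]].items():
        # Keep a start position only if the k-th following word sits at start + k.
        hits = [
            start for start in starts
            if all(
                start + offset in index[word].get(doc_id, [])
                for offset, word in enumerate(words[1:], start=1)
            )
        ]
        if hits:
            matches[doc_id] = hits
    return matches


# Excerpt of the positional index for document C: term -> {document ID: positions}.
index = {
    "ETH": {"C": [2]},
    "Zurich": {"C": [3]},
    "to": {"C": [4, 7, 11]},
}

print(phrase_query(index, "ETH Zurich"))  # {'C': [2]}
print(phrase_query(index, "Zurich ETH"))  # {}
```

The reversed query finds nothing because ETH does not occur at the position of Zurich +1, which is the word-order information that a non-positional index cannot provide.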

193

This week's reading

Chapter 2

The Term Vocabulary and Postings Lists

Recommended