8/2/2019 Introduction to information retrieval
90
Web
Web
2004 Pew Fallows 2004
92%
Web
1990
Web Web
Web
Stanford, Stuttgart
7590
8
1
2
3
4
5
15 Boolean retrieval
6 7
8
8 921 9
10
XML HTML
6 11
12 11
12
1318 1315
13
14 6
Rocchio kNN
k nearest neighbor
15
1618 16
K EM
17
18
1921 Web 19 Web
Web 20
21 Web
Cross-language IR (Grossman and Frieder 2004, Chapter 4; Oard and Dorr 1996)
Image and multimedia IR (Grossman and Frieder 2004, Chapter 4; Baeza-Yates and Ribeiro-Neto 1999, Chapters 6, 11, and 12; del Bimbo 1999; Lew 2001; Smeulders et al. 2000)
Speech retrieval (Coden et al. 2002)
Music retrieval (Downie 2006; http://www.ismir.net/)
User interfaces for IR (Baeza-Yates and Ribeiro-Neto 1999, Chapter 10)
Parallel and peer-to-peer (P2P) IR (Grossman and Frieder 2004, Chapter 7; Baeza-Yates and Ribeiro-Neto 1999, Chapter 9; Aberer 2001)
Digital libraries (Baeza-Yates and Ribeiro-Neto 1999, Chapter 15; Lesk 2004)
Information science perspective (Korfhage 1997; Meadow et al. 1999; Ingwersen and Jarvelin 2005)
Logic-based approaches to IR (van Rijsbergen 1989)
Natural language processing techniques (Manning and Schutze 1999; Jurafsky and Martin 2008; Lewis and Jones 1996)
21
15 6 7
810 11
11.1 1113 15
18
18.1 21
[*][**][***]
Lauren Cowles
Cheryl Aasheim, Josh Attenberg, Luc Belanger, Tom Breuel, Daniel Burckhardt,
Georg Buscher, Fazli Can, Dinquan Chen, Ernest Davis, Pedro Domingos, Rodrigo Panchiniak
Fernandes, Paolo Ferragina, Norbert Fuhr, Vignesh Ganapathy, Elmer Garduno, Xiubo Geng,
David Gondek, Sergio Govoni, Corinna Habets, Ben Handy, Donna Harman, Benjamin Haskell,
Thomas Huhn, Deepak Jain, Ralf Jankowitsch, Dinakar Jayarajan, Vinay Kakade, Mei Kobayashi,
Wessel Kraaij, Rick Lafleur, Florian Laws, Hang Li, David Mann, Ennio Masi, Frank McCown,
Paul McNamee, Sven Meyer zu Eissen, Alexander Murzaku, Gonzalo Navarro, Scott Olsson,
Daniel Paiva, Tao Qin, Megha Raghava, Ghulam Raza, Michal Rosen-Zvi, Klaus Rothenhausler,
Kenyu L. Runner, Alexander Salamanca, Grigory Sapunov, Tobias Scheffer, Nico Schlaefer,
Evgeny Shadchnev, Ian Soboroff, Benno Stein, Marcin Sydow, Andrew Turner, Jason Utt, Huey
Vo, Travis Wade, Mike Walsh, Changliang Wang, Renjing Wang, Thomas Zeume
James Allan, Omar Alonso, Ismail Sengor Altingovde, Vo Ngoc Anh, Roi
Blanco, Eric Breck, Eric Brown, Mark Carman, Carlos Castillo, Junghoo Cho, Aron Culotta, Doug
Cutting, Meghana Deodhar, Susan Dumais, Johannes Furnkranz, Andreas Hes, Djoerd Hiemstra,
David Hull, Thorsten Joachims, Siddharth Jonathan J. B., Jaap Kamps, Mounia Lalmas, Amy
Langville, Nicholas Lester, Dave Lewis, Stephen Liu, Daniel Lowd, Yosi Mass, Jeff Michels,
Alessandro Moschitti, Amir Najmi, Marc Najork, Giorgio Maria Di Nunzio, Paul Ogilvie, Priyank
Patel, Jan Pedersen, Kathryn Pedings, Vassilis Plachouras, Daniel Ramage, Stefan Riezler, Michael
Schiehlen, Helmut Schmid, Falk Nicolas Scholer, Sabine Schulte im Walde, Fabrizio Sebastiani,
Sarabjeet Singh, Alexander Strehl, John Tait, Shivakumar Vaithyanathan, Ellen Voorhees, Gerhard
Weikum, Dawid Weiss, Yiming Yang, Yisong Yue, Jian Zhang, Justin Zobel
Pavel Berkhin, Stefan Buttcher, Jamie Callan, Byron Dom, Torsten Suel, Andrew Trotman
1314 15 Ray Mooney
Ray Mooney 3
Ray Mooney
C. D. Manning
P. Raghavan Yahoo!
H. Schutze
http://informationretrieval.org
Email [email protected]
Snippet XML
Prabhakar Raghavan
http://ir.ict.ac.cn/~wangbin/iir-book/
http://nlp.stanford.edu/IR-book/information-retrieval-book.html
1
1.1 4
1.2 8
1.3 11
1.4 15
1.5 18
21
2.1 22
2.1.1 22
2.1.2 23
2.2 25
2.2.1 25
2.2.2 30
2.2.3 31
2.2.4 35
2.3 39
2.4 41
2.4.1 42
2.4.2 43
2.4.3 46
2.5 48
51
3.1 52
3.2 55
3.2.1 56
3.2.2 k-gram 57
3.3 59
3.3.1 59
3.3.2 60
3.3.3 61
3.3.4 k-gram 63
3.3.5 64
3.4 66
3.5 67
69
4.1 70
4.2 72
4.3 75
4.4 77
4.5 80
4.6 83
4.7 86
89
5.1 91
5.1.1 Heaps 93
5.1.2 Zipf 94
5.2 95
5.2.1 96
5.2.2 97
5.3 100
5.3.1 101
5.3.2 103
5.4 111
115
6.1 116
6.1.1 118
6.1.2 120
6.1.3 g 122
6.2 123
6.2.1 124
6.2.2 tf-idf 125
6.3 126
6.3.1 127
6.3.2 130
6.3.3 131
6.4 tf-idf 133
6.4.1 tf 133
6.4.2 tf 133
6.4.3 134
6.4.4 135
6.5 139
141
7.1 142
7.1.1 K 143
7.1.2 144
7.1.3 145
7.1.4 145
7.1.5 147
7.1.6 147
7.2 149
7.2.1 150
7.2.2 150
7.2.3 151
7.2.4 152
7.3 153
7.3.1 154
7.3.2 154
7.3.3 155
7.4 155
157
8.1 158
8.2 160
8.3 161
8.4 165
8.5 171
8.6 175
8.6.1 175
8.6.2 176
8.6.3 177
8.7 177
8.8 180
183
9.1 185
9.1.1 Rocchio 188
9.1.2 190
9.1.3 191
9.1.4 Web 193
9.1.5 193
9.1.6 194
9.1.7 195
9.1.8 195
9.2 196
9.2.1 196
9.2.2 196
9.2.3 198
9.3 200
XML 203
10.1 XML 206
10.2 XML 210
10.3 XML 215
10.4 XML 219
10.5 XML 223
10.6 225
229
11.1 230
11.2 232
11.2.1 1/0 232
11.2.2 233
11.3 233
11.3.1 235
11.3.2 237
11.3.3 239
11.3.4 240
11.4 242
11.4.1 242
11.4.2 243
11.4.3 Okapi BM25 244
11.4.4 IR 246
11.5 247
249
12.1 250
12.1.1 250
12.1.2 253
12.1.3 254
12.2 255
12.2.1 IR 255
12.2.2 256
12.2.3 Ponte Croft 259
12.3 262
12.4 LM 263
12.5 265
267
13.1 271
13.2 273
13.3 278
13.4 NB 280
13.5 286
13.5.1 287
13.5.2 2 290
13.5.3 292
13.5.4 292
13.5.5 293
13.6 294
13.7 300
303
14.1 305
14.2 Rocchio 307
14.3 k 311
14.4 316
14.5 321
14.6 323
14.7 330
333
15.1 334
15.2 341
15.2.1 341
15.2.2 343
15.2.3 344
15.2.4 347
15.3 348
15.3.1 349
15.3.2 351
15.4 ad hoc 355
15.4.1 355
15.4.2 357
15.5 359
363
16.1 365
16.2 368
16.3 370
16.4 K- 374
16.5 382
16.6 387
391
17.1 393
17.2 396
17.3 403
17.4 405
17.5 407
17.6 409
17.7 410
17.8 412
17.9 414
417
18.1 418
18.2 - SVD 422
18.3 424
18.4 LSI 427
18.5 432
Web 433
19.1 434
19.2 Web 436
19.2.1 Web 438
19.2.2 439
19.3 441
19.4 444
19.5 446
19.6 shingling 449
19.7 454
Web 455
20.1 456
20.1.1 456
20.1.2 457
20.2 457
20.2.1 458
20.2.2 DNS 462
20.2.3 URL 463
20.3 466
20.4 467
20.5 470
473
21.1 Web 474
21.2 PageRank 476
21.2.1 478
21.2.2 PageRank 480
21.2.3 PageRank 483
21.3 Hub Authority 486
21.4 492
495
531
1
Information Retrieval IR
Web
unstructured data
structured
data
semistructured data
Java threading
clustering
information retrieval retrieval information retrieval
information
retrieval
search information retrieval search
information retrieval
classification
Web web search
Web
personal information
retrieval
MacOS X Spotlight Windows Vista
domain-specific search
Web
. .
. .
.
Shakespeare's Collected Works
Brutus Caesar Calpurnia
Brutus
Caesar Calpurnia
grepping Unix
grep grepping
regular expression
00
()
() grep
Romans NEAR countrymen NEAR
()
index
,000
Brutus Caesar Calpurnia Brutus Marcus Brutus
Caesar
Julius Caesar Calpurnia Calpurnia
Pisonis
incidence matrix -term
. word
I-Hong Kong
- - td (t, d) 0
Brutus AND Caesar AND NOT Calpurnia
Brutus Caesar Calpurnia Calpurnia
complement AND
110100 AND 110111 AND 101111 = 100100
Antony
and CleopatraHamlet -
term
Antony and Cleopatra
Julius Caesar, The Tempest, Hamlet,
Othello, Macbeth, Antony
Cleopatra
           Antony and  Julius  The      Hamlet  Othello  Macbeth  ...
           Cleopatra   Caesar  Tempest
Antony     1           1       0        0       0        1
Brutus     1           1       0        1       0        0
Caesar     1           1       0        1       1        1
Calpurnia  0           1       0        0       0        0
Cleopatra  1           0       0        0       0        0
mercy      1           0       1        1       1        1
worser     1           0       1        1       1        0
...
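The query Brutus AND Caesar AND NOT Calpurnia can be answered directly from the incidence matrix by a bitwise AND over the term rows, complementing the row for Calpurnia. The following is a minimal Python sketch. The names `INCIDENCE`, `vector`, `complement`, and `plays_for` are illustrative assumptions, and the bit rows are assumed to follow the matrix for the six plays listed.

```python
# Plays (matrix columns), left to right.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One row per term: a 6-bit string, one bit per play (assumed values).
INCIDENCE = {
    "Antony":    "110001",
    "Brutus":    "110100",
    "Caesar":    "110111",
    "Calpurnia": "010000",
    "Cleopatra": "100000",
    "mercy":     "101111",
    "worser":    "101110",
}

# Mask that keeps only the bits that correspond to real columns.
MASK = (1 << len(PLAYS)) - 1


def vector(term):
    """Return the incidence row of a term as an integer bit vector."""
    return int(INCIDENCE[term], 2)


def complement(bits):
    """Bitwise NOT restricted to the width of the matrix."""
    return ~bits & MASK


def plays_for(bits):
    """Translate a result bit vector back into play titles."""
    width = len(PLAYS)
    return [play for i, play in enumerate(PLAYS)
            if (bits >> (width - 1 - i)) & 1]


# Brutus AND Caesar AND NOT Calpurnia
result = vector("Brutus") & vector("Caesar") & complement(vector("Calpurnia"))
print(format(result, "06b"), plays_for(result))
```

With these rows the result vector is 100100, which selects Antony and Cleopatra and Hamlet. The approach is only practical for tiny collections, because the matrix grows with the product of vocabulary size and collection size.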
pipeline leaks
pipeline rupture
effectiveness
precision
recall
-
0 00
- 000 0 00
0
000 00
-0 0 000000000 -
.%-0 /000 0
inverted index
-
dictionary vocabulary lexicon
dictionary - vocabulary
0 000
dictionary - vocabulary
posting
postings list, inverted list
postings -
ID .
..
-
.
()
Friends, Romans, countrymen. So let it be with Caesar ...
() token
tokenization
Friends Romans countrymen So ...
() friend roman countryman so ...
()
ID ID
ID
token
token
Brutus
Caesar
Calpurnia
0
.
sort-based indexing
docID
ID
-
-
-
document frequency
docID
ad hoc
disk
singly linked list
. skip list
variable length array
Unix Unix sort uniq
- ID
ID
term frequency
doc ID doc ID
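Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:

Term, docID pairs in order of occurrence:
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1,
capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2,
the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

Pairs sorted by term, then by docID:
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2,
did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1,
let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2

Dictionary and postings (term, document frequency -> postings list):
ambitious 1 -> 2
be 1 -> 2
brutus 2 -> 1 -> 2
capitol 1 -> 1
caesar 2 -> 1 -> 2
did 1 -> 1
enact 1 -> 1
hath 1 -> 2
I 1 -> 1
i' 1 -> 1
it 1 -> 2
julius 1 -> 1
killed 1 -> 1
let 1 -> 2
me 1 -> 1
noble 1 -> 2
so 1 -> 2
the 2 -> 1 -> 2
told 1 -> 2
you 1 -> 2
was 2 -> 1 -> 2
with 1 -> 2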
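The sort-based construction illustrated by this figure (collect term-docID pairs, sort them, then merge duplicates into postings lists) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the book's implementation: `tokenize` and `build_index` are hypothetical helper names, the tokenizer is deliberately crude, and unlike the figure it lowercases every token, including "I".

```python
from collections import defaultdict

# The two example documents, keyed by docID.
DOCS = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:",
}


def tokenize(text):
    """Crude tokenizer (assumption): split on whitespace, strip trailing
    punctuation, lowercase. Apostrophes are kept, so "i'" stays a token."""
    return [tok.strip(".,:;").lower() for tok in text.split()]


def build_index(docs):
    """Sort-based indexing: gather (term, docID) pairs, sort them, and
    collapse repeated pairs into one postings list per term."""
    pairs = [(term, doc_id)
             for doc_id, text in docs.items()
             for term in tokenize(text)]
    pairs.sort()  # primary key: term, secondary key: docID

    index = defaultdict(list)
    for term, doc_id in pairs:
        postings = index[term]
        # Pairs are sorted, so a duplicate can only be the last entry.
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return dict(index)


index = build_index(DOCS)
for term in sorted(index):
    # Document frequency is simply the length of the postings list.
    print(term, len(index[term]), index[term])
```

Because the postings lists come out sorted by docID, they can be intersected later with a single linear merge.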
cache
offset
traverse
-
disk seek
1-1 [*] 1-3
new home sales top forecasts
home sales rise in july
increase in home sales in july
july new home sales rise
1-2 [*]
breakthrough drug for schizophrenia
new schizophrenia drug
new approach for treatment of schizophrenia
new hopes for schizophrenia patients
a.
b. -
1-3 [*] 1-2
a. schizophrenia AND drug
b. for AND NOT (drug OR approach)
.
simple conjunctive query
Brutus AND Calpurnia -
-
() Brutus
()
() Calpurnia
()
() -
- - Brutus Calpurnia
intersection
merge
merge algorithm
-
-
ID ID ID
ID x y
O(x + y) Θ(N) N
Θ(·) O(·)
Cormen et al.0
Brutus
Calpurnia
0
INTERSECT(p1, p2)
answer ← ⟨ ⟩
while p1 ≠ NIL and p2 ≠ NIL
do if docID(p1) = docID(p2)
   then ADD(answer, docID(p1))
        p1 ← next(p1)
        p2 ← next(p2)
   else if docID(p1) < docID(p2)
        then p1 ← next(p1)
        else p2 ← next(p2)
return answer
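The merge-style intersection above translates naturally into Python when each postings list is a list of docIDs in increasing order. The following is a hedged sketch: index cursors stand in for the linked-list pointers of the pseudocode, and the two sample postings lists are illustrative values rather than real collection data.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) time,
    where x and y are the lengths of the two lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # The docID occurs in both lists: keep it, advance both cursors.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1 is behind, advance it
        else:
            j += 1  # p2 is behind, advance it
    return answer


# Illustrative postings lists (assumed values).
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))
```

Each comparison advances at least one cursor, so the loop runs at most x + y times. This only works because both lists are kept sorted by the same global docID order.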
ID
(Brutus OR Caesar) AND NOT Calpurnia -
query optimization
t
Brutus AND Caesar AND Calpurnia -
- -
(Calpurnia AND Brutus) AND Caesar -
(madding OR crowd) AND (ignoble OR strife) AND (killed OR slain) -
OR AND
AND
-
-
bash
-
1-4 [*] O(x + y) x y Brutus
Caesar
a. Brutus AND NOT Caesar
b. Brutus OR NOT Caesar
1-5 [*]
c. (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
INTERSECT(⟨t1, ..., tn⟩)
terms ← SORTBYINCREASINGFREQUENCY(⟨t1, ..., tn⟩)
result ← postings(first(terms))
terms ← rest(terms)
while terms ≠ NIL and result ≠ NIL
do result ← INTERSECT(result, postings(first(terms)))
   terms ← rest(terms)
return result
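The conjunctive-query procedure above processes terms in order of increasing document frequency, so that intermediate results stay as small as possible. A minimal Python sketch follows. The function name `intersect_terms`, the dictionary-based `index`, and the toy postings lists are assumptions made for illustration.

```python
def intersect(p1, p2):
    """Linear merge of two postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_terms(terms, index):
    """AND together the postings of all terms, rarest term first.
    `index` maps a term to its sorted postings list (assumed layout);
    a term missing from the index has an empty postings list."""
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    if not ordered:
        return []
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, index.get(term, []))
    return result


# Toy index with illustrative postings lists (assumed values).
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(intersect_terms(["brutus", "caesar", "calpurnia"], index))
```

Starting with the shortest list bounds every intermediate result by the size of the rarest term's postings, which is the reason the dictionary stores document frequencies.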
1-6 [**] AND OR
a. -
b.
c.
1-7 [*]
d. (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
eyes 213312
kaleidoscope 87009
marmalade 107913
skies 271658
tangerine 46653
trees 316812
1-8 [*]
e. friends AND romans AND (NOT countrymen)
countrymen
1-9 [**]
1-10 [**] x OR y 1-6
1-11 [**] x AND NOT y
.
ranked retrieval
model .
free text query
P-norm
0 0
0
AND, OR, NOT
term proximity, proximity
-
Westlaw Westlaw http://www.westlaw.com/ 0
00 Westlaw
Terms and Connectors
Westlaw Natural Language
Westlaw
Information on the legal theories involved in preventing the disclosure of trade
secrets by employees formerly employed by a competing company
"trade secret" /s disclos! /s prevent /s employe!
Requirements for disabled people to be able to access a workplace
disab! /p access! /s work-site work-place (employment /3 place)
Cases about a host's responsibility for drunk guests
host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest
Web
0 Web
& AND, /s
/p /k k
phrase search .
! . liab!
liab; work-site matches worksite, work-site,
work site ..
Westlaw
00
Westlaw Turtle,
Westlaw
AND
OR
()
() operating system
Westlaw
Gates NEAR Microsoft
()
()
ad hoc
ad hoc Web
Web
Web
1-12 [*] Westlaw professor
teacher lecturer explain
explain
1-13 [*] burglar
(i) burglar
(ii) burglar AND burglar
(iii) burglar OR burglar
(i) knight
(ii) conquer
(iii) knight OR conquer
.
0 0 Cleverdon; Liddy00
Bush
memex
memex
Information Retrieval Calvin Mooers 0
Mooers0
IBM
Taube and Wooster, H. P. Luhn
Mooers
George
Boole
AND, OR
Lee and Fox,
Witten Witten et al.
Zobel and Moffat, 00
Friedl 00 regular expression
Hopcroft et al. 000
A
A/B test A/B 177
Accents 32, 53
Access control lists 84
Accumulator 119, 132, 237
Accuracy 162, 285, 294, 295,
299, 373
Active learning 196, 350,
361
Add-one smoothing 275,
277, 278
Ad hoc retrieval ad hoc 6, 196,
264, 268, 282, 298, 315, 333, 334, 352,
353, 355, 358, 361
defined 2, 3, 4, 17, 22, 25, 26,
29, 42, 52, 55, 61, 62, 63, 79, 85, 90,
92, 94, 100, 104, 110, 118, 121, 125,
127, 133, 145, 160, 161, 162, 163, 165,
166, 167, 168, 170, 173, 180, 188, 195,
204, 208, 211, 213, 215, 216, 217, 219,
220, 221, 230, 234, 235, 237, 259, 262,
264, 271, 272, 276, 280, 283, 284, 287,
290, 292, 300, 301, 307, 308, 317, 318,
319, 321, 324, 325, 327, 329, 334, 335,
336, 341, 344, 345, 346, 352, 353, 358,
359, 365, 368, 369, 371, 372, 374, 377,
379, 382, 383, 386, 387, 392, 393, 396,
397, 400, 404, 405, 407, 409, 412, 413,
424, 427, 437, 438, 443, 446, 450, 451,
465, 479, 480, 481, 487, 492
evaluation of 7, 46, 68, 145,
155, 157, 158, 159, 160, 161, 163, 165,
167, 169, 170, 171, 173, 174, 175, 176,
177, 180, 181, 182, 193, 194, 200, 203,
206, 212, 219, 221, 222, 225, 226, 233,
260, 267, 271, 294, 295, 297, 298, 301,
324, 347, 353, 360, 363, 365, 369, 370,
371, 372, 374, 386, 388, 415, 432, 445,
449, 471
machine learning methods
22, 42, 152, 155, 174, 226, 298,
333, 334, 347, 348, 350, 355, 361, 482
Adjacency tables 468, 486
Adjusted Rand index
388
Adversarial information retrieval
441
Akaike information criterion (AIC)
381
Algebra, linear, review
Algorithmic search
Anchor text 264, 411, 412,
454, 459, 474, 475, 476, 483, 489, 492
Any-of classification 322,
323, 332
Auxiliary index 81, 82, 85
Average-link clustering
403
B
Back queues 464, 465, 466
Bag of words model
Unigram language model 123, 124,
127, 254, 282, 283, 324, 355
defined 2, 3, 4, 17, 22, 25, 26,
29, 42, 52, 55, 61, 62, 63, 79, 85, 90, 92, 94, 100, 104, 110, 118, 121, 125,
127, 133, 145, 160, 161, 162, 163, 165,
166, 167, 168, 170, 173, 180, 188, 195,
204, 208, 211, 213, 215, 216, 217, 219,
220, 221, 230, 234, 235, 237, 259, 262,
264, 271, 272, 276, 280, 283, 284, 287,
290, 292, 300, 301, 307, 308, 317, 318,
319, 321, 324, 325, 327, 329, 334, 335,
336, 341, 344, 345, 346, 352, 353, 358,
359, 365, 368, 369, 371, 372, 374, 377,
379, 382, 383, 386, 387, 392, 393, 396,
397, 400, 404, 405, 407, 409, 412, 413,
424, 427, 437, 438, 443, 446, 450, 451, 465, 479, 480, 481, 487, 492
Balanced F measure F F
measure F 163
Bayes error rate 316, 331
Bayesian networks 230, 246,
247
Bayesian prior 238, 240
Bayesian smoothing 265
Bayes Optimal Decision Rule 232
Bayes risk 233
Bayes Rule 231, 234, 280
Bernoulli model 259, 260,
267, 278, 279, 280, 281, 282, 283, 284,
285, 287, 289, 290, 292, 300, 383, 384
Best-merge persistence
400, 402, 404, 405, 413
Bias 254, 264, 286, 303, 305, 323, 324, 325, 326, 327, 328, 331, 335,
349
Bias-variance tradeoff -
254, 286, 327, 335
Biclustering 389
Bigram language model
Binary Independence Model (BIM)
Binary search tree
Biword indexes 42, 43,
44, 46
Blind relevance feedback
184, 194
Blocked sort-based indexing algorithm
BSBI 73,
74, 75, 76, 77, 78, 86
Blocked storage described
Blogs
BM25 weights BM25 230, 242,
244
Boolean retrieval 1, 4, 6, 11, 15, 17, 19, 28, 38, 83, 113, 119, 123,
154, 204
model 4, 6, 9, 11, 15, 16, 17,
18, 19, 29, 33, 67, 68, 87, 92, 94, 105,
108, 111, 115, 116, 123, 124, 126, 127,
141, 142, 149, 151, 152, 153, 154, 178,
182, 188, 189, 191, 192, 200, 203, 204,
205, 206, 207, 215, 222, 223, 224, 226,
229, 230, 231, 232, 233, 234, 235, 237, 238, 239, 240, 242, 243, 244, 246, 249,
250, 251, 252, 253, 254, 255, 256, 257,
139, 247, 258, 257, 258, 259, 260, 261,
262, 263, 264, 265, 267, 273, 274, 277,
278, 279, 280, 281, 282, 283, 284, 285,
286, 287, 289, 290, 292, 293, 300, 301,
303, 304, 305, 306, 323, 324, 326, 327,
328, 334, 335, 336, 338, 339, 343, 349,
355, 359, 363, 368, 380, 381, 382, 383, 384, 386, 387, 389, 415, 432, 433, 440,
441, 442
principles 210, 229, 230, 232,
233, 265, 291, 406
query processing 9, 15,
17, 28, 30, 31, 38, 57, 82, 83, 99, 108,
113, 139, 147, 148, 151, 155, 466, 467,
470
ranked retrieval vs.
tokenization 8, 22, 25, 26,
27, 28, 34, 42, 95, 152, 230, 314
vector space model interactions
Boosting
Bottom-up clustering
hierarchical agglomerative
clustering (HAC)
393, 410
Bowtie structure 439
Break-even point
168
BSBI (blocked sort-based indexing
algorithm)
73, 74, 75, 76, 77, 78, 86
Buckshot algorithm Buckshot
378, 414
Buffer 71, 74, 181
C
Caching 71, 90, 109
compression and
in search systems 90
variable length arrays and
Capitalization 32, 33
Capture-recapture method
Cardinality in clustering
Case-folding 33, 73, 91,
92, 93, 100
CAS topics CAS 220, 222, 224
Category 3, 22, 25, 161, 162,
172, 196, 238, 256, 268, 269, 270, 271,
272, 273, 274, 275, 277, 279, 280, 281,
282, 283, 284, 286, 287, 288, 289, 290,
291, 292, 293, 294, 295, 296, 297, 298,
299, 300, 304, 305, 307, 308, 309, 310,
311, 312, 313, 314, 316, 317, 318, 319,
320, 321, 322, 323, 324, 325, 326, 327,
328, 329, 330, 332, 334, 335, 336, 339,
340, 343, 344, 348, 349, 350, 351, 352,
357, 358, 360, 364, 371, 372, 379, 386,
388, 413, 435, 444, 445
Centroid-based classification
331
Centroids 188, 189, 190, 192,
304, 306, 307, 308, 309, 310, 311, 314,
327, 329, 331, 369, 374, 375, 376, 377,
378, 379, 381, 382, 384, 386, 388, 391,
392, 396, 400, 401, 404, 405, 406, 407,
409, 411, 412, 413, 414, 415, 432
HAC 392, 393, 395, 396, 399, 400, 401, 403, 404, 405,
406, 408, 409, 410, 412, 413, 414
Rocchio classification Rocchio
307, 308, 309, 310, 311, 316, 329, 331,
374
Chaining in clustering
399
Chain rule 231
Champion lists 145, 146, 149, 155
Character sequence decoding
21, 22, 23
χ2
feature selection χ2 291,
292
Chinese 12, 29, 35, 48, 49, 67,
100, 124, 129, 142, 146, 147, 150, 174,
178, 207, 215, 244, 276, 305, 307, 319,
327, 334, 358, 368, 374, 407, 430, 453
Class boundary 311, 319,
320
Classes 3, 22, 25, 161, 162, 172,
196, 238, 256, 268, 269, 270, 271, 272,
273, 274, 275, 277, 279, 280, 281, 282,
283, 284, 286, 287, 288, 289, 290, 291,
292, 293, 294, 295, 296, 297, 298, 299,
300, 304, 305, 307, 308, 309, 310, 311,
312, 313, 314, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328,
329, 330, 332, 334, 335, 336, 339, 340,
343, 344, 348, 349, 350, 351, 352, 357,
358, 360, 364, 371, 372, 379, 386, 388,
413, 435, 444, 445
maximum a posteriori
Classification Text
classification 3, 22, 25, 26, 27, 28,
126, 134, 158, 159, 160, 161, 162, 184,
191, 197, 230, 260, 267, 268, 269, 270,
271, 272, 273, 275, 276, 277, 278, 279,
280, 282, 284, 285, 286, 287, 288, 289,
290, 292, 293, 294, 295, 296, 297, 298, 299, 300, 303, 304, 305, 306, 307, 308,
309, 310, 311, 312, 313, 314, 315, 316,
317, 301, 318, 317, 318, 319, 320, 321,
322, 323, 324, 325, 326, 327, 328, 329,
330, 331, 332, 333, 334, 335, 336, 337,
339, 340, 341, 342, 343, 344, 345, 347,
348, 349, 350, 351, 352, 353, 354, 355,
356, 357, 358, 360, 361, 364, 370, 374,
380, 382, 384, 412, 427, 435, 436, 454, 471, 483
any-of 272, 305, 321, 322,
323, 332
centroid-based 304, 331
kNN k nearest neighbor
classification (kNN) 284, 297, 298,
301, 304, 305, 306, 311, 312, 313, 314,
315, 316, 320, 321, 323, 326, 327, 329,
331, 343, 348, 349, 412
multivalue 321
one-of 272, 299, 305, 321,
322, 323, 331
one-versus-all 343
Rocchio Rocchio
Rocchio classification 307, 308,
309, 310, 311, 316, 329, 331, 374
Classification function 271,
272, 304, 321, 339, 340
Classifiers 22, 162, 191, 230,
268, 270, 271, 272, 275, 276, 278, 279,
284, 285, 286, 287, 289, 292, 293, 294,
295, 296, 297, 298, 299, 300, 301, 303,
305, 306, 310, 311, 312, 314, 315, 316,
317, 318, 319, 320, 321, 322, 323, 324,
325, 326, 327, 328, 329, 330, 331, 332,
334, 335, 336, 337, 339, 340, 342, 343,
344, 345, 347, 349, 350, 351, 352, 353,
354, 355, 356, 357, 358, 360, 361
choosing 18, 23, 25, 26, 30, 34,
36, 37, 40, 44, 52, 57, 60, 62, 64, 73,
74, 78, 81, 83, 85, 86, 106, 111, 117, 122, 128, 130, 131, 144, 146, 148, 149,
154, 164, 168, 172, 173, 174, 177, 178,
181, 186, 196, 208, 210, 218, 224, 226,
245, 256, 262, 267, 271, 273, 274, 280,
282, 283, 284, 285, 286, 287, 288, 289,
290, 291, 292, 293, 298, 300, 301, 304,
312, 314, 318, 319, 323, 325, 327, 331,
335, 343, 344, 347, 349, 350, 351, 352,
354, 357, 361, 367, 370, 375, 376, 378, 379, 380, 381, 382, 383, 386, 389, 392,
393, 395, 396, 403, 406, 409, 410, 411,
413, 414, 415, 435, 437, 438, 440, 441,
446, 447, 448, 449, 453, 457, 459, 464,
465, 466, 469, 470, 474, 475, 477, 478,
484, 487, 489, 492
performance improving
3, 9, 10, 11, 13, 18, 37, 39, 40, 46,
48, 49, 58, 62, 77, 90, 108, 109, 112, 131, 135, 139, 149, 165, 173, 179, 182,
185, 186, 189, 190, 192, 193, 194, 195,
196, 198, 199, 219, 223, 238, 265, 271,
285, 286, 288, 289, 315, 322, 334, 335,
343, 347, 350, 351, 352, 353, 354, 360,
366, 367, 368, 388, 412, 430, 431, 436,
440, 444, 448, 449, 450, 457, 474, 476,
493
two-class 3, 101, 135, 139, 172, 178, 184, 268, 281, 294, 296, 307,
309, 310, 317, 319, 320, 326, 327, 331,
335, 337, 339, 343, 346, 373, 386, 435,
456, 487
CLEF collection CLEF
Click spam 443
Clickstream mining 177,
195
Clickthrough log analysis
177, 181
Cliques 246, 398
Cloaking, in spamming
Cluster-based classification 331
Cluster hypothesis 365, 367,
368, 387
Clustering 2, 3, 26, 126, 147,
148, 265, 309, 352, 363, 364, 365, 366,
367, 368, 369, 370, 371, 372, 373, 374,
376, 377, 378, 379, 380, 381, 382, 383,
384, 385, 386, 387, 388, 389, 391, 392,
393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408,
409, 410, 411, 412, 413, 414, 415, 418,
427, 431, 432, 453, 467
average-link 403
cardinality in 30, 110, 181, 192,
223, 226, 232, 242, 285, 286, 301, 314,
334, 379, 434
centroid-based 304, 331
chaining in 399, 409
complete-link HAC
divisive 391, 392, 409, 410,
415
exclusive vs. exhaustive
flat Flat clustering
363, 364, 365, 368, 370, 374, 379, 392,
394, 395, 409, 410, 413, 414
group-average agglomerative 391, 400, 403
hard 19, 40, 69, 70, 71, 75, 77,
78, 87, 90, 96, 112, 163, 179, 260, 331,
347, 365, 368, 369, 381, 383, 384, 386,
437, 466, 490
hierarchical Hierarchical
clustering 150, 151, 152, 153, 210,
221, 272, 351, 360, 391, 392, 394, 395,
407, 409, 410, 415, 434, 435, 446
minimum variance 414
model-based 264, 363,
382, 383, 415
optimal 13, 15, 40, 103, 104, 105, 110, 112, 188, 232, 233, 245, 259,
277, 293, 298, 301, 314, 316, 323, 324,
325, 331, 334, 339, 340, 342, 370, 376,
378, 379, 380, 386, 389, 391, 392, 407,
408, 409, 414, 469
overview 19, 455, 456
single-link HAC HAC
spectral 415
top-down 211, 392, 393, 409, 410
Clusters 70, 77, 80, 466
pruning 155
Co-clustering 389
Collections 2, 4, 6, 9, 11, 12,
15, 17, 18, 25, 30, 34, 44, 45, 47, 52,
58, 60, 65, 70, 72, 73, 74, 75, 77, 78,
79, 80, 81, 85, 90, 91, 92, 93, 94, 95,
96, 97, 98, 99, 100, 101, 106, 108, 109, 111, 112, 116, 117, 119, 121, 124, 125,
128, 129, 130, 131, 135, 136, 137, 138,
144, 148, 151, 152, 158, 159, 160, 161,
162, 165, 167, 168, 169, 170, 171, 173,
174, 175, 176, 178, 181, 184, 185, 188,
189, 191, 192, 194, 196, 198, 199, 200,
204, 207, 208, 210, 212, 213, 214, 216,
217, 219, 220, 224, 225, 232, 234, 235,
237, 238, 239, 240, 241, 242, 244, 246, 255, 256, 258, 259, 260, 261, 262, 264,
268, 271, 277, 288, 292, 293, 296, 298,
300, 304, 307, 311, 312, 314, 315, 328,
358, 359, 365, 366, 367, 368, 370, 371,
373, 377, 383, 388, 392, 395, 397, 400,
409, 412, 413, 415, 418, 424, 427, 428,
431, 434, 457, 467
clustering 2, 3, 26, 126, 147,
148, 265, 309, 352, 363, 364, 365, 366,
367, 368, 369, 370, 371, 372, 373, 374,
376, 377, 378, 379, 380, 381, 382, 383,
384, 385, 386, 387, 388, 389, 391, 392,
393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408,
409, 410, 411, 412, 413, 414, 415, 418,
427, 431, 432, 453, 467
frequency 9, 10, 13, 14, 15,
18, 30, 43, 45, 46, 47, 53, 65, 71, 72,
79, 83, 86, 91, 92, 94, 95, 96, 97, 100,
104, 106, 108, 113, 115, 123, 124, 125,
129, 131, 132, 133, 134, 135, 138, 139,
149, 155, 158, 176, 177, 178, 190, 192, 213, 217, 218, 219, 234, 238, 239, 241,
244, 245, 246, 257, 258, 261, 262, 263,
274, 275, 279, 285, 286, 290, 292, 293,
294, 304, 315, 359, 372, 412, 415, 424,
439, 448, 456, 457, 479, 481
residual defined
statistics 7, 9, 83, 85, 91,
116, 191, 198, 213, 218, 219, 220, 256,
355, 467
large 96, 111
Combination schemes
46
Combination similarity
393, 394, 398, 407, 408, 409
Complete-linkage clustering
391, 396, 397, 398, 399, 402, 403,
404, 408
Complete-link clustering 406
Component coverage 220,
221
Compound nouns 28, 360
Compound-splitter 28
Compression 7, 14, 23, 30, 40,
44, 45, 70, 71, 73, 75, 77, 83, 86, 89,
90, 91, 92, 94, 95, 96, 97, 98, 99, 100,
101, 102, 97, 98, 99, 100, 101, 102,
103, 104, 105, 106, 107, 108, 109, 110,
111, 112, 109, 112, 113, 99, 169, 468,
104, 105, 106, 107, 108, 112, 109, 110,
111, 112, 113, 103
of dictionaries 89, 91, 94,
95, 98, 100, 109, 112
of docIDs ID
lossless/lossy
parameter-free
parameterized
of postings list 91
Compression/indexes
Heaps' law Heaps 93, 111, 315
overview 19, 455, 456
Zipf's law Zipf 94, 95, 106,
108, 111, 439
Concept drift 284, 285, 298,
301
Conditional independence assumption
235, 281, 282, 283
Confusion matrix 322, 323,
386
Connected components 398
Connectivity queries 467,
468, 470
Connectivity servers 455,
467, 471
Content management systems
70, 87
Content seen module
461
Context, XML XML 216, 217,
218
Context resemblance
216, 219
Contiguity hypothesis 304,
311, 365
Continuation bit 101, 102
Corpus 6, 30, 72, 73, 74, 75,
102, 161, 294
Cosine similarity 127,
128, 129, 130, 138, 139, 142, 143, 144,
145, 147, 148, 150, 188, 217, 243, 306,
309, 313, 329, 355, 356, 369, 387, 393, 404, 427, 428, 476
CO topics CO 220, 224
CPC (cost per click)
442
CPM (cost per mil)
441
Cranfield collection Cranfield
Cross-entropy 264
Cross-language information retrieval 161, 432, 490
Cumulative gain 169, 181
D
Databases 2, 70, 87, 204, 205,
206, 223, 224, 226, 227, 437
communication with
relational 204, 223,
224, 226, 227
-codes 103, 110, 112
Decision boundaries 307,
317, 319, 320, 326, 327, 328, 329, 334,
336, 337, 339, 341, 347, 357, 359
defined 2, 3, 4, 17, 22, 25, 26,
29, 42, 52, 55, 61, 62, 63, 79, 85, 90,
92, 94, 100, 104, 110, 118, 121, 125,
127, 133, 145, 160, 161, 162, 163, 165,
166, 167, 168, 170, 173, 180, 188, 195,
204, 208, 211, 213, 215, 216, 217, 219,
220, 221, 230, 234, 235, 237, 259, 262,
264, 271, 272, 276, 280, 283, 284, 287,
290, 292, 300, 301, 307, 308, 317, 318,
319, 321, 324, 325, 327, 329, 334, 335,
336, 341, 344, 345, 346, 352, 353, 358,
359, 365, 368, 369, 371, 372, 374, 377,
379, 382, 383, 386, 387, 392, 393, 396,
397, 400, 404, 405, 407, 409, 412, 413,
424, 427, 437, 438, 443, 446, 450, 451,
465, 479, 480, 481, 487, 492
Decision hyperplanes 305,
317, 319, 320, 335, 336
Decision trees 297, 300, 334,
350
Dendrograms 393, 394, 397,
398, 406, 413
complete-link clustering
391, 396, 397, 398, 399, 402, 403,
404, 408
described 61, 142, 144, 160,
168, 198, 204, 205, 207, 209, 222, 224, 231, 242, 260, 275, 285, 301, 310, 320,
322, 324, 349, 363, 367, 368, 378, 381,
396, 404, 412, 432, 454, 468, 471, 474,
475, 476, 491
Development sets 298, 349
Development test collection
159, 245
Diacritics 32, 33
Dice coefficient Dice 170
Dictionaries 7, 8, 9, 10, 11, 13,
17, 21, 22, 25, 26, 28, 29, 31, 34, 35,
38, 51, 52, 53, 54, 55, 57, 58, 59, 60,
75, 76, 77, 80, 86, 89, 90, 91, 92, 94,
95, 96, 97, 98, 48, 67, 97, 98, 99, 100,
109, 112, 117, 118, 125, 143, 145, 154,
184, 196, 197, 198, 199, 200, 201, 254,
260, 264, 283, 448, 466, 468, 469
compression of 89, 91, 94, 95, 98, 100, 109, 112
in inverted indexes
52
search structures for
Differential cluster labeling
410
Digital libraries 204
Discrete-time stochastic processes
478
Disk seek 11, 81
Distortion 380, 381, 395
Distributed crawling 461,
466, 471
Distributed index 69, 70,
77, 78, 80, 86, 455, 456, 466, 471
Distributed indexing
Distributed information retrieval
Divisive clustering 391,
409, 410, 415
DNS resolution DNS 462, 463,
471
DNS resolution module DNS
DNS server DNS 462, 463
DocIDs ID 9, 10, 41, 45, 79, 82,
83, 92, 100, 101, 102, 109, 112, 113,
121, 146, 147, 150, 151, 261, 276, 299,
356, 385
compression of ID
in inverted indexes
ID
in postings list intersection operations
Document-at-a-time scoring
Document collection
Collections 2, 4, 6, 9, 11, 12, 15, 17,
18, 25, 30, 34, 44, 45, 47, 52, 58, 60,
65, 70, 72, 73, 74, 75, 77, 78, 79, 80,
81, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 106, 108, 109, 111,
112, 116, 117, 119, 121, 124, 125, 128,
129, 130, 131, 135, 136, 137, 138, 144,
148, 151, 152, 158, 159, 160, 161, 162,
165, 167, 168, 169, 170, 171, 173, 174,
175, 176, 178, 181, 184, 185, 188, 189,
191, 192, 194, 196, 198, 199, 200, 204,
207, 208, 210, 212, 213, 214, 216, 217,
219, 220, 224, 225, 232, 234, 235, 237,
238, 239, 240, 241, 242, 244, 246, 255,
256, 258, 259, 260, 261, 262, 264, 268,
271, 277, 288, 292, 293, 296, 298, 300,
304, 307, 311, 312, 314, 315, 328, 358, 359, 365, 366, 367, 368, 370, 371, 373,
377, 383, 388, 392, 395, 397, 400, 409,
412, 413, 415, 418, 424, 427, 428, 431,
434, 457, 467
Document likelihood model
263
Document-partitioned index
77
Documents 15, 37, 60, 151, 178, 204, 207, 216, 223, 227, 268, 270,
275, 431, 434, 476
character sequence decoding
classification of
Text classification 126, 330, 350,
353
defined
delineation of 21, 22, 35, 37, 38, 42, 47, 75, 78, 79, 80, 86, 92, 125, 139,
151, 152, 153, 163, 171, 173, 177, 181,
194, 197, 198, 199, 201, 205, 239, 246,
301, 322, 330, 346, 388, 389, 410, 420,
421, 440, 441, 454, 457, 458, 459, 462,
466, 467, 473, 474, 475, 476, 489, 490,
492, 493
frequency defined
function notations
partitioning 38, 49, 73, 77, 78,
79, 80, 211, 256, 369, 370, 371, 409,
410, 415, 463
relevant, retrieving
unit, choosing
vector, defined 327
Document space 234, 271,
280, 304, 352
Document zones 123, 353,
355
Doorway pages 440
Dot products 127, 128, 130, 131,
136, 306, 308, 329, 339, 342, 343, 345, 346, 402, 403, 404, 412
described 61, 142, 144, 160,
168, 198, 204, 205, 207, 209, 222, 224,
231, 242, 260, 275, 285, 301, 310, 320,
322, 324, 349, 363, 367, 368, 378, 381,
396, 404, 412, 432, 454, 468, 471, 474,
475, 476, 491
in SVMs SVM
Duplicate elimination modules 459
Dynamic indexing 69, 70,
80, 83, 105, 467
Dynamic summary 178, 179,
180, 181
E
East Asian languages
ChineseJapanese 28, 48, 161
Edit distance 59, 61, 62, 63,
64, 65, 67
Effectiveness 7, 28, 33, 37, 38,
40, 42, 46, 48, 49, 68, 92, 110, 111,
112, 127, 135, 158, 160, 161, 162, 163,
166, 168, 173, 177, 180, 181, 182, 185,
189, 190, 191, 192, 193, 194, 199, 200,
219, 222, 224, 232, 247, 257, 258, 260,
261, 262, 263, 272, 279, 283, 284, 289,
290, 292, 293, 295, 296, 297, 298, 301,
313, 315, 317, 320, 322, 323, 328, 331,
334, 347, 349, 350, 351, 352, 353, 354,
360, 361, 364, 365, 367, 368, 370, 381,
388, 399, 420, 431, 458, 490
assessment of 27, 34, 157,
158, 159, 160, 168, 170, 171, 172, 173,
174, 181, 189, 190, 191, 193, 194, 195,
196, 212, 220, 230, 269, 288, 289, 292,
296, 304, 312, 317, 351, 355, 357, 358,
359, 370, 371
text classification 158,
160, 161, 260, 267, 268, 269, 270, 271,
272, 273, 276, 277, 280, 284, 285, 286, 292, 293, 294, 295, 297, 298, 300, 303,
301, 304, 301, 303, 304, 305, 309, 310,
311, 315, 316, 319, 323, 324, 325, 328,
331, 333, 334, 341, 343, 347, 348, 350,
351, 352, 353, 354, 355, 360, 361, 382
Efficiency 9, 10, 13, 15, 31, 39,
46, 49, 62, 72, 77, 82, 103, 105, 109,
110, 112, 131, 135, 149, 190, 192, 195,
204, 271, 286, 293, 295, 296, 305, 315, 329, 350, 352, 365, 368, 378, 388, 392,
402, 410, 412, 457
Eigen decomposition 421,
422
Eigenvalues 418, 419, 420,
421, 422, 423, 425, 427, 478, 480, 488
11-point interpolated average precision
166
Email 27
document units 23, 25,
210
sorting 3, 4, 9, 13, 19, 25, 33,
72, 79, 117, 160, 213, 269, 350, 360,
399, 446, 466
EM algorithm EM 259, 365,
382, 383, 384, 385, 386, 387, 389, 392
Enterprise resource planning
87, 205
Enterprise search 70, 87, 95
Entropy 104, 105, 106, 112, 264,
300, 372, 373
Equivalence classes 22, 31, 32,
34, 37
Ergodic Markov Chain
479, 480, 484
Euclidean distance
138
Euclidean length 127
Evaluation of retrieval systems
7, 160
A/B test 177
ad hoc 6, 196, 264,
268, 282, 298, 315, 333, 334, 352, 353,
355, 358, 361
clustering 2, 3, 26, 126, 147,
148, 265, 309, 352, 363, 364, 365, 366,
367, 368, 369, 370, 371, 372, 373, 374,
376, 377, 378, 379, 380, 381, 382, 383,
384, 385, 386, 387, 388, 389, 391, 392,
393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408,
409, 410, 411, 412, 413, 414, 415, 418,
427, 431, 432, 453, 467
F measure F 163, 164, 165, 168,
180, 221, 371, 374, 388
interpolated precision
165, 166, 167, 170, 171
kappa statistic kappa 172,
173, 175, 181
keyword-in-context snippets
MAP 167
marginal relevance 222
normalized discounted cumulative gain
169
overview 19, 455, 456
pooling 168, 171, 181, 296
precision at k
precision-recall curve
165, 166, 167, 168, 169, 193
probabilistic information retrieval
229, 234, 242, 243, 247, 229,
234, 242, 243, 247, 265
ranked sets
relevance assessment
157, 158, 159, 160, 168, 170, 171, 172,
173, 174, 181, 194, 212, 220, 355, 357,
359
relevance feedback
177, 183, 184, 185, 186, 187,
188, 189, 190, 191, 192, 193, 194, 195, 200, 193, 196, 193, 200, 184, 185, 186,
187, 188, 196, 188, 189, 190, 191, 192,
193, 194, 193, 194, 195, 196, 200, 230,
234, 236, 239, 240, 241, 242, 244, 246,
230, 234, 236, 195, 240, 241, 242, 244,
246, 230, 234, 236, 239, 240, 241, 242,
244, 246, 230, 234, 236, 239, 240, 241,
242, 244, 246, 262, 263, 264, 262, 263,
264, 262, 263, 264, 262, 263, 239, 309, 310, 311, 330, 309, 310, 311, 330, 309,
310, 311, 330, 309, 310, 311, 330, 194,
195, 196, 200, 194, 195, 264
results snippets 157, 177,
178, 179, 180, 181
ROC curve ROC 169, 361
R-precision 168, 169,
170, 180, 181
sensitivity 169
specificity
summarization, static vs. dynamic
system quality/user utility
test collections, standard
157, 158, 160
text classification 158,
160, 161, 260, 267, 268, 269, 270, 271, 272, 273, 276, 277, 280, 284, 285, 286,
292, 293, 294, 295, 297, 298, 300, 303,
301, 304, 301, 303, 304, 305, 309, 310,
311, 315, 316, 319, 323, 324, 325, 328,
331, 333, 334, 341, 343, 347, 348, 350,
351, 352, 353, 354, 355, 360, 361, 382
text summarization 152,
158, 180, 195, 415
unranked sets 224
XML retrieval XML 203, 205,
206, 209, 210, 212, 213, 215, 219, 221,
222, 223, 224, 219, 226, 205, 206, 209,
222, 212, 213, 215, 219, 221, 222, 223, 224, 226, 227, 262, 221, 222, 223, 224,
226, 227, 224, 226, 227
Evidence accumulation
Exclusive clustering
Exhaustive clustering 369
Expectation-Maximization (EM)
algorithm 259, 365,
382, 383, 384, 385, 386, 387, 389, 392
Expectation step 383, 384, 387
Expected edge density
388
Extended query 196, 214,
216
Extensible Markup Language
XML 205
External criterion of quality
External sorting algorithm 73, 75
F
False negative 162, 371, 386
False positive 162, 163, 169,
371, 386
Feature engineering 331,
352, 354, 359
Feature selection/text classification
χ² (chi-square) 286, 290
frequency-based 292, 293
method comparison
multiple classifiers 343
mutual information 286,
287, 288, 289, 292, 294, 300, 301, 352,
354, 371, 372, 386, 411
noise feature 284, 286,
287, 316, 319
overfitting 286, 356
overview 19, 455, 456
in performance improvement
350, 351
statistical significance
180, 260, 292, 293, 301
Fetch modules
Field 116, 117, 118, 152, 204,
206, 223, 224, 247
Filtering 2, 6, 56, 57, 58, 81, 91,
196, 209, 223, 268, 273, 327, 330, 348,
350, 352, 450, 459, 460, 461
First story detection 409, 414
Flat clustering 363, 364,
365, 368, 370, 374, 379, 392, 394, 395,
409, 410, 413, 414
Akaike information criterion AIC
381, 387
cardinality in 30, 110, 181, 192,
223, 226, 232, 242, 285, 286, 301, 314,
334, 379, 434
classification vs.
collections 2, 4, 6, 9, 11, 12,
15, 17, 18, 25, 30, 34, 44, 45, 47, 52,
58, 60, 65, 70, 72, 73, 74, 75, 77, 78,
79, 80, 81, 85, 90, 91, 92, 93, 94, 95,
96, 97, 98, 99, 100, 101, 106, 108, 109,
111, 112, 116, 117, 119, 121, 124, 125,
128, 129, 130, 131, 135, 136, 137, 138,
144, 148, 151, 152, 158, 159, 160, 161, 162, 165, 167, 168, 169, 170, 171, 173,
174, 175, 176, 178, 181, 184, 185, 188,
189, 191, 192, 194, 196, 198, 199, 200,
204, 207, 208, 210, 212, 213, 214, 216,
217, 219, 220, 224, 225, 232, 234, 235,
237, 238, 239, 240, 241, 242, 244, 246,
255, 256, 258, 259, 260, 261, 262, 264,
268, 271, 277, 288, 292, 293, 296, 298,
300, 304, 307, 311, 312, 314, 315, 328,
358, 359, 365, 366, 367, 368, 370, 371,
373, 377, 383, 388, 392, 395, 397, 400,
409, 412, 413, 415, 418, 424, 427, 428,
431, 434, 457, 467
defined 2, 3, 4, 17, 22, 25, 26,
29, 42, 52, 55, 61, 62, 63, 79, 85, 90,
92, 94, 100, 104, 110, 118, 121, 125,
127, 133, 145, 160, 161, 162, 163, 165,
166, 167, 168, 170, 173, 180, 188, 195,
204, 208, 211, 213, 215, 216, 217, 219,
220, 221, 230, 234, 235, 237, 259, 262,
264, 271, 272, 276, 280, 283, 284, 287,
290, 292, 300, 301, 307, 308, 317, 318, 319, 321, 324, 325, 327, 329, 334, 335,
336, 341, 344, 345, 346, 352, 353, 358,
359, 365, 368, 369, 371, 372, 374, 377,
379, 382, 383, 386, 387, 392, 393, 396,
397, 400, 404, 405, 407, 409, 412, 413,
424, 427, 437, 438, 443, 446, 450, 451,
465, 479, 480, 481, 487, 492
distortion 380, 381, 395
evaluation of 7, 46, 68, 145, 155, 157, 158, 159, 160, 161, 163, 165,
167, 169, 170, 171, 173, 174, 175, 176,
177, 180, 181, 182, 193, 194, 200, 203,
206, 212, 219, 221, 222, 225, 226, 233,
260, 267, 271, 294, 295, 297, 298, 301,
324, 347, 353, 360, 363, 365, 369, 370,
371, 372, 374, 386, 388, 415, 432, 445,
449, 471
exhaustive 369, 409
Expectation-Maximization algorithm
EM 259, 365, 382, 383, 384,
385, 386, 387, 389, 392
expectation step E 383, 384, 387
external criterion of quality
HAC vs.
internal criterion of quality
K-means
K-medoids 415
in language models
maximization step M 383, 384,
387
model complexity 380,
381
normalized mutual information
371
objective functions 341,
369, 370, 375, 377, 379, 380, 382
outliers 334, 341, 377, 378, 396, 399
partitional 369
purity 371, 372, 373, 374, 386
Rand index, adjusted
388
residual sum of squares
374, 380, 395
scatter-gather 366, 388
search result 7, 17, 32, 82, 84, 222
seeds 375, 376, 377, 378, 384,
386, 389, 413, 457
singleton 378
soft 3, 23, 33, 36, 87, 96, 139,
150, 329, 340, 341, 342, 347, 365, 369,
383, 384, 385, 386, 387, 389, 392, 415,
424, 431, 432, 446
unsupervised learning
F measure 163, 164, 165, 168,
180, 221, 371, 374, 388
Focused retrieval 222
Free text
Free text query
parsing functions
designing 49, 53, 70, 71, 77,
80, 87, 96, 100, 150, 151, 152, 158,
171, 176, 178, 179, 209, 212, 213, 224,
244, 254, 307, 329, 435, 446, 456, 457,
458, 463, 466, 475
tokenization 8, 22, 25, 26,
27, 28, 34, 42, 95, 152, 230, 314
in vector retrieval models
234
Frequency-based feature selection
292
Frobenius norm F 425, 426, 430
Front coding 98, 99, 100,
109
Front queues 464, 465, 466,
474
Functional margins 336
G
GAAC Group-average agglomerative
clustering 403, 404, 405, 406, 407,
408, 409, 412, 413, 414, 415
encoding 103, 104, 105,
106, 107, 108, 109, 110, 111
Gaps, encoding 468, 470
Generative model 278,
327
Geometric margin 337
Global champion list
Gold standard 159, 370, 371
Golomb codes Golomb 112
GOV2 collection GOV2
Greedy feature selection
Grepping
Ground truth 159
Group-average agglomerative clustering
391, 400, 403
Group-average clustering
403, 405, 413
H
HAC hierarchical agglomerative
clustering (HAC) 392, 393, 395,
396, 399, 400, 401, 403, 404, 405, 406,
408, 409, 410, 412, 413, 414
Hard assignment 365, 381, 384
Hard clustering 365, 369, 383, 386
Harmonic numbers 106
Hashing 14, 52, 53, 55, 66, 76,
99, 331, 450, 451, 461, 467, 470
Heaps' law 93, 111, 315
Held-out data 298, 313
Hierarchical agglomerative clustering
(HAC) 392, 393,
395, 396, 399, 400, 401, 403, 404, 405, 406, 408, 409, 410, 412, 413, 414
algorithm comparison
best-merge persistence
400, 402, 404, 405, 413
Buckshot algorithm Buckshot
378, 414
centroids 188, 189, 190, 192,
304, 306, 307, 308, 309, 310, 311, 314,
327, 329, 331, 369, 374, 375, 376, 377, 378, 379, 381, 382, 384, 386, 388, 391,
392, 396, 400, 401, 404, 405, 406, 407,
409, 411, 412, 413, 414, 415, 432
chaining in 399, 409
cliques 246, 398
cluster-internal labeling
combination similarity
complete-link clustering 391, 396, 397, 398, 399, 402, 403,
404, 408
connected components
398
dendrograms 393, 394, 397,
398, 406, 413
differential cluster labeling
410
divisive 391, 392, 409, 410,
415
first story detection
409, 414
flat vs. group-average 391, 392,
396, 400, 401, 403, 404, 405, 409, 413,
414, 415
inversions 393, 405, 406, 407,
409
monotonicity 393, 406, 413
next-best merge (NBM) arrays
novelty detection 388, 409
optimality 13, 301, 314,
391, 392, 407, 408
outliers 334, 341, 377, 378,
396, 399
overview 19, 455, 456
priority queue algorithm
401
single-link clustering 394, 396, 397, 398, 399, 400, 402, 408,
409, 413, 414, 453
time complexity 12,
13, 14, 17, 39, 61, 75, 77, 81, 276, 277,
278, 299, 300, 311, 314, 315, 329, 331,
342, 343, 378, 379, 386, 392, 399, 400,
403, 404, 405, 409, 410, 412, 413, 414
top-down 211, 392, 393,
409, 410
Hierarchical classification
351, 360
Hierarchical clustering
agglomerative hierarchical
agglomerative clustering (HAC)
applications 6, 10, 18, 23, 37,
44, 70, 72, 74, 77, 83, 86, 87, 96, 102,
108, 118, 123, 137, 139, 143, 144, 145,
151, 152, 158, 160, 168, 169, 173, 174,
176, 181, 189, 191, 192, 193, 194, 198,
204, 205, 208, 213, 215, 221, 224, 226,
243, 252, 253, 254, 257, 258, 264, 268, 269, 271, 272, 275, 280, 286, 297, 298,
299, 305, 310, 315, 327, 331, 334, 344,
346, 347, 348, 349, 350, 351, 352, 360,
361, 363, 365, 366, 367, 368, 370, 375,
381, 382, 383, 384, 386, 387, 388, 389,
392, 394, 395, 407, 408, 409, 410, 413,
414, 418, 425, 432, 434, 436, 437, 448,
450, 451, 453, 454, 457, 465, 467, 468
defined 2, 3, 4, 17, 22, 25, 26, 29, 42, 52, 55, 61, 62, 63, 79, 85, 90,
92, 94, 100, 104, 110, 118, 121, 125,
127, 133, 145, 160, 161, 162, 163, 165,
166, 167, 168, 170, 173, 180, 188, 195,
204, 208, 211, 213, 215, 216, 217, 219,
220, 221, 230, 234, 235, 237, 259, 262,
264, 271, 272, 276, 280, 283, 284, 287,
290, 292, 300, 301, 307, 308, 317, 318,
319, 321, 324, 325, 327, 329, 334, 335, 336, 341, 344, 345, 346, 352, 353, 358,
359, 365, 368, 369, 371, 372, 374, 377,
379, 382, 383, 386, 387, 392, 393, 396,
397, 400, 404, 405, 407, 409, 412, 413,
424, 427, 437, 438, 443, 446, 450, 451,
465, 479, 480, 481, 487, 492
probabilistic interpretation of
R environment support for R
Hierarchical Dirichlet processes HDP
Hierarchy 272, 351, 392,
394, 395, 409, 410, 415, 434, 435
Highlighting
HITS (hyperlink-induced topic search)
474, 489, 490,
491, 493
Host splitters 461
HTML 24, 178, 205,
224, 434, 435, 438, 440, 446, 453, 460,
474, 475
http 16, 27, 36, 70,
86, 87, 161, 180, 181, 185, 186, 201,
340, 366, 367, 370, 388, 415, 432, 434,
435, 440, 446, 448, 459, 460, 475
Hub score 487
Hyperlink-induced topic search (HITS)
Hyperlinks Link analysis
Hyphenation and tokenization 3, 438, 439, 474, 487
I
Ide dec-hi 190, 200
IDF Inverse document frequency
(IDF) 190
IID Independent and identically
distributed (IID)
Images, searching for
Relevance feedback 177,
183, 184, 185, 186, 187, 188, 189, 190,
191, 192, 193, 194, 195, 200, 193, 196,
193, 200, 184, 185, 186, 187, 188, 196,
188, 189, 190, 191, 192, 193, 194, 193,
194, 195, 196, 200, 230, 234, 236, 239,
240, 241, 242, 244, 246, 230, 234, 236,
195, 240, 241, 242, 244, 246, 230, 234,
236, 239, 240, 241, 242, 244, 246, 230,
234, 236, 239, 240, 241, 242, 244, 246,
262, 263, 264, 262, 263, 264, 262, 263,
264, 262, 263, 239, 309, 310, 311, 330,
309, 310, 311, 330, 309, 310, 311, 330,
309, 310, 311, 330, 194, 195, 196, 200,
194, 195, 264
Impact ordering 84, 147
Implicit relevance feedback
177, 193, 195
Incidence matrix 5, 108, 109
Independence 233, 234, 235,
242, 243, 247, 274, 281, 282, 283, 285,
290, 291, 292, 293, 299, 301, 321
Independent and identically distributed (IID)
Index construction
BSBI 73, 74, 75, 76, 77, 78, 86
distributed indexes 69,
70, 77, 78, 80, 86, 455, 456, 466, 471
resources 3, 81, 87, 96, 180,
195, 197, 205, 264, 434, 457
Indexer 70, 78, 111, 152,
153, 180, 440, 445, 457, 459
Indexes 1, 4, 5, 7, 8, 9, 10, 11,
13, 17, 18, 19, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31, 34, 36, 38, 39, 40, 42,
43, 44, 45, 46, 47, 48, 49, 52, 55, 56,
57, 58, 59, 62, 63, 64, 65, 66, 67, 69,
70, 72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 89, 90, 91,
92, 95, 99, 102, 105, 106, 107, 108,
109, 110, 111, 112, 113, 115, 116, 117, 118, 119, 124, 128, 129, 130, 131, 138,
142, 144, 145, 146, 147, 150, 151, 152,
153, 154, 155, 158, 161, 175, 178, 179,
180, 184, 196, 197, 199, 204, 205, 210,
211, 212, 216, 218, 224, 225, 230, 268,
269, 315, 331, 365, 367, 368, 377, 412,
417, 418, 427, 431, 433, 434, 435, 436,
437, 440, 445, 446, 447, 448, 449, 450,
451, 453, 454, 455, 456, 457, 459, 466, 467, 468, 470, 471, 475, 490
biword 42, 43, 44, 46, 155
defined 2, 3, 4, 17, 22, 25, 26,
29, 42, 52, 55, 61, 62, 63, 79, 85, 90,
92, 94, 100, 104, 110, 118, 121, 125,
127, 133, 145, 160, 161, 162, 163, 165,
166, 167, 168, 170, 173, 180, 188, 195,
204, 208, 211, 213, 215, 216, 217, 219,
220, 221, 230, 234, 235, 237, 259, 262,
264, 271, 272, 276, 280, 283, 284, 287,
290, 292, 300, 301, 307, 308, 317, 318,
319, 321, 324, 325, 327, 329, 334, 335,
336, 341, 344, 345, 346, 352, 353, 358, 359, 365, 368, 369, 371, 372, 374, 377,
379, 382, 383, 386, 387, 392, 393, 396,
397, 400, 404, 405, 407, 409, 412, 413,
424, 427, 437, 438, 443, 446, 450, 451,
465, 479, 480, 481, 487, 492
document-partitioned
k-gram k-gram 48, 57, 65
next word 46, 98
parametric 40, 47, 49, 61, 70,
71, 75, 80, 85, 93, 105, 112, 115, 116,
117, 144, 145, 148, 153, 159, 160, 173,
174, 177, 189, 206, 242, 244, 245, 247,
254, 257, 258, 259, 263, 273, 274, 276,
278, 280, 281, 282, 283, 284, 287, 289,
298, 299, 300, 301, 305, 309, 311, 314,
316, 317, 318, 323, 327, 330, 331, 337,
341, 347, 353, 354, 357, 359, 361, 379, 381, 382, 383, 384, 385, 386, 387, 477
permuterm 56, 57, 58, 59, 62,
67
positional 7, 8, 10, 12, 21, 24,
29, 39, 40, 41, 43, 44, 45, 46, 47, 48,
71, 72, 80, 83, 90, 91, 92, 95, 98, 100,
113, 116, 152, 153, 165, 167, 170, 178,
179, 215, 230, 254, 274, 278, 280, 281,
282, 283, 285, 299, 318, 328, 334, 335, 350, 354, 366, 376, 434, 443, 444, 461,
469, 477, 479
size/estimation
term-partitioned
zone 71, 116, 117, 118, 119,
120, 123, 148, 178, 206, 304, 305, 306,
307, 308, 310, 312, 319, 320, 322, 329,
353, 354, 355, 357, 396, 402
Indexing
defined 2, 3, 4, 17, 22, 25, 26,
29, 42, 52, 55, 61, 62, 63, 79, 85, 90,
92, 94, 100, 104, 110, 118, 121, 125,
127, 133, 145, 160, 161, 162, 163, 165, 166, 167, 168, 170, 173, 180, 188, 195,
204, 208, 211, 213, 215, 216, 217, 219,
220, 221, 230, 234, 235, 237, 259, 262,
264, 271, 272, 276, 280, 283, 284, 287,
290, 292, 300, 301, 307, 308, 317, 318,
319, 321, 324, 325, 327, 329, 334, 335,
336, 341, 344, 345, 346, 352, 353, 358,
359, 365, 368, 369, 371, 372, 374, 377,
379, 382, 383, 386, 387, 392, 393, 396, 397, 400, 404, 405, 407, 409, 412, 413,
424, 427, 437, 438, 443, 446, 450, 451,
465, 479, 480, 481, 487, 492
distributed 3, 69, 70, 77,
78, 79, 80, 86, 455, 456, 457, 459, 461,
462, 466, 467, 471
granularity 24, 25, 34, 103,
124, 173, 174, 222, 255, 388, 437
latent semantic 92, 199, 368
unit defined
INEX 174, 206, 219, 220, 222, 223,
225, 227
Informational queries 444,
445
Information gain 300, 301,
410, 415
Information need 2, 3, 6, 7, 16, 17, 135, 158, 159, 160, 166, 167,
168, 169, 170, 171, 174, 175, 178, 185,
193, 196, 210, 220, 230, 234, 245, 246,
257, 262, 268, 365, 368
Information retrieval 1, 2,
3, 4, 5, 7, 25, 26, 70, 83, 89, 90, 91, 94,
95, 103, 120, 6, 126, 18, 30, 84, 113,
135, 149, 141, 157, 158, 160, 161, 162,
171, 173, 184, 185, 189, 191, 196, 209,
220, 247, 249, 250, 253, 268, 363, 365,
441
hardware issues 78, 87
history of 253, 254, 294, 300, 393, 407, 433, 434, 439, 443, 445, 465,
484
overview 19, 455, 456
search system components
terms, statistical properties of
89, 91
In-links 438, 474, 477, 482
Inner product Dot products 127, 128, 130, 131, 136, 306,
308, 329, 339, 342, 343, 345, 346, 402,
403, 404, 412
Instance-based learning
314
Internal criterion of quality
Interpolated precision
165, 166, 167, 170, 171
Intersection, postings list
Inter-similarity
Inverse document frequency (IDF)
217
Inversions 393, 405, 406, 407,
409
defined 2, 3, 4, 17, 22, 25, 26,
29, 42, 52, 55, 61, 62, 63, 79, 85, 90, 92, 94, 100, 104, 110, 118, 121, 125,
127, 133, 145, 160, 161, 162, 163, 165,
166, 167, 168, 170, 173, 180, 188, 195,
204, 208, 211, 213, 215, 216, 217, 219,
220, 221, 230, 234, 235, 237, 259, 262,
264, 271, 272, 276, 280, 283, 284, 287,
290, 292, 300, 301, 307, 308, 317, 318,
319, 321, 324, 325, 327, 329, 334, 335,
336, 341, 344, 345, 346, 352, 353, 358,
359, 365, 368, 369, 371, 372, 374, 377,
379, 382, 383, 386, 387, 392, 393, 396,
397, 400, 404, 405, 407, 409, 412, 413,
424, 427, 437, 438, 443, 446, 450, 451, 465, 479, 480, 481, 487, 492
in HAC HAC 393, 395, 396,
399, 400, 403, 404, 405, 408, 409, 410,
412, 413, 414
Inverted file Inverted
index; Postings list 7, 100, 146
Inverted index 1, 4, 7, 8, 9,
10, 11, 17, 19, 22, 40, 48, 52, 55, 56,
57, 58, 66, 67, 70, 72, 73, 74, 75, 76, 77, 79, 82, 84, 87, 90, 92, 105, 106,
107, 108, 109, 111, 113, 117, 119, 130,
142, 144, 147, 152, 153, 196, 204, 205,
218, 315, 331, 367, 368, 412, 435, 468
Boolean query processing
building principles
described 61, 142, 144, 160,
168, 198, 204, 205, 207, 209, 222, 224, 231, 242, 260, 275, 285, 301, 310, 320,
322, 324, 349, 363, 367, 368, 378, 381,
396, 404, 412, 432, 454, 468, 471, 474,
475, 476, 491
encoding 103, 104, 105,
106, 107, 108, 109, 110, 111
kNN classification in kNN
Inverter 79, 80, 85
IP address IP 27, 462
J
Jaccard coefficient 64,
65, 451, 452, 453
Japanese 28, 34, 35, 48, 55, 491
Journal influence weight
492
K
Kappa statistic 172,
173, 175, 181
Kernel function 346
Kernels 7, 9, 36, 49, 61, 62, 73,
196, 197, 204, 209, 268, 340, 345, 346, 347, 348, 359, 360, 419, 487
Mercer Mercer 346
polynomial 254, 256, 257,
258, 259, 260, 273, 274, 275, 276, 277,
278, 279, 280, 281, 282, 283, 284, 285,
286, 289, 290, 292, 293, 299, 300, 321,
329, 346, 347, 419
quadratic 82, 105, 323, 346
radial basis functions 346, 347
Kernel trick 345
Keys 5, 18, 30, 151, 154,
171, 178, 179, 192, 209, 213, 220, 222,
270, 352, 353, 354, 366, 367, 435, 439,
440, 442, 443, 444, 449
Key-value pairs 78, 79, 80, 85
Keyword-in-context (KWIC) snippets
k-gram index k-gram 48, 57, 65
described 61, 142, 144, 160,
168, 198, 204, 205, 207, 209, 222, 224,
231, 242, 260, 275, 285, 301, 310, 320,
322, 324, 349, 363, 367, 368, 378, 381,
396, 404, 412, 432, 454, 468, 471, 474,
475, 476, 491
spelling correction in 51,
52, 59, 60, 62, 63, 64, 65, 67, 68, 83, 85, 152, 153, 184, 191
word matching in 17, 59
K-means
K-medoids 415
k nearest neighbor classification (kNN)
algorithm k 311
Bayes error rate 316,
331
bias in 254, 264, 286, 303,
305, 323, 324, 325, 326, 327, 328, 331,
335, 349
decision boundaries 307,
317, 319, 320, 326, 327, 328, 329, 334, 336, 337, 339, 341, 347, 357, 359
described 61, 142, 144, 160,
168, 198, 204, 205, 207, 209, 222, 224,
231, 242, 260, 275, 285, 301, 310, 320,
322, 324, 349, 363, 367, 368, 378, 381,
396, 404, 412, 432, 454, 468, 471, 474,
475, 476, 491
effectiveness 7, 28, 33, 37,
38, 40, 42, 46, 48, 49, 68, 92, 110, 111, 112, 127, 135, 158, 160, 161, 162, 163,
166, 168, 173, 177, 180, 181, 182, 185,
189, 190, 191, 192, 193, 194, 199, 200,
219, 222, 224, 232, 247, 257, 258, 260,
261, 262, 263, 272, 279, 283, 284, 289,
290, 292, 293, 295, 296, 297, 298, 301,
313, 315, 317, 320, 322, 323, 328, 331,
334, 347, 349, 350, 351, 352, 353, 354,
360, 361, 364, 365, 367, 368, 370, 381, 388, 399, 420, 431, 458, 490
instance-based learning
314
memory-based learning
314
memory capacity 327,
335, 336
multinomial Naive Bayes vs.
256, 273
as nonlinear classification
303, 305, 316, 320, 323, 324
testing/training capacity
time complexity/optimality
variance 254, 286, 303, 305,
323, 324, 325, 326, 327, 328, 331, 335,
381, 382, 383, 387, 414
Voronoi tessellation Voronoi
312, 329
KNN classification KNN K
nearest neighbor classification kNN K
311
Kruskal's algorithm
Kullback-Leibler divergence KL
264, 330, 387
KWIC (keyword-in-context)
L
Labeling 391, 410, 411, 412,
415
of clusters 365, 391, 392,
410, 411, 412, 415
defined 2, 3, 4, 17, 22, 25, 26,
29, 42, 52, 55, 61, 62, 63, 79, 85, 90,
92, 94, 100, 104, 110, 118, 121, 125,
127, 133, 145, 160, 161, 162, 163, 165,
166, 167, 168, 170, 173, 180, 188, 195,
204, 208, 211, 213, 215, 216, 217, 219,
220, 221, 230, 234, 235, 237, 259, 262,
264, 271, 272, 276, 280, 283, 284, 287,
290, 292, 300, 301, 307, 308, 317, 318,
319, 321, 324, 325, 327, 329, 334, 335,
336, 341, 344, 345, 346, 352, 353, 358,
359, 365, 368, 369, 371, 372, 374, 377,
379, 382, 383, 386, 387, 392, 393, 396,
397, 400, 404, 405, 407, 409, 412, 413,
424, 427, 437, 438, 443, 446, 450, 451,
465, 479, 480, 481, 487, 492
Language of an automaton
250, 251, 252
Language identification 34,
48
Language issues relevance
feedback 177, 183, 184,
185, 186, 187, 188, 189, 190, 191, 192,
193, 194, 195, 200, 193, 196, 193, 200,
184, 185, 186, 187, 188, 196, 188, 189,
190, 191, 192, 193, 194, 193, 194, 195,
196, 200, 230, 234, 236, 239, 240, 241,
242, 244, 246, 230, 234, 236, 195, 240, 241, 242, 244, 246, 230, 234, 236, 239,
240, 241, 242, 244, 246, 230, 234, 236,
239, 240, 241, 242, 244, 246, 262, 263,
264, 262, 263, 264, 262, 263, 264, 262,
263, 239, 309, 310, 311, 330, 309, 310,
311, 330, 309, 310, 311, 330, 309, 310,
311, 330, 194, 195, 196, 200, 194, 195,
264
Language models 67, 139, 222, 226, 243, 247, 249, 250, 251, 252,
253, 254, 255, 256, 258, 260, 261, 262,
263, 264, 265, 277, 368, 382
Bayesian smoothing 265
BIM/XML vs. 230
clustering in
defined 2, 3, 4, 17, 22, 25, 26,
29, 42, 52, 55, 61, 62, 63, 79, 85, 90,
92, 94, 100, 104, 110, 118, 121, 125, 127, 133, 145, 160, 161, 162, 163, 165,
166, 167, 168, 170, 173, 180, 188, 195,
204, 208, 211, 213, 215, 216, 217, 219,
220, 221, 230, 234, 235, 237, 259, 262,
264, 271, 272, 276, 280, 283, 284, 287,
290, 292, 300, 301, 307, 308, 317, 318,
319, 321, 324, 325, 327, 329, 334, 335,
336, 341, 344, 345, 346, 352, 353, 358,
359, 365, 368, 369, 371, 372, 374, 377, 379, 382, 383, 386, 387, 392, 393, 396,
397, 400, 404, 405, 407, 409, 412, 413,
424, 427, 437, 438, 443, 446, 450, 451,
465, 479, 480, 481, 487, 492
distributions multinomial
254, 256, 257, 258, 259, 260,
273, 274, 275, 276, 277, 278, 279, 280,
281, 282, 283, 284, 285, 286, 289, 290,
292, 293, 299, 300, 321, 329, 346, 347,
419
document likelihood
263, 264
extended approaches 119, 197, 198, 201, 230, 250, 361, 432
finite automata and
250, 251
Kullback-Leibler divergence KL
264, 330, 387
likelihood ratio 252, 254,
255, 264
linear interpolation 258,
259, 265
overview 19, 455, 456
query likelihood 249,
250, 255, 261, 263, 264, 432
tf-idf weighting vs. tf-idf
translation 2, 5, 35, 59, 184,
200, 253, 256, 264, 386, 450, 490
types of 2, 13, 27, 30, 39, 42,
45, 48, 69, 83, 116, 123, 160, 206, 208, 209, 212, 220, 224, 237, 256, 262, 268,
269, 271, 323, 326, 348, 349, 350, 358,
359, 369, 371, 383, 392, 447, 452
Laplace smoothing 275
Latent Dirichlet Allocation (LDA) LDA
Latent semantic analysis (LSA)
92
Latent semantic indexing (LSI) 199, 368
LDA (Latent Dirichlet Allocation) LDA
L2 distance L2
Learning algorithm described
Weighted zone scoring
Learning error
Learning method 22, 42,
121, 152, 155, 174, 226, 270, 271, 272,
273, 298, 300, 316, 323, 324, 325, 326,
327, 328, 333, 334, 343, 347, 348, 350,
355, 361, 482
Lemma 36, 37
Lemmatization described
Lemmatizer 37, 92
Length-normalization
Levenshtein distance Levenshtein
Lexicalized subtrees 215,
216, 225
Lexicons in inverted indexes
52
Likelihood 40, 43, 53, 61, 103,
117, 134, 148, 231, 232, 236, 238, 240,
249, 250, 252, 254, 255, 256, 261, 263,
264, 274, 281, 292, 313, 322, 326, 328,
330, 381, 382, 383, 384, 387, 432, 464
Likelihood ratio 252, 254,
255, 264
Linear algebra review
Linear classifiers 303, 305, 316, 317, 318, 319, 320, 321, 322,
323, 324, 326, 328, 331, 336, 344, 345,
350, 356, 357, 358
Linear interpolation 258,
259, 265
Linear problem 319, 320,
326, 342
Linear separability 317, 319,
320, 328, 330, 333, 334, 341, 344, 345, 359
Link analysis 441, 467, 473,
474, 476, 489, 490, 493
anchor text 264, 411, 412,
454, 459, 474, 475, 476, 483, 489, 492
authority score authority 487,
488, 489, 490, 491, 492
ergodic Markov chain
HITS 474, 489, 490, 491, 493
hub score 487
Markov chains 477, 478,
479, 480, 481, 482, 484, 485, 486, 492
overview 19, 455, 456
PageRank PageRank 485,
486
steady-state theorem
Link farms 493
Link spam
LLRUN 112
LM Language
models 67, 139, 222, 226, 243, 247, 249, 250, 251, 252, 253, 254,
255, 256, 258, 260, 261, 262, 263, 264,
265, 277, 368, 382
Logarithmic merging 82, 83,
86
Lossless compression 92
Lossy compression 92
Lovins stemmer Lovins
Low-rank approximation 417, 418, 420, 422, 424, 425, 426, 427,
430, 432
LSA latent semantic analysis
92
LSI latent semantic indexing
199, 368
M
Machine-learned relevance described
Machine learning methods
22, 42, 152, 155, 174, 226, 298,
333, 334, 347, 348, 350, 355, 361, 482
Machine translation 253,
264, 386
Macroaveraging 296
MAP mean average precision
167
Map phase map
MapReduce 77, 78, 79, 80, 85, 86
Marginal relevance 222
Marginal statistic 172
Margins 83, 92, 209, 269, 323, 336, 337, 338, 339, 341, 347, 463, 464,
468, 470
Markov chains 477, 478,
479, 480, 481, 482, 484, 485, 486, 492
Master node 77, 78, 79, 80
Matrix decomposition 417,
418, 420, 421, 424, 432
eigen 5, 27, 33, 137, 150, 152,
267, 271, 276, 279, 283, 284, 286, 287, 288, 289, 290, 291, 292, 293, 298, 299,
300, 301, 305, 316, 319, 320, 322, 323,
327, 328, 331, 334, 336, 339, 342, 343,
344, 345, 347, 348, 351, 352, 353, 354,
355, 356, 358, 359, 360, 361, 410, 412,
418, 419, 420, 421, 422, 423, 425, 427,
434, 476, 478, 480, 488, 489, 490, 491,
492
eigenvalues 418, 419, 420, 421, 422, 423, 425, 427, 478, 480, 488
Frobenius norm
latent semantic indexing
199, 368
linear algebra review
low-rank approximation
417, 418, 420, 422, 424, 425, 426, 427,
430, 432
reduced SVD 423, 426
singular value 419, 421,
423, 424, 425, 426, 428, 429
symmetric diagonal
422, 423
theorems 231, 233, 234, 280,
414, 421, 422, 423, 424, 425, 432, 451,
452, 480, 481
truncated SVD SVD 423,
426
Maximization step M 383, 384, 387
Maximum a posteriori
Maximum likelihood estimate (MLE)
238, 240, 256, 274, 384, 387
Mean average precision
167
Medoids 379, 389, 415
Memory-based learning
314
Memory capacity 327, 335,
336
Mercator crawler 458, 470
Mercer kernels Mercer 346
Merge algorithm 12, 13,
14, 15, 21, 39, 40, 41, 45, 47, 86, 153,
404
Merge postings list
13, 14, 39, 40, 41, 44, 46, 47, 142, 153
Metadata 27, 116, 152, 153,
178, 388
Microaveraging 296, 299,
348
Minimum spanning tree
413, 414
Minimum variance clustering
ModApte split
Model-based clustering
Model complexity 380, 381
Monotonicity 393, 406, 413
Multiclass classification
321, 343
Multiclass SVMs SVM 360
Multilabel classification
322, 323, 332
Multimodal class
Multinomial classification
299
Multinomial model 254,
260, 278, 279, 280, 281, 282, 283, 284,
285, 289, 290, 292, 293, 300
Multinomial Naive Bayes
256, 273
Bernoulli model 259,
260, 267, 278, 279, 280, 281, 282, 283,
284, 285, 287, 289, 290, 292, 300, 383,
384
bias in 254, 264, 286, 303,
305, 323, 324, 325, 326, 327, 328, 331,
335, 349
concept drift 284, 285,
298, 301
conditional independence assumption
235, 281, 282, 283
as linear classifier
optimal classifier 316
positional independence assumption
274, 282, 283, 285,
299
properties 3, 23, 30, 89, 90,
91, 94, 178, 234, 314, 323, 327, 381,
433, 436, 441, 458, 468
in query likelihood models
249, 250, 255, 261, 264, 432
random variables X and U
X U
semi-supervised learning
sparseness 7, 213, 219, 254,
257, 275, 277, 282, 352, 368
testing/training capacity
in text classification 158,
160, 161, 260, 267, 268, 269, 270, 271,
272, 273, 276, 277, 280, 284, 285, 286,
292, 293, 294, 295, 297, 298, 300, 303,
301, 304, 301, 303, 304, 305, 309, 310,
311, 315, 316, 319, 323, 324, 325, 328,
331, 333, 334, 341, 343, 347, 348, 350,
351, 352, 353, 354, 355, 360, 361, 382
variance 254, 286, 303, 305, 323, 324, 325, 326, 327, 328, 331, 335,
381, 382, 383, 387, 414
Multinomial NB
Multinomial Naive Bayes 273, 274,
275, 277, 281, 299
Multivalue classification
321
Multivariate Bernoulli model
Mutual information 286,
287, 288, 289, 292, 294, 300, 301, 352,
354, 371, 372, 386, 411
N
Naive Bayes assumption
Naive Bayes learning method
Multinomial Naive
Bayes; Multivariate Bernoulli model
271
Named entity tagging
352
National Institute of Standards and
Technology
Natural language processing
37, 178, 262
issues in 22, 37, 175, 415
lemmatizers in 35, 36, 37,
48, 92
text summarization 178,
181, 354
XML retrieval XML
Navigational queries 444,
445
NDCG (normalized discounted cumulative
gain) 169
Near-duplicate search results
Nested elements 212
NEXI 208, 209, 210, 227
Next-best merge (NBM) arrays
Next word index 46
N-gram language model N
Bigram language model
Unigram language model
Nibble 4 103,
112
NLP Natural language processing
37, 178, 262
NMI Normalized mutual information (NMI)
371, 372, 373, 374, 388
Noise documents 319, 326,
327, 328, 341
Noise feature 284, 286, 287,
316, 319
Nonlinear classifiers
303, 316, 320, 323, 324
Nonlinear problem 320, 326, 342
Normalization
in probability theory 104,
229, 230, 231, 240, 243, 247, 257, 262
term 3, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 21, 22, 25,
26, 27, 29, 30, 31, 32, 34, 37, 38, 40,
41, 42, 43, 44, 45, 46, 52, 53, 54, 55,
56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 72, 73, 74, 75, 76, 77, 78, 79,
80, 81, 82, 83, 84, 85, 86, 89, 90, 91,
92, 93, 94, 95, 96, 97, 98, 99, 100, 106,
107, 108, 109, 110, 111, 112, 113, 115,
116, 118, 119, 121, 123, 124, 125, 126,
127, 129, 130, 131, 132, 133, 134, 47,
135, 134, 135, 137, 138, 139, 142, 143,
144, 145, 146, 147, 149, 150, 152, 153,
154, 155, 171, 178, 179, 184, 187, 190,
191, 192, 195, 196, 197, 198, 199, 200,
208, 210, 212, 213, 214, 215, 216, 217,
218, 219, 225, 226, 230, 233, 234, 235,
236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 250, 251, 252, 253, 254,
256, 257, 260, 261, 262, 263, 264, 273,
274, 275, 276, 278, 279, 280, 281, 282,
283, 284, 285, 286, 287, 288, 289, 290,
291, 292, 293, 294, 298, 304, 309, 311,
314, 315, 318, 329, 343, 347, 352, 355,
359, 365, 366, 367, 368, 379, 382, 383,
384, 385, 410, 411, 412, 415, 417, 418,
422, 423, 424, 425, 427, 428, 429, 430, 431, 432, 439, 444, 448, 450, 459, 466,
467, 471, 475, 476, 490
tf weighting 123, 131
URL 27, 434, 437, 453, 457, 458,
459, 460, 461, 462, 463, 464, 465, 466,
467, 468, 469, 470, 477
Normalized discounted cumulative gain
(NDCG)
Normalized mutual information (NMI)
Normalized tokens in inverted indexes
Normal vectors 308, 317, 335
Novelty detection 388,
409
NTCIR collection
O
Objective function 341, 369,
370, 375, 377, 379, 380, 382
Odds 232
Odds ratio
Okapi BM25 weighting
230
1/0 loss 233
One-of classification 321,
323, 331
One-versus-all (OVA) classification
343
Optimal classifier 316
Optimal clustering
Optimal learning method
Optimal weight
Ordering 4, 8, 9, 10, 13, 15, 17,
18, 30, 53, 55, 58, 59, 69, 70, 72, 73,
74, 75, 76, 77, 79, 83, 84, 85, 86, 96,
98, 111, 113, 116, 119, 120, 125, 126,
130, 133, 134, 141, 142, 143, 145, 146,
147, 149, 150, 151, 153, 154, 155, 158,
167, 169, 170, 173, 177, 178, 181, 195,
205, 209, 213, 217, 223, 224, 226, 229, 230, 232, 233, 234, 235, 237, 239, 246,
250, 252, 255, 256, 259, 261, 262, 264,
270, 277, 292, 322, 330, 334, 355, 357,
358, 359, 361, 368, 401, 435, 436, 442,
443, 448, 451, 453, 459, 468, 469, 474,
476, 482, 487, 490, 492
Ordinal regression 358, 361
Outliers 334, 341, 377, 378,
396, 399
Out-links 438, 477, 479
Overfitting 286, 356
Overlap score measure
126
Oxford English Dictionary
92
P
PageRank 485
computation 7, 10, 12, 13, 19,
30, 31, 41, 44, 47, 48, 53, 60, 61, 62,
63, 64, 65, 67, 68, 70, 72, 75, 77, 78,
81, 85, 90, 96, 102, 106, 107, 109, 110,
111, 113, 115, 116, 118, 119, 120, 121,
122, 123, 124, 125, 126, 127, 128, 129,
130, 131, 132, 133, 134, 135, 136, 137,
138, 141, 142, 143, 144, 145, 146, 147,
148, 149, 150, 152, 153, 158, 162, 163,
164, 165, 166, 167, 168, 169, 170, 171,
172, 173, 175, 176, 177, 179, 180, 181,
185, 188, 190, 191, 192, 193, 195, 197,
198, 199, 213, 215, 216, 217, 218, 219,
222, 226, 230, 231, 232, 234, 235, 236, 237, 239, 139, 155, 239, 237, 239, 240,
241, 242, 243, 244, 245, 246, 247, 252,
253, 254, 255, 256, 257, 258, 260, 261,
262, 263, 264, 273, 274, 275, 276, 278,
279, 280, 282, 283, 284, 285, 286, 287,
288, 290, 291, 292, 293, 294, 295, 296,
297, 299, 300, 303, 304, 305, 306, 307,
309, 311, 313, 314, 319, 322, 324, 329,
335, 336, 339, 342, 343, 345, 346, 347, 352, 355, 358, 359, 364, 365, 367, 368,
369, 370, 371, 373, 374, 375, 376, 377,
378, 379, 383, 384, 386, 387, 392, 393,
395, 396, 398, 400, 401, 402, 403, 404,
405, 406, 411, 412, 413, 414, 415, 420,
422, 423, 424, 425, 426, 427, 428, 429,
430, 431, 436, 450, 451, 452, 453, 467,
474, 475, 477, 479, 480, 481, 482, 484,
485, 486, 487, 488, 489, 490, 491, 492, 493
described 61, 142, 144, 160, 168,
198, 204, 205, 207, 209, 222, 224, 231,
242, 260, 275, 285, 301, 310, 320, 322,
324, 349, 363, 367, 368, 378, 381, 396,
404, 412, 432, 454, 468, 471, 474, 475,
476, 491
ergodic Markov chain
479, 480, 484
Markov chains 477, 478,
479, 480, 481, 482, 484, 485, 486, 492
personalized 484, 485, 486
principal left eigenvector
478, 480
probability vectors 478, 479,
480
steady-state theorem
stochastic matrix 478, 489
teleport operation
477, 480, 481, 482, 483, 484, 485, 486,
492, 493
topic-specific 485
Paice stemmer
Paid inclusion 440
Parameter-free compression
Parameterized compression
Parameter tuning
Parameter tying 353
Parametric indexes 115,
116, 117
Parametric search
Parser 78, 79, 80, 151, 152,
153, 199
Parsing functions, designing
Parsing modules
Partitional clustering 391,
409, 410, 415
Partition rule
Passage retrieval 222, 226
Patent databases 204
Performance 18, 25, 37, 49, 68,
70, 71, 103, 112, 131, 158, 159, 160,
168, 190, 193, 194, 195, 224, 233, 242,
245, 246, 254, 258, 259, 262, 263, 265,
282, 289, 295, 298, 301, 322, 334, 348,
349, 353, 432, 457
Permuterm index 56, 57, 58, 59, 62, 67
Personalized PageRank
PageRank 485, 486
Phonetic correction 66
Phrase index 43, 46
Phrase queries 17, 18, 21,
22, 27, 28, 30, 41, 42, 43, 44, 46, 47,
49, 144, 151, 154, 155, 254
Phrase search 44, 113
Pivoted document length normalization
Pivot length
Pointwise mutual information 287, 301
Polytomous classification
321, 343
Polytopes 312
Pooling
Pornography filtering 352
Porter stemmer
38
Positional independence assumption 274, 282, 283, 285,
299
Positional indexes 43, 44,
45, 46, 47, 48, 179
Posterior probability 231,
273, 280
Postfiltering, in k-gram indexes
Postings in block sort-based indexing
compression and 7, 14, 23,
30, 40, 44, 45, 70, 71, 73, 75, 77, 83,
86, 89, 90, 91, 92, 94, 95, 96, 97, 98,
99, 100, 101, 102, 97, 98, 99, 100, 101,
102, 103, 104, 105, 106, 107, 108, 109,
110, 111, 112, 109, 112, 113, 99, 169,
468, 104, 105, 106, 107, 108, 112, 109, 110, 111, 112, 113, 103
defined 2, 3, 4, 17, 22, 25, 26,
29, 42, 52, 55, 61, 62, 63, 79, 85, 90,
92, 94, 100, 104, 110, 118, 121, 125,
127, 133, 145, 160, 161, 162, 163, 165,
166, 167, 168, 170, 173, 180, 188, 195,
204, 208, 211, 213, 215, 216, 217, 219,
220, 221, 230, 234, 235, 237, 259, 262,
264, 271, 272, 276, 280, 283, 284, 287,
290, 292, 300, 301, 307, 308, 317, 318,
319, 321, 324, 325, 327, 329, 334, 335,
336, 341, 344, 345, 346, 352, 353, 358,
359, 365, 368, 369, 371, 372, 374, 377, 379, 382, 383, 386, 387, 392, 393, 396,
397, 400, 404, 405, 407, 409, 412, 413,
424, 427, 437, 438, 443, 446, 450, 451,
465, 479, 480, 481, 487, 492
in inverted indexes 1, 4,
7, 8, 9, 10, 11, 17, 19, 22, 40, 48, 52,
55, 56, 57, 58, 66, 67, 70, 72, 73, 74,
75, 76, 77, 79, 82, 84, 87, 90, 92, 105,
106, 107, 108, 109, 111, 113, 117, 119, 130, 142, 144, 147, 152, 153, 196, 204,
205, 218, 315, 331, 367, 368, 412, 435,
468
positional 7, 8, 10, 12, 21, 24,
29, 39, 40, 41, 43, 44, 45, 46, 47, 48,
71, 72, 80, 83, 90, 91, 92, 95, 98, 100,
113, 116, 152, 153, 165, 167, 170, 178,
179, 215, 230, 254, 274, 278, 280, 281,
282, 283, 285, 299, 318, 328, 334, 335, 350, 354, 366, 376, 434, 443, 444, 461,
469, 477, 479
Postings list
compression of 7, 14, 23, 30,