Lattice-Based Statistical Spoken Document Retrieval
Chia Tee Kiah
Ph.D. thesis
Department of Computer Science, School of Computing, National University of Singapore
Supervisors: A/Prof. Ng Hwee Tou (NUS), Dr. Li Haizhou (I2R)
Outline
- Introduction
- Original contribution
- Background
- Lattice-based SDR under statistical model
- Other SDR methods
- Experiments on SDR with short queries
- Query-by-example SDR
- Conclusion
Introduction: Spoken Document Retrieval
Information retrieval (IR)
- Search for items of data according to the user's info. need
Spoken document retrieval (SDR)
- IR on speech recordings
- Growing in importance — more & more speech data stored: news broadcasts, voice mails, …
SDR more difficult than text IR
- Currently needs automatic speech recognition (ASR)
- 1-best transcripts from ASR are error-prone
- Word error rate for noisy, spontaneous speech may be 50%
Introduction: Lattices
Lattice — connected directed acyclic graph (James & Young 1994; James 1995)
- Each edge labeled with a term hypothesis & probabilities
- Each path gives a hypothesized seq. of terms, & its probability
Use alternative hypotheses to overcome errors in 1-best transcripts — lattice-based SDR
[Figure: example word lattice from t = 0.00 to t = 1.12, with edges between <s> and </s> labeled "and", "it's", "my son's", "mentor", "nice and tender", "to tender"]
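As a toy illustration of how a lattice encodes competing hypotheses, the paths of a small hypothetical lattice (made-up words and probabilities, not the lattice in the figure) can be enumerated along with their probabilities:

```python
# Toy word lattice: a DAG whose edges carry a word hypothesis and a
# probability; every start-to-end path is one hypothesized transcript.
# (Hypothetical example, not the lattice shown in the figure.)

# edges[node] = list of (next_node, word, probability)
edges = {
    0: [(1, "and", 0.6), (1, "an", 0.4)],
    1: [(2, "apple", 0.7), (2, "appal", 0.3)],
}

def paths(node, end):
    """Yield (word_sequence, path_probability) for every path node -> end."""
    if node == end:
        yield [], 1.0
    for nxt, word, p in edges.get(node, []):
        for words, prob in paths(nxt, end):
            yield [word] + words, p * prob

hyps = sorted(paths(0, 2), key=lambda wp: -wp[1])
# Best path is "and apple", with probability 0.6 * 0.7
```

Lattice-based SDR exploits the lower-probability paths as well, instead of keeping only the 1-best path.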
Original Contribution
A method for lattice-based SDR using a statistical IR model (Song & Croft 1999)
- Calculate the expected count of each word in each lattice
- From the counts, estimate statistical lang. models for docs.
- Compute query-doc. relevance as a probability
- Previous lattice-based SDR methods all based on the vector space IR model!
Extension to query-by-example SDR
- SDR where queries are also full-fledged spoken docs.
Presented in EMNLP-CoNLL 2007, SIGIR 2008
Background: Information Retrieval
The task of IR
- Given: doc. collection C, query q giving info. need
- Find list of docs. in C relevant to the info. need
Steps involved
- Before receiving the query: document preprocessing — outputs an index for rapid access
- Upon receiving the query: retrieval — outputs a ranked list of docs.
  - Done by assigning relevance scores; guided by a retrieval model
  - Good IR systems give higher scores to more relevant docs.
[Figure: document preprocessing & retrieval pipeline]
- Tokenization: “… Nevertheless, ‘information retrieval’ has become accepted as a description …” → “… nevertheless information retrieval has become accepted as a description …”
- Stop word removal: → “… information retrieval accepted description …”
- Stemming: → “… inform retriev accept descript …”
- Indexing: index entries such as “document: 336, 624, 864, …”; “inform: 33, 128, 315, …”
- Retrieval: query q = “Euclid’s algorithm” against the index & collection C → ranked list (#1, #2, #3, …), e.g. “… an algorithm for finding the greatest common divisor of two numbers …”
Background: IR: Retrieval Models
Vector space with tf·idf weighting (Salton 1963; Spärck Jones 1972)
- Docs. & queries are Euclidean vectors
- Compute relevance as cosine similarity
- Each vector component d(i), q(i) is a product of:
  - tf(wi, d) — "term frequency": increasing func. of the no. of occurrences c(wi; d) of wi in d
  - idf(wi) — "inverse doc. frequency": decreasing func. of the no. of docs. containing wi
[Figure: document vector d & query vector q separated by angle τ; relevance = cos τ]
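A minimal sketch of tf·idf cosine scoring on a hypothetical three-document collection (raw counts for tf, log(|C|/n_w) for idf; real systems vary in the exact weighting):

```python
# Vector-space retrieval sketch: tf.idf weights + cosine similarity.
# Toy collection; tf = raw count, idf = log(|C| / n_w).
import math
from collections import Counter

docs = [
    "information retrieval on speech data".split(),
    "speech recognition produces transcripts".split(),
    "cooking recipes and kitchen tips".split(),
]

def idf(word):
    n_w = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_w) if n_w else 0.0

def tfidf_vec(tokens):
    tf = Counter(tokens)
    return {w: c * idf(w) for w, c in tf.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = lambda x: math.sqrt(sum(t * t for t in x.values()))
    return dot / (norm(u) * norm(v)) if norm(u) and norm(v) else 0.0

query = "speech retrieval".split()
scores = [cosine(tfidf_vec(query), tfidf_vec(d)) for d in docs]
# Doc 0 contains both query terms, so it should score highest
```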
Background: IR: Retrieval Models
Okapi BM25 (Robertson et al. 1998)
- Based on an approximation to Harter's 2-Poisson theory of word distribution (1974) & the Robertson/Spärck Jones weight (1976)

Relbm25(d, q) = ∑w∈V log[(rw + 0.5)(|C| − nw − R + rw + 0.5) / ((R − rw + 0.5)(nw − rw + 0.5))]
                · (k1 + 1)·c(w; d) / (K + c(w; d)) · (k3 + 1)·c(w; q) / (k3 + c(w; q)),
  with K = k1·((1 − b) + b·|d| / avdl)

where
- |C| = no. of docs. in collection; V = vocabulary
- c(w; d) = count of w in d; c(w; q) = count of w in q
- nw = no. of docs. containing w
- R = no. of docs. known to be rel.; rw = no. of rel. docs. containing w
- |d| = length of d; avdl = average doc. length
- k1, k2, k3, b are parameters (k2 scales an additive doc.-length correction term, omitted above; the experiments here use k2 = 0)
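The scoring can be sketched in the common special case with no relevance information (R = rw = 0) and k2 = 0, on a made-up five-document collection; this follows the structure of Robertson et al.'s formula but is only an illustrative sketch, not Okapi's implementation:

```python
# Simplified BM25 sketch: no relevance feedback (R = r_w = 0), k2 = 0.
# With R = r_w = 0 the Robertson/Sparck Jones weight reduces to
# log((|C| - n_w + 0.5) / (n_w + 0.5)).
import math
from collections import Counter

docs = [
    "speech retrieval with lattices".split(),
    "statistical retrieval of speech documents".split(),
    "gardening tips for spring".split(),
    "cooking pasta at home".split(),
    "history of printing presses".split(),
]
avdl = sum(len(d) for d in docs) / len(docs)
k1, b, k3 = 1.2, 0.75, 7.0

def bm25(doc, query):
    cq, cd = Counter(query), Counter(doc)
    score = 0.0
    for w, qf in cq.items():
        n_w = sum(1 for d in docs if w in d)
        if n_w == 0:
            continue
        w1 = math.log((len(docs) - n_w + 0.5) / (n_w + 0.5))
        K = k1 * ((1 - b) + b * len(doc) / avdl)   # doc-length normalization
        score += w1 * ((k1 + 1) * cd[w] / (K + cd[w])) \
                    * ((k3 + 1) * qf / (k3 + qf))
    return score

scores = [bm25(d, "speech retrieval".split()) for d in docs]
# Only the first two docs contain the query terms, so only they score > 0
```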
Background: IR: Retrieval Models
Statistical lang. n-gram model (Song & Croft 1999)
- Use Pr(d | q) as the relevance measure
- Assuming uniform Pr(d):
  Pr(d | q) = Pr(q | d)·Pr(d) / Pr(q) ∝ Pr(q | d)
- We can thus define relevance as
  Relstat(d, q) = log Pr(q | d)
- Write q as a seq. of words q1 q2 … qK
- Given a unigram model Pr(· | d),
  Relstat(d, q) = log ∏1≤i≤K Pr(qi | d) = ∑w c(w; q) log Pr(w | d)
- Estimate Pr(· | d) by smoothing word counts
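The query-likelihood ranking can be sketched as follows; for simplicity this toy uses plain Jelinek-Mercer interpolation against a collection model (the thesis itself uses Zhai & Lafferty's 2-stage smoothing, covered later), and all documents are made up:

```python
# Query-likelihood sketch: Rel(d, q) = log Pr(q | d) under a unigram model,
# smoothed against a collection ("background") model so unseen words do not
# get zero probability.
import math
from collections import Counter

docs = [
    "speech retrieval with lattices and models".split(),
    "gardening tips and tricks".split(),
]
collection = Counter(w for d in docs for w in d)
total = sum(collection.values())
lam = 0.5  # interpolation weight

def rel(doc, query):
    cd = Counter(doc)
    score = 0.0
    for w in query:
        p_d = cd[w] / len(doc)            # maximum-likelihood doc model
        p_c = collection[w] / total       # collection model
        score += math.log((1 - lam) * p_d + lam * p_c)
    return score

q = "speech retrieval".split()
# Doc 0 contains the query words, so it should score higher than doc 1
```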
System evaluation
- Compare the IR engine's ranked list to ground-truth relevance judgements
- Eval. metric: mean average precision (MAP)
MAP for a set of queries Q = (1 / |Q|) ∑q∈Q (1 / Rq) ∑1≤j≤Rq j / r′j,q
where
- |Q| = no. of queries
- Rq = no. of docs. rel. to query q
- r′j,q = position of the jth rel. doc. in the ranked list output for query q
Intuitively, higher MAP means relevant docs. are ranked higher
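MAP per the definition above can be sketched directly (doc ids are hypothetical):

```python
# Mean average precision (MAP) sketch: for each query, average the precision
# at the rank of each relevant document, then average over queries.
def average_precision(ranked, relevant):
    """ranked: doc ids in ranked order; relevant: set of relevant doc ids."""
    hits, ap = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            ap += hits / rank          # precision at this relevant doc's rank
    return ap / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Relevant docs at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2 = 5/6
ap = average_precision(["d1", "d2", "d3"], {"d1", "d3"})
```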
Background: Automatic Speech Recognition
ASR transcribes a speech waveform into text; involves
- Pronouncing dictionary — maps written words to phonemes
  - Phoneme: contrastive speech unit: /ae/, /ow/, /th/, /p/, …
- Acoustic models — describe acoustic realizations of phonemes
  - Each model usually for a triphone — a phoneme in the context of 2 phonemes
- Language model — gives word transition probabilities
Background: Automatic Speech Recognition
General paradigm: hidden Markov models (HMMs)
- Acoustic models: left-right triphone HMMs, trained using the EM algo.
- Using the lang. model & pron. dict., join HMMs into one large utterance HMM
- Decoding: find the 'most probable' transcript — Viterbi search with beam pruning (Ney et al. 1992)
- Lattices: computed using an extension of decoding
ASR system evaluation — word error rate (WER)
- Edit dist. / ref. transcript length
- Other metrics: char. error rate, syll. error rate
[Figure: structure of a typical triphone HMM]
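WER as edit distance divided by reference length can be sketched with a standard Levenshtein DP (a generic sketch, not the evaluation tooling used in the thesis):

```python
# Word error rate sketch: WER = word-level edit distance (substitutions +
# insertions + deletions) between hypothesis and reference, divided by the
# reference length.
def wer(ref, hyp):
    ref, hyp = ref.split(), hyp.split()
    # standard Levenshtein DP table over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[-1][-1] / len(ref)

# One substitution in a 4-word reference -> WER = 0.25
rate = wer("nice and tender words", "nice and mentor words")
```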
Background: Spoken Document Retrieval
IR with a collection of speech recordings
ASR engine produces document surrogates — may be
- 1-best word transcripts (e.g. Gauvain et al. 2000)
- 1-best subword transcripts (e.g. Turunen & Kurimo 2006)
- Phoneme lattices (e.g. James 1995; Jones et al. 1996)
- N-best transcript lists (Siegler 1999)
- Word lattices (e.g. Mamou et al. 2006)
- Phoneme & word lattices (e.g. Saraclar & Sproat 2004)
IR models used in SDR
- For SDR with 1-best transcripts: vector space, BM25, & statistical IR models have been tried
- For lattice-based SDR: only the vector space model
Background: Query By Example
IR where queries & docs. are of like form
- Queries are exemplars of the type of objects sought
- E.g. music ("query by humming") (Zhu et al. 2003); images (Vu et al. 2003)
Work related to query-by-example SDR
- Query by example for speech & text
  - He et al. (2003); Lo & Gauvain (2002, 2003): tracking task in Topic Detection & Tracking (TDT)
  - Chen et al. (2004): newswire articles (text) for queries, broadcasts (speech) for docs.
  - All using 1-best transcripts
- Lattices of short spoken queries for IR: Colineau & Halber (1999)
Lattice-Based SDR Under the Statistical Model
Song & Croft's IR model:
Relstat(d, q) = log Pr(q | d) = ∑w c(w; q) log Pr(w | d)
Our idea: estimate Pr(· | d) from lattices
- Find expectations of word counts (Saraclar & Sproat 2004) & doc. lengths, where t ranges over the transcript hypotheses in d's lattice:
  E[c(w; d)] = ∑t c(w; t)·Pr(t | d)
  E[|d|] = ∑t |t|·Pr(t | d)
- Expected counts can be computed efficiently by dynamic programming (Hatch et al. 2005)
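The dynamic-programming computation can be sketched as a forward-backward pass over a toy lattice (hypothetical words and edge scores; a sketch of the idea, not the thesis's implementation):

```python
# Expected word counts from a lattice by dynamic programming.
# E[c(w; d)] = sum over edges labeled w of that edge's posterior probability,
# where an edge's posterior = forward(u) * score * backward(v) / Z.
from collections import defaultdict

# (from_node, to_node, word, score); nodes 0..3 in topological order
edges = [(0, 1, "a", 0.6), (0, 1, "the", 0.4),
         (1, 2, "cat", 0.7), (1, 2, "cap", 0.3),
         (2, 3, "sat", 1.0)]
N = 4  # node 0 is the start, node N-1 the end

fwd = [0.0] * N
fwd[0] = 1.0
for u, v, w, s in edges:              # edges listed in topological order
    fwd[v] += fwd[u] * s

bwd = [0.0] * N
bwd[N - 1] = 1.0
for u, v, w, s in reversed(edges):    # reverse topological order
    bwd[u] += s * bwd[v]

Z = fwd[N - 1]                        # total lattice score

exp_count = defaultdict(float)
for u, v, w, s in edges:
    exp_count[w] += fwd[u] * s * bwd[v] / Z
# Here every path has 3 words, so the expected lengths sum to 3.0
```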
Lattice-Based SDR Under the Statistical Model
The method
1. Start with the speech segment's acoustic observations o
2. Generate a lattice using ASR
   - Decoding with an adaptation of the Viterbi algo.; keep track of multiple paths (James 1995)
   - Use a simple lang. model (bigram LM)
3. Rescore with a more complex LM (trigram LM)
   - Replace bigram LM probs. with trigram probs.
   - Make duplicates of nodes with differing trigram contexts
[Figure: acoustic observations o = o1 o2 o3; lattice from decoding with the simple LM, edges labeled e.g. w1/Pr(o1|w1),Pr(w1|<s>) & w4/Pr(o3|w4),Pr(w4|w3); lattice rescored with the complex LM, edges labeled e.g. w4/Pr(o3|w4),Pr(w4|w1w3) & w4/Pr(o3|w4),Pr(w4|w2w3), with duplicated nodes for differing trigram contexts]
Lattice-Based SDR Under the Statistical Model
The method (cont.)
4. Combine acoustic & LM probs.
   - In practice, apply a grammar scale factor ω & a word insertion penalty ρ
5. Prune the lattice
   - Remove paths whose log probs. fall below the best path's by more than Θdoc
6. Find expectations of word counts E[c(w; o)] & seg. lengths E[|o|]
7. Combine expected counts across segments to get E[c(w; d)], E[|d|]
[Figure: lattice with combined acoustic & LM probs., e.g. p1 = Pr(w1|<s>)·[Pr(o1|w1)·e^ρ]^(1/ω); pruned lattice over o = o1 o2 o3, with edges w4/p1, w3/p3, w4/p4 along one path & w2/p2, w2/p5 along another]
Expected counts from the pruned lattice:
Word | Expected count
w2 | 2·p2·p5 / (p1·p3·p4 + p2·p5)
w3 | p1·p3·p4 / (p1·p3·p4 + p2·p5)
w4 | 2·p1·p3·p4 / (p1·p3·p4 + p2·p5)
Lattice-Based SDR Under the Statistical Model
The method (cont.)
8. Build a unigram model to get Pr(· | d)
   - Zhai & Lafferty's (2004) 2-stage smoothing method
     - A combination of Jelinek-Mercer & Bayesian smoothing
   - Adapt 2-stage smoothing to use expected counts:
     Pr(w | d) = (1 − λ)·(E[c(w; d)] + μ·Pr(w | U)) / (E[|d|] + μ) + λ·Pr(w | U)
   where
   - w is a word, e.g. a query word
   - U is a background language model
   - λ ∈ (0, 1) is set according to the nature of the queries
   - μ is set using a variation of Zhai & Lafferty's estimation algo.
9. Thus we can compute
   Relstat(d, q) = log Pr(q | d) = ∑w c(w; q) log Pr(w | d)
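Step 8 can be sketched numerically; the expected counts and background probabilities below are made up purely for illustration:

```python
# Two-stage smoothing (Zhai & Lafferty 2004) adapted to expected counts:
#   Pr(w | d) = (1 - lam) * (E[c(w; d)] + mu * Pr(w | U)) / (E[|d|] + mu)
#             + lam * Pr(w | U)
# exp_counts / exp_len would come from the lattice; values here are made up.
import math

exp_counts = {"speech": 1.7, "retrieval": 0.9, "the": 3.2}   # E[c(w; d)]
exp_len = sum(exp_counts.values())                           # E[|d|]
background = {"speech": 0.01, "retrieval": 0.005, "the": 0.05}  # Pr(w | U)
lam, mu = 0.1, 100.0

def p_w_given_d(w):
    p_u = background.get(w, 1e-6)   # floor for words unseen in U (toy choice)
    bayesian = (exp_counts.get(w, 0.0) + mu * p_u) / (exp_len + mu)
    return (1 - lam) * bayesian + lam * p_u

# Relevance of this doc to a two-word query, as in step 9
rel = sum(math.log(p_w_given_d(w)) for w in ["speech", "retrieval"])
```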
Other SDR Methods
- Statistical, using 1-best transcripts
  - Motivated by Song & Croft (1999), Chen et al. (2004)
- Vector space, using lattices
  - Mamou et al. (2006)
- BM25, using lattices
Other SDR Methods: Statistical, Using 1-Best Trans.
Estimate Pr(· | d) from the 1-best transcript
- Use Zhai & Lafferty's 2-stage smoothing:
  Pr(w | d) = (1 − λ)·(c1-best(w; d) + μ·Pr(w | U)) / (|d1-best| + μ) + λ·Pr(w | U)
  where
  - w is a word, e.g. a query word
  - c1-best(w; d) = count of w in d's 1-best transcript
  - |d1-best| = length of d's 1-best transcript
  - U is a background language model
  - λ ∈ (0, 1), μ > 0 are smoothing parameters
Compute relevance:
Relstat(d, q) = log Pr(q | d) = ∑w c(w; q) log Pr(w | d)
Other SDR Methods: Vector Space, Using Lattices
Mamou et al. (2006); method:
1. Compute a word confusion network (Mangu et al. 2000)
   - A sequence of confusion sets
2. Compute a term frequency vector
   - Weight of each term depends on its ranks & probs. in the confusion sets, & its freq. in the doc. collection
3. Compute relevance
   - Construct d & q vectors, compute cosine similarity
[Figure: pruned lattice over o = o1 o2 o3 (edges w4/p1, w3/p3, w4/p4; w2/p2, w2/p5) converted to a word confusion network — a sequence of confusion sets g1, g2, g3 with entries such as w2, w3, w4, ε — from which document & query vectors d, q at angle τ are built]
Other SDR Methods: BM25, Using Lattices
Modify Robertson et al.'s BM25 formula to use expected counts:
Relbm25,lat(d, q) = Relbm25 with c(w; d) replaced by E[c(w; d)], |d| by E[|d|], & nw by an estimate nw*
- Estimate doc. freq. nw* from expected counts (Turunen & Kurimo 2007)
SDR Experiments: Mandarin Chinese Task: Setup
Doc. collection
- Hub5 Mandarin training corpus (LDC98T26)
- 42 telephone calls in Mandarin Chinese, total 17 hours, ≈ 600 KB text
Unit of retrieval ("document")
- ½-minute time windows with 50% overlap (Abberley et al. 1998; Tuerk et al. 2001)
- 4,312 retrieval units
Queries
- 18 keyword queries — 14 test, 4 devel.
Ground truth relevance judgements
- Determined manually
SDR Experiments: Mandarin Chinese Task: Details
Lattices
- Generated by Abacus (Hon et al. 1994), a large-vocab. triphone-based continuous speech recognizer
- Rescored with a trigram language model trained on the TDT, Callhome, & CSTSC-Flight corpora
1-best transcripts
- Decoded from the rescored lattices
Other tools used
- AT&T FSM (Mohri et al. 1998)
- SRILM (Stolcke 2002)
- Low et al.'s (2005) Chinese word segmenter
SDR Experiments: Mandarin Chinese Task
Retrieval
- SDR performed using
  - the baseline stat. method, on ref. transcripts
  - the baseline stat. method, on 1-best transcripts
  - Mamou et al.'s vector space method, on lattices
  - our proposed method, on lattices
Smoothing parameter
- λ = 0.1 — good for keyword queries (Zhai & Lafferty 2004)
Lattice pruning threshold Θ̃
- Varied on devel. queries; best value used on test queries
Evaluation measure: mean avg. prec. (MAP)
SDR Experiments: Mandarin Chinese Task: Results
Results for statistical methods
- 1-best MAP was 0.1364; ref. MAP was 0.4798
- Lattice-based MAP for devel. queries highest at Θ̃ = 65,000; at this point, MAP for test queries was 0.2154
[Chart: MAP for the 4 devel. queries & the 14 test queries vs. pruning threshold Θ̃]
SDR Experiments: Mandarin Chinese Task: Results
Results for Mamou et al.'s vector space method
- MAP for devel. queries highest at Θ̃ = 27,500; at this point, MAP for test queries was 0.1599
[Chart: MAP for the 4 devel. queries & the 14 test queries vs. pruning threshold Θ̃]
SDR Experiments: Mandarin Chinese Task: Results

Method | MAP, 4 devel. queries | MAP, 14 test queries
Statistical, with ref. transcripts | 0.5052 | 0.4798
Statistical, with 1-best transcripts | 0.1251 | 0.1364
Vector space, with lattices | 0.1685 | 0.1599
Statistical, with lattices (our method) | 0.2180 | 0.2154

Statistical significance testing — 1-tailed t-test
- Improvement over 1-best: significant at the 99.5% level
- Improvement over vector space: significant at the 97.5% level
Our method outperforms stat. 1-best & vec. space with lattices
SDR Experiments: English Task: Setup
Corpus: Fisher English Training corpus from LDC
- 11,699 telephone calls, total 1,920 hours, ≈ 109 MB text
- Each call initiated by one of 40 topics
- 6,605 calls for training the ASR engine
Queries
- The 40 topic specifications — 32 test, 8 devel.
Doc. collection
- 5,094 calls
- Unit of retrieval ("document"): a call
Ground truth rel. judgements
- d rel. to q iff conversation d was initiated by topic q
Example of a topic spec.:
"ENG01. Professional sports on TV. Do either of you have a favorite TV sport? How many hours per week do you spend watching it and other sporting events on TV?"
SDR Experiments: English Task: Details
Lattices
- Generated by HTK (Young et al. 2006), a large-vocab. triphone-based continuous speech recognizer
- Tried trigram LM rescoring, & decoding with only the bigram LM
1-best transcripts
- Decoded from the rescored lattices
- Word error rate: 48.1% (with rescoring), 50.8% (without)
Words stemmed with the Porter stemmer
Also tried stop word removal — experimented with
- no stopping
- stopping with the 319-word list from U. of Glasgow (gla)
- stopping with the 571-word list used in the SMART system (smart)
Index building: used the CMU Lemur toolkit
SDR Experiments: English Task
Retrieval
- Performed using
  - the baseline stat. method, on ref. transcripts
  - the baseline stat. method, on 1-best transcripts
  - Mamou et al.'s vector space method, on lattices
  - the BM25 method, on lattices
  - our proposed method, on lattices
Retrieval parameters
- For stat. methods: λ = 0.7 — good for verbose queries
- For BM25: k1 = 1.2, b = 0.75, k2 = 0 (following Robertson et al. 1998); Θ̃, k3 tuned with devel. queries
Evaluation measure: MAP
SDR Experiments: English Task: Results

Method | MAP (test), 1-best transcripts | MAP (test), lattices
Vector space, gla stop list | 0.6020 | 0.6246
Vector space, smart stop list | 0.6858 | 0.6876
BM25 | 0.6773 | 0.7139
Statistical, no lattice rescoring | 0.7499 | 0.7630
Statistical | 0.7611 | 0.7717

Main findings
- Our method outperforms 1-best stat. SDR, Mamou et al.'s vector space method, & BM25
- Unlike Mamou et al.'s method, it does not need stop word removal
- Rescoring lattices with a trigram LM helps improve SDR
Query-By-Example SDR
The task
- Given: collection C of spoken docs., & a query exemplar q (also a spoken doc.)
- Task: find docs. in the collection on a similar topic as the query
Extending our stat. lattice-based SDR method to query by example — additional challenges
- Problem #1: How to cope with uncertainty in the ASR transcription of q?
- Problem #2: How to handle the high concentration of non-content words in q?
Query-By-Example SDR: Problems
Problem #1: Uncertainty in the transcription of q
- Use multiple ASR hypotheses for q
- Reformulate 1-best stat. IR as negative Kullback-Leibler divergence ranking (Lafferty & Zhai 2001):
  −ΔKL(q, d) = log Pr(q | d), up to rank equivalence
- Thus, we can estimate models Pr(· | d) & Pr(· | q) from the d & q lattices, & rank docs. by neg. KL div.
Problem #2: Lots of non-content words in q
- Use stop word removal
Query-By-Example SDR: Proposed Method
1. Get lattices for d & q; rescore, prune, & find expected counts
   - Use 2 pruning thresholds: Θdoc for docs., Θqry for queries
2. Build a unigram model of d
   - With expected counts
   - Again, use 2-stage smoothing (Zhai & Lafferty 2004)
3. Build a unigram model of q — unsmoothed
4. Compute relevance as neg. KL div. (Lafferty & Zhai 2001):
   Relstat-qbe(d, q) = ∑w Pr(w | q) log Pr(w | d)
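The final ranking formula can be sketched with two hypothetical word distributions standing in for smoothed document models:

```python
# Query-by-example ranking sketch: rank docs by
#   Rel(d, q) = sum_w Pr(w | q) * log Pr(w | d),
# which equals the negative KL divergence up to a query-only constant.
import math

def rel_qbe(p_q, p_d):
    """p_q: unsmoothed query model; p_d: smoothed document model."""
    return sum(pq * math.log(p_d[w]) for w, pq in p_q.items() if pq > 0)

p_q = {"speech": 0.5, "retrieval": 0.5}                      # query model
p_d1 = {"speech": 0.3, "retrieval": 0.2, "other": 0.5}       # on-topic doc
p_d2 = {"speech": 0.01, "retrieval": 0.01, "other": 0.98}    # off-topic doc
# p_d1 puts more mass on the query's words, so it should rank higher
```

Because the query model is unsmoothed, only words actually hypothesized for q contribute; the document models must be smoothed so every such word has nonzero probability.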
Query-By-Example SDR: Experiments
Corpus: Fisher English Training corpus
Queries
- 40 exemplars — 32 test, 8 devel. — for the 40 topics
Doc. collection
- 5,054 telephone calls
Ground truth rel. judgements
- d rel. to q iff d & q are on the same topic
Smoothing parameter
- λ = 0.7
Lattice pruning thresholds Θdoc and Θqry
- Varied independently on devel. queries
Stop word removal: used
- no stopping
- stopping with the gla stop list
- stopping with the smart stop list
Query-By-Example SDR: Experiments
Retrieval performed using
- 1-best trans. of exemplars & docs. (1-best → 1-best)
- exemplar 1-best trans., doc. lattices (1-best → Lat)
- exemplar lattices, doc. 1-best trans. (Lat → 1-best)
- lattice counts of exemplars & docs. (Lat → Lat): our proposed method
Also tried
- ref. trans. of exemplars & docs. (Ref → Ref)
- orig. Fisher topic specs. for queries (Top → Ref, Top → 1-best, Top → Lat)
Evaluation measure: MAP
Query-By-Example SDR: Experimental Results

MAP of test queries, no stop word removal:
- Orig. Fisher topic specs. as queries: Top → 1-best 0.8149 | Top → Lat 0.7613 | Top → Ref 0.7723
- Exemplars as queries: Ref → Ref 0.7468 | 1-best → 1-best 0.6958 | 1-best → Lat 0.7009 | Lat → 1-best 0.7023 | Lat → Lat 0.7079

Stat. significance testing — 1-tailed t-test, Wilcoxon test
- Lat → Lat vs. 1-best → 1-best: improvement significant at the 99.95% level
- However, the original topic specs. are still better — the nature of the exemplars presents difficulties for retrieval
Query-By-Example SDR: Experimental Results

MAP of test queries, with stop word removal:

Method | No stopping | gla stop list | smart stop list
Ref → Ref | 0.7468 | 0.7630 | 0.7781
1-best → 1-best | 0.6958 | 0.7193 | 0.7406
1-best → Lat | 0.7009 | 0.7283 | 0.7499
Lat → 1-best | 0.7023 | 0.7285 | 0.7487
Lat → Lat | 0.7079 | 0.7364 | 0.7569

- With the gla stop list: Lat → Lat better than 1-best → 1-best at the 99.99% level
- With the smart stop list: better at the 99.95% level
Our method (Lat → Lat) gives a consistent improvement
Conclusion
Contributions
- Proposed a novel SDR method — combines the use of lattices & a statistical IR model
  - Motivated by improved IR accuracy when each technique was used individually
  - The new method performs well compared to previous methods & lattice-based BM25
- Extended the proposed method to query-by-example SDR
  - Lattice-based query by example, under the statistical IR model
  - Significant improvement over using 1-best trans.
  - Consistently better, under a variety of setups
Conclusion
Suggestions for future work
- Incorporate proximity-based search into our method
- Formulate a more principled way of deriving lattice pruning thresholds
- Examine how stop words affect SDR & query by example
- Extend the stat. lattice-based SDR framework to other speech processing tasks, e.g. spoken document classification
Thank you!