Lattice-Based Statistical Spoken Document Retrieval
Chia Tee Kiah
Ph.D. thesis
Department of Computer Science, School of Computing, National University of Singapore
Supervisors: A/Prof. Ng Hwee Tou (NUS), Dr. Li Haizhou (I2R)
Outline
- Introduction
- Original contribution
- Background
- Lattice-based SDR under statistical model
- Other SDR methods
- Experiments on SDR with short queries
- Query-by-example SDR
- Conclusion
Introduction: Spoken Document Retrieval
Information retrieval (IR)
- Search for items of data according to the user's info. need
Spoken document retrieval (SDR)
- IR on speech recordings
- Growing in importance — more & more speech data stored: news broadcasts, voice mails, …
SDR more difficult than text IR
- Currently needs automatic speech recognition (ASR)
- 1-best transcripts from ASR are error-prone
- Word error rate for noisy, spontaneous speech may be 50%
Introduction: Lattices
Lattice — connected directed acyclic graph (James & Young 1994; James 1995)
- Each edge labeled with a term hypothesis & probabilities
- Each path gives a hypothesized seq. of terms, & its probability
Use alternative hypotheses to overcome errors in 1-best transcripts — lattice-based SDR
[Figure: example word lattice from t = 0.00 to t = 1.12, with edges between <s> and </s> labeled "and", "it's", "my son's", "mentor", "nice and tender", "to tender"]
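As a toy illustration of how a lattice encodes competing hypotheses, the paths of a small hypothetical lattice (made-up words and probabilities, not the lattice in the figure) can be enumerated along with their probabilities:

```python
# Toy word lattice: a DAG whose edges carry a word hypothesis and a
# probability; every start-to-end path is one hypothesized transcript.
# (Hypothetical example, not the lattice shown in the figure.)

# edges[node] = list of (next_node, word, probability)
edges = {
    0: [(1, "and", 0.6), (1, "an", 0.4)],
    1: [(2, "apple", 0.7), (2, "appal", 0.3)],
}

def paths(node, end):
    """Yield (word_sequence, path_probability) for every path node -> end."""
    if node == end:
        yield [], 1.0
    for nxt, word, p in edges.get(node, []):
        for words, prob in paths(nxt, end):
            yield [word] + words, p * prob

hyps = sorted(paths(0, 2), key=lambda wp: -wp[1])
# Best path is "and apple", with probability 0.6 * 0.7
```

Lattice-based SDR exploits the lower-probability paths as well, instead of keeping only the 1-best path.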
Original Contribution
A method for lattice-based SDR using a statistical IR model (Song & Croft 1999)
- Calculate the expected count of each word in each lattice
- From the counts, estimate statistical lang. models for docs.
- Compute query-doc. relevance as a probability
- Previous lattice-based SDR methods all based on the vector space IR model!
Extension to query-by-example SDR
- SDR where queries are also full-fledged spoken docs.
Presented in EMNLP-CoNLL 2007, SIGIR 2008
Background: Information Retrieval
The task of IR
- Given: doc. collection C, query q giving info. need
- Find list of docs. in C relevant to the info. need
Steps involved
- Before receiving the query: document preprocessing — outputs an index for rapid access
- Upon receiving the query: retrieval — outputs a ranked list of docs.
  - Done by assigning relevance scores; guided by a retrieval model
  - Good IR systems give higher scores to more relevant docs.
[Figure: document preprocessing & retrieval pipeline]
- Tokenization: “… Nevertheless, ‘information retrieval’ has become accepted as a description …” → “… nevertheless information retrieval has become accepted as a description …”
- Stop word removal: → “… information retrieval accepted description …”
- Stemming: → “… inform retriev accept descript …”
- Indexing: index entries such as “document: 336, 624, 864, …”; “inform: 33, 128, 315, …”
- Retrieval: query q = “Euclid’s algorithm” against the index & collection C → ranked list (#1, #2, #3, …), e.g. “… an algorithm for finding the greatest common divisor of two numbers …”
Background: IR: Retrieval Models
Vector space with tf·idf weighting (Salton 1963; Spärck Jones 1972)
- Docs. & queries are Euclidean vectors
- Compute relevance as cosine similarity
- Each vector component d(i), q(i) is a product of:
  - tf(wi, d) — "term frequency": increasing func. of the no. of occurrences c(wi; d) of wi in d
  - idf(wi) — "inverse doc. frequency": decreasing func. of the no. of docs. containing wi
[Figure: document vector d & query vector q separated by angle τ; relevance = cos τ]
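A minimal sketch of tf·idf cosine scoring on a hypothetical three-document collection (raw counts for tf, log(|C|/n_w) for idf; real systems vary in the exact weighting):

```python
# Vector-space retrieval sketch: tf.idf weights + cosine similarity.
# Toy collection; tf = raw count, idf = log(|C| / n_w).
import math
from collections import Counter

docs = [
    "information retrieval on speech data".split(),
    "speech recognition produces transcripts".split(),
    "cooking recipes and kitchen tips".split(),
]

def idf(word):
    n_w = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_w) if n_w else 0.0

def tfidf_vec(tokens):
    tf = Counter(tokens)
    return {w: c * idf(w) for w, c in tf.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = lambda x: math.sqrt(sum(t * t for t in x.values()))
    return dot / (norm(u) * norm(v)) if norm(u) and norm(v) else 0.0

query = "speech retrieval".split()
scores = [cosine(tfidf_vec(query), tfidf_vec(d)) for d in docs]
# Doc 0 contains both query terms, so it should score highest
```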
Background: IR: Retrieval Models
Okapi BM25 (Robertson et al. 1998)
- Based on an approximation to Harter's 2-Poisson theory of word distribution (1974) & the Robertson/Spärck Jones weight (1976)

Relbm25(d, q) = ∑w∈V log[(rw + 0.5)(|C| − nw − R + rw + 0.5) / ((R − rw + 0.5)(nw − rw + 0.5))]
                · (k1 + 1)·c(w; d) / (K + c(w; d)) · (k3 + 1)·c(w; q) / (k3 + c(w; q)),
  with K = k1·((1 − b) + b·|d| / avdl)

where
- |C| = no. of docs. in collection; V = vocabulary
- c(w; d) = count of w in d; c(w; q) = count of w in q
- nw = no. of docs. containing w
- R = no. of docs. known to be rel.; rw = no. of rel. docs. containing w
- |d| = length of d; avdl = average doc. length
- k1, k2, k3, b are parameters (k2 scales an additive doc.-length correction term, omitted above; the experiments here use k2 = 0)
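The scoring can be sketched in the common special case with no relevance information (R = rw = 0) and k2 = 0, on a made-up five-document collection; this follows the structure of Robertson et al.'s formula but is only an illustrative sketch, not Okapi's implementation:

```python
# Simplified BM25 sketch: no relevance feedback (R = r_w = 0), k2 = 0.
# With R = r_w = 0 the Robertson/Sparck Jones weight reduces to
# log((|C| - n_w + 0.5) / (n_w + 0.5)).
import math
from collections import Counter

docs = [
    "speech retrieval with lattices".split(),
    "statistical retrieval of speech documents".split(),
    "gardening tips for spring".split(),
    "cooking pasta at home".split(),
    "history of printing presses".split(),
]
avdl = sum(len(d) for d in docs) / len(docs)
k1, b, k3 = 1.2, 0.75, 7.0

def bm25(doc, query):
    cq, cd = Counter(query), Counter(doc)
    score = 0.0
    for w, qf in cq.items():
        n_w = sum(1 for d in docs if w in d)
        if n_w == 0:
            continue
        w1 = math.log((len(docs) - n_w + 0.5) / (n_w + 0.5))
        K = k1 * ((1 - b) + b * len(doc) / avdl)   # doc-length normalization
        score += w1 * ((k1 + 1) * cd[w] / (K + cd[w])) \
                    * ((k3 + 1) * qf / (k3 + qf))
    return score

scores = [bm25(d, "speech retrieval".split()) for d in docs]
# Only the first two docs contain the query terms, so only they score > 0
```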
Background: IR: Retrieval Models
Statistical lang. n-gram model (Song & Croft 1999)
- Use Pr(d | q) as the relevance measure
- Assuming uniform Pr(d):
  Pr(d | q) = Pr(q | d)·Pr(d) / Pr(q) ∝ Pr(q | d)
- We can thus define relevance as
  Relstat(d, q) = log Pr(q | d)
- Write q as a seq. of words q1 q2 … qK
- Given a unigram model Pr(· | d),
  Relstat(d, q) = log ∏1≤i≤K Pr(qi | d) = ∑w c(w; q) log Pr(w | d)
- Estimate Pr(· | d) by smoothing word counts
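The query-likelihood ranking can be sketched as follows; for simplicity this toy uses plain Jelinek-Mercer interpolation against a collection model (the thesis itself uses Zhai & Lafferty's 2-stage smoothing, covered later), and all documents are made up:

```python
# Query-likelihood sketch: Rel(d, q) = log Pr(q | d) under a unigram model,
# smoothed against a collection ("background") model so unseen words do not
# get zero probability.
import math
from collections import Counter

docs = [
    "speech retrieval with lattices and models".split(),
    "gardening tips and tricks".split(),
]
collection = Counter(w for d in docs for w in d)
total = sum(collection.values())
lam = 0.5  # interpolation weight

def rel(doc, query):
    cd = Counter(doc)
    score = 0.0
    for w in query:
        p_d = cd[w] / len(doc)            # maximum-likelihood doc model
        p_c = collection[w] / total       # collection model
        score += math.log((1 - lam) * p_d + lam * p_c)
    return score

q = "speech retrieval".split()
# Doc 0 contains the query words, so it should score higher than doc 1
```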
System evaluation
- Compare the IR engine's ranked list to ground-truth relevance judgements
- Eval. metric: mean average precision (MAP)
MAP for a set of queries Q = (1 / |Q|) ∑q∈Q (1 / Rq) ∑1≤j≤Rq j / r′j,q
where
- |Q| = no. of queries
- Rq = no. of docs. rel. to query q
- r′j,q = position of the jth rel. doc. in the ranked list output for query q
Intuitively, higher MAP means relevant docs. are ranked higher
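MAP per the definition above can be sketched directly (doc ids are hypothetical):

```python
# Mean average precision (MAP) sketch: for each query, average the precision
# at the rank of each relevant document, then average over queries.
def average_precision(ranked, relevant):
    """ranked: doc ids in ranked order; relevant: set of relevant doc ids."""
    hits, ap = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            ap += hits / rank          # precision at this relevant doc's rank
    return ap / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Relevant docs at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2 = 5/6
ap = average_precision(["d1", "d2", "d3"], {"d1", "d3"})
```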
Background: Automatic Speech Recognition
ASR transcribes a speech waveform into text; involves
- Pronouncing dictionary — maps written words to phonemes
  - Phoneme: contrastive speech unit: /ae/, /ow/, /th/, /p/, …
- Acoustic models — describe acoustic realizations of phonemes
  - Each model usually for a triphone — a phoneme in the context of 2 phonemes
- Language model — gives word transition probabilities
Background: Automatic Speech Recognition
General paradigm: hidden Markov models (HMMs)
- Acoustic models: left-right triphone HMMs, trained using the EM algo.
- Using the lang. model & pron. dict., join HMMs into one large utterance HMM
- Decoding: find the 'most probable' transcript — Viterbi search with beam pruning (Ney et al. 1992)
- Lattices: computed using an extension of decoding
ASR system evaluation — word error rate (WER)
- Edit dist. / ref. transcript length
- Other metrics: char. error rate, syll. error rate
[Figure: structure of a typical triphone HMM]
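WER as edit distance divided by reference length can be sketched with a standard Levenshtein DP (a generic sketch, not the evaluation tooling used in the thesis):

```python
# Word error rate sketch: WER = word-level edit distance (substitutions +
# insertions + deletions) between hypothesis and reference, divided by the
# reference length.
def wer(ref, hyp):
    ref, hyp = ref.split(), hyp.split()
    # standard Levenshtein DP table over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[-1][-1] / len(ref)

# One substitution in a 4-word reference -> WER = 0.25
rate = wer("nice and tender words", "nice and mentor words")
```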
Background: Spoken Document Retrieval
IR with a collection of speech recordings
ASR engine produces document surrogates — may be
- 1-best word transcripts (e.g. Gauvain et al. 2000)
- 1-best subword transcripts (e.g. Turunen & Kurimo 2006)
- Phoneme lattices (e.g. James 1995; Jones et al. 1996)
- N-best transcript lists (Siegler 1999)
- Word lattices (e.g. Mamou et al. 2006)
- Phoneme & word lattices (e.g. Saraclar & Sproat 2004)
IR models used in SDR
- For SDR with 1-best transcripts: vector space, BM25, & statistical IR models have been tried
- For lattice-based SDR: only the vector space model
Background: Query By Example
IR where queries & docs. are of like form
- Queries are exemplars of the type of objects sought
- E.g. music ("query by humming") (Zhu et al. 2003); images (Vu et al. 2003)
Work related to query-by-example SDR
- Query by example for speech & text
  - He et al. (2003); Lo & Gauvain (2002, 2003): tracking task in Topic Detection & Tracking (TDT)
  - Chen et al. (2004): newswire articles (text) for queries, broadcasts (speech) for docs.
  - All using 1-best transcripts
- Lattices of short spoken queries for IR: Colineau & Halber (1999)
Lattice-Based SDR Under the Statistical Model
Song & Croft's IR model:
Relstat(d, q) = log Pr(q | d) = ∑w c(w; q) log Pr(w | d)
Our idea: estimate Pr(· | d) from lattices
- Find expectations of word counts (Saraclar & Sproat 2004) & doc. lengths, where t ranges over the transcript hypotheses in d's lattice:
  E[c(w; d)] = ∑t c(w; t)·Pr(t | d)
  E[|d|] = ∑t |t|·Pr(t | d)
- Expected counts can be computed efficiently by dynamic programming (Hatch et al. 2005)
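The dynamic-programming computation can be sketched as a forward-backward pass over a toy lattice (hypothetical words and edge scores; a sketch of the idea, not the thesis's implementation):

```python
# Expected word counts from a lattice by dynamic programming.
# E[c(w; d)] = sum over edges labeled w of that edge's posterior probability,
# where an edge's posterior = forward(u) * score * backward(v) / Z.
from collections import defaultdict

# (from_node, to_node, word, score); nodes 0..3 in topological order
edges = [(0, 1, "a", 0.6), (0, 1, "the", 0.4),
         (1, 2, "cat", 0.7), (1, 2, "cap", 0.3),
         (2, 3, "sat", 1.0)]
N = 4  # node 0 is the start, node N-1 the end

fwd = [0.0] * N
fwd[0] = 1.0
for u, v, w, s in edges:              # edges listed in topological order
    fwd[v] += fwd[u] * s

bwd = [0.0] * N
bwd[N - 1] = 1.0
for u, v, w, s in reversed(edges):    # reverse topological order
    bwd[u] += s * bwd[v]

Z = fwd[N - 1]                        # total lattice score

exp_count = defaultdict(float)
for u, v, w, s in edges:
    exp_count[w] += fwd[u] * s * bwd[v] / Z
# Here every path has 3 words, so the expected lengths sum to 3.0
```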
Lattice-Based SDR Under the Statistical Model
The method
1. Start with the speech segment's acoustic observations o
2. Generate a lattice using ASR
   - Decoding with an adaptation of the Viterbi algo.; keep track of multiple paths (James 1995)
   - Use a simple lang. model (bigram LM)
3. Rescore with a more complex LM (trigram LM)
   - Replace bigram LM probs. with trigram probs.
   - Make duplicates of nodes with differing trigram contexts
[Figure: acoustic observations o = o1 o2 o3; lattice from decoding with the simple LM, edges labeled e.g. w1/Pr(o1|w1),Pr(w1|<s>) & w4/Pr(o3|w4),Pr(w4|w3); lattice rescored with the complex LM, edges labeled e.g. w4/Pr(o3|w4),Pr(w4|w1w3) & w4/Pr(o3|w4),Pr(w4|w2w3), with duplicated nodes for differing trigram contexts]
Lattice-Based SDR Under the Statistical Model
The method (cont.)
4. Combine acoustic & LM probs.
   - In practice, apply a grammar scale factor ω & a word insertion penalty ρ
5. Prune the lattice
   - Remove paths whose log probs. fall below the best path's by more than Θdoc
6. Find expectations of word counts E[c(w; o)] & seg. lengths E[|o|]
7. Combine expected counts across segments to get E[c(w; d)], E[|d|]
[Figure: lattice with combined acoustic & LM probs., e.g. p1 = Pr(w1|<s>)·[Pr(o1|w1)·e^ρ]^(1/ω); pruned lattice over o = o1 o2 o3, with edges w4/p1, w3/p3, w4/p4 along one path & w2/p2, w2/p5 along another]
Expected counts from the pruned lattice:
Word | Expected count
w2 | 2·p2·p5 / (p1·p3·p4 + p2·p5)
w3 | p1·p3·p4 / (p1·p3·p4 + p2·p5)
w4 | 2·p1·p3·p4 / (p1·p3·p4 + p2·p5)
Lattice-Based SDR Under the Statistical Model
The method (cont.)
8. Build a unigram model to get Pr(· | d)
   - Zhai & Lafferty's (2004) 2-stage smoothing method
     - A combination of Jelinek-Mercer & Bayesian smoothing
   - Adapt 2-stage smoothing to use expected counts:
     Pr(w | d) = (1 − λ)·(E[c(w; d)] + μ·Pr(w | U)) / (E[|d|] + μ) + λ·Pr(w | U)
   where
   - w is a word, e.g. a query word
   - U is a background language model
   - λ ∈ (0, 1) is set according to the nature of the queries
   - μ is set using a variation of Zhai & Lafferty's estimation algo.
9. Thus we can compute
   Relstat(d, q) = log Pr(q | d) = ∑w c(w; q) log Pr(w | d)
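Step 8 can be sketched numerically; the expected counts and background probabilities below are made up purely for illustration:

```python
# Two-stage smoothing (Zhai & Lafferty 2004) adapted to expected counts:
#   Pr(w | d) = (1 - lam) * (E[c(w; d)] + mu * Pr(w | U)) / (E[|d|] + mu)
#             + lam * Pr(w | U)
# exp_counts / exp_len would come from the lattice; values here are made up.
import math

exp_counts = {"speech": 1.7, "retrieval": 0.9, "the": 3.2}   # E[c(w; d)]
exp_len = sum(exp_counts.values())                           # E[|d|]
background = {"speech": 0.01, "retrieval": 0.005, "the": 0.05}  # Pr(w | U)
lam, mu = 0.1, 100.0

def p_w_given_d(w):
    p_u = background.get(w, 1e-6)   # floor for words unseen in U (toy choice)
    bayesian = (exp_counts.get(w, 0.0) + mu * p_u) / (exp_len + mu)
    return (1 - lam) * bayesian + lam * p_u

# Relevance of this doc to a two-word query, as in step 9
rel = sum(math.log(p_w_given_d(w)) for w in ["speech", "retrieval"])
```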
Other SDR Methods
- Statistical, using 1-best transcripts
  - Motivated by Song & Croft (1999), Chen et al. (2004)
- Vector space, using lattices
  - Mamou et al. (2006)
- BM25, using lattices
Other SDR Methods: Statistical, Using 1-Best Trans.
Estimate Pr(· | d) from the 1-best transcript
- Use Zhai & Lafferty's 2-stage smoothing:
  Pr(w | d) = (1 − λ)·(c1-best(w; d) + μ·Pr(w | U)) / (|d1-best| + μ) + λ·Pr(w | U)
  where
  - w is a word, e.g. a query word
  - c1-best(w; d) = count of w in d's 1-best transcript
  - |d1-best| = length of d's 1-best transcript
  - U is a background language model
  - λ ∈ (0, 1), μ > 0 are smoothing parameters
Compute relevance:
Relstat(d, q) = log Pr(q | d) = ∑w c(w; q) log Pr(w | d)
Other SDR Methods: Vector Space, Using Lattices
Mamou et al. (2006); method:
1. Compute a word confusion network (Mangu et al. 2000)
   - A sequence of confusion sets
2. Compute a term frequency vector
   - Weight of each term depends on its ranks & probs. in the confusion sets, & its freq. in the doc. collection
3. Compute relevance
   - Construct d & q vectors, compute cosine similarity
[Figure: pruned lattice over o = o1 o2 o3 (edges w4/p1, w3/p3, w4/p4; w2/p2, w2/p5) converted to a word confusion network — a sequence of confusion sets g1, g2, g3 with entries such as w2, w3, w4, ε — from which document & query vectors d, q at angle τ are built]
Other SDR Methods: BM25, Using Lattices
Modify Robertson et al.'s BM25 formula to use expected counts:
Relbm25,lat(d, q) = Relbm25 with c(w; d) replaced by E[c(w; d)], |d| by E[|d|], & nw by an estimate nw*
- Estimate doc. freq. nw* from expected counts (Turunen & Kurimo 2007)
SDR Experiments: Mandarin Chinese Task: Setup
Doc. collection
- Hub5 Mandarin training corpus (LDC98T26)
- 42 telephone calls in Mandarin Chinese, total 17 hours, ≈ 600 KB text
Unit of retrieval ("document")
- ½-minute time windows with 50% overlap (Abberley et al. 1998; Tuerk et al. 2001)
- 4,312 retrieval units
Queries
- 18 keyword queries — 14 test, 4 devel.
Ground truth relevance judgements
- Determined manually
SDR Experiments: Mandarin Chinese Task: Details
Lattices
- Generated by Abacus (Hon et al. 1994), a large-vocab. triphone-based continuous speech recognizer
- Rescored with a trigram language model trained on the TDT, Callhome, & CSTSC-Flight corpora
1-best transcripts
- Decoded from the rescored lattices
Other tools used
- AT&T FSM (Mohri et al. 1998)
- SRILM (Stolcke 2002)
- Low et al.'s (2005) Chinese word segmenter
SDR Experiments: Mandarin Chinese Task
Retrieval
- SDR performed using
  - the baseline stat. method, on ref. transcripts
  - the baseline stat. method, on 1-best transcripts
  - Mamou et al.'s vector space method, on lattices
  - our proposed method, on lattices
Smoothing parameter
- λ = 0.1 — good for keyword queries (Zhai & Lafferty 2004)
Lattice pruning threshold Θ̃
- Varied on devel. queries; best value used on test queries
Evaluation measure: mean avg. prec. (MAP)
SDR Experiments: Mandarin Chinese Task: Results
Results for statistical methods
- 1-best MAP was 0.1364; ref. MAP was 0.4798
- Lattice-based MAP for devel. queries highest at Θ̃ = 65,000; at this point, MAP for test queries was 0.2154
[Chart: MAP for the 4 devel. queries & the 14 test queries vs. pruning threshold Θ̃]
SDR Experiments: Mandarin Chinese Task: Results
Results for Mamou et al.'s vector space method
- MAP for devel. queries highest at Θ̃ = 27,500; at this point, MAP for test queries was 0.1599
[Chart: MAP for the 4 devel. queries & the 14 test queries vs. pruning threshold Θ̃]
SDR Experiments: Mandarin Chinese Task: Results

Method | MAP, 4 devel. queries | MAP, 14 test queries
Statistical, with ref. transcripts | 0.5052 | 0.4798
Statistical, with 1-best transcripts | 0.1251 | 0.1364
Vector space, with lattices | 0.1685 | 0.1599
Statistical, with lattices (our method) | 0.2180 | 0.2154

Statistical significance testing — 1-tailed t-test
- Improvement over 1-best: significant at the 99.5% level
- Improvement over vector space: significant at the 97.5% level
Our method outperforms stat. 1-best & vec. space with lattices
SDR Experiments: English Task: Setup
Corpus: Fisher English Training corpus from LDC
- 11,699 telephone calls, total 1,920 hours, ≈ 109 MB text
- Each call initiated by one of 40 topics
- 6,605 calls for training the ASR engine
Queries
- The 40 topic specifications — 32 test, 8 devel.
Doc. collection
- 5,094 calls
- Unit of retrieval ("document"): a call
Ground truth rel. judgements
- d rel. to q iff conversation d was initiated by topic q
Example of a topic spec.:
"ENG01. Professional sports on TV. Do either of you have a favorite TV sport? How many hours per week do you spend watching it and other sporting events on TV?"
SDR Experiments: English Task: Details
Lattices
- Generated by HTK (Young et al. 2006), a large-vocab. triphone-based continuous speech recognizer
- Tried trigram LM rescoring, & decoding with only the bigram LM
1-best transcripts
- Decoded from the rescored lattices
- Word error rate: 48.1% (with rescoring), 50.8% (without)
Words stemmed with the Porter stemmer
Also tried stop word removal — experimented with
- no stopping
- stopping with the 319-word list from U. of Glasgow (gla)
- stopping with the 571-word list used in the SMART system (smart)
Index building: used the CMU Lemur toolkit
SDR Experiments: English Task
Retrieval
- Performed using
  - the baseline stat. method, on ref. transcripts
  - the baseline stat. method, on 1-best transcripts
  - Mamou et al.'s vector space method, on lattices
  - the BM25 method, on lattices
  - our proposed method, on lattices
Retrieval parameters
- For stat. methods: λ = 0.7 — good for verbose queries
- For BM25: k1 = 1.2, b = 0.75, k2 = 0 (following Robertson et al. 1998); Θ̃, k3 tuned with devel. queries
Evaluation measure: MAP
SDR Experiments: English Task: Results

Method | MAP (test), 1-best transcripts | MAP (test), lattices
Vector space, gla stop list | 0.6020 | 0.6246
Vector space, smart stop list | 0.6858 | 0.6876
BM25 | 0.6773 | 0.7139
Statistical, no lattice rescoring | 0.7499 | 0.7630
Statistical | 0.7611 | 0.7717

Main findings
- Our method outperforms 1-best stat. SDR, Mamou et al.'s vector space method, & BM25
- Unlike Mamou et al.'s method, it does not need stop word removal
- Rescoring lattices with a trigram LM helps improve SDR
Query-By-Example SDR
The task
- Given: collection C of spoken docs., & a query exemplar q (also a spoken doc.)
- Task: find docs. in the collection on a similar topic as the query
Extending our stat. lattice-based SDR method to query by example — additional challenges
- Problem #1: How to cope with uncertainty in the ASR transcription of q?
- Problem #2: How to handle the high concentration of non-content words in q?
Query-By-Example SDR: Problems
Problem #1: Uncertainty in the transcription of q
- Use multiple ASR hypotheses for q
- Reformulate 1-best stat. IR as negative Kullback-Leibler divergence ranking (Lafferty & Zhai 2001):
  −ΔKL(q, d) = log Pr(q | d), up to rank equivalence
- Thus, we can estimate models Pr(· | d) & Pr(· | q) from the d & q lattices, & rank docs. by neg. KL div.
Problem #2: Lots of non-content words in q
- Use stop word removal
Query-By-Example SDR: Proposed Method
1. Get lattices for d & q; rescore, prune, & find expected counts
   - Use 2 pruning thresholds: Θdoc for docs., Θqry for queries
2. Build a unigram model of d
   - With expected counts
   - Again, use 2-stage smoothing (Zhai & Lafferty 2004)
3. Build a unigram model of q — unsmoothed
4. Compute relevance as neg. KL div. (Lafferty & Zhai 2001):
   Relstat-qbe(d, q) = ∑w Pr(w | q) log Pr(w | d)
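The final ranking formula can be sketched with two hypothetical word distributions standing in for smoothed document models:

```python
# Query-by-example ranking sketch: rank docs by
#   Rel(d, q) = sum_w Pr(w | q) * log Pr(w | d),
# which equals the negative KL divergence up to a query-only constant.
import math

def rel_qbe(p_q, p_d):
    """p_q: unsmoothed query model; p_d: smoothed document model."""
    return sum(pq * math.log(p_d[w]) for w, pq in p_q.items() if pq > 0)

p_q = {"speech": 0.5, "retrieval": 0.5}                      # query model
p_d1 = {"speech": 0.3, "retrieval": 0.2, "other": 0.5}       # on-topic doc
p_d2 = {"speech": 0.01, "retrieval": 0.01, "other": 0.98}    # off-topic doc
# p_d1 puts more mass on the query's words, so it should rank higher
```

Because the query model is unsmoothed, only words actually hypothesized for q contribute; the document models must be smoothed so every such word has nonzero probability.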
Query-By-Example SDR: Experiments
Corpus: Fisher English Training corpus
Queries
- 40 exemplars — 32 test, 8 devel. — for the 40 topics
Doc. collection
- 5,054 telephone calls
Ground truth rel. judgements
- d rel. to q iff d & q are on the same topic
Smoothing parameter
- λ = 0.7
Lattice pruning thresholds Θdoc and Θqry
- Varied independently on devel. queries
Stop word removal: used
- no stopping
- stopping with the gla stop list
- stopping with the smart stop list
Query-By-Example SDR: Experiments
Retrieval performed using
- 1-best trans. of exemplars & docs. (1-best → 1-best)
- exemplar 1-best trans., doc. lattices (1-best → Lat)
- exemplar lattices, doc. 1-best trans. (Lat → 1-best)
- lattice counts of exemplars & docs. (Lat → Lat): our proposed method
Also tried
- ref. trans. of exemplars & docs. (Ref → Ref)
- orig. Fisher topic specs. for queries (Top → Ref, Top → 1-best, Top → Lat)
Evaluation measure: MAP
Query-By-Example SDR: Experimental Results

MAP of test queries, no stop word removal:
- Orig. Fisher topic specs. as queries: Top → 1-best 0.8149 | Top → Lat 0.7613 | Top → Ref 0.7723
- Exemplars as queries: Ref → Ref 0.7468 | 1-best → 1-best 0.6958 | 1-best → Lat 0.7009 | Lat → 1-best 0.7023 | Lat → Lat 0.7079

Stat. significance testing — 1-tailed t-test, Wilcoxon test
- Lat → Lat vs. 1-best → 1-best: improvement significant at the 99.95% level
- However, the original topic specs. are still better — the nature of the exemplars presents difficulties for retrieval
Query-By-Example SDR: Experimental Results

MAP of test queries, with stop word removal:

Method | No stopping | gla stop list | smart stop list
Ref → Ref | 0.7468 | 0.7630 | 0.7781
1-best → 1-best | 0.6958 | 0.7193 | 0.7406
1-best → Lat | 0.7009 | 0.7283 | 0.7499
Lat → 1-best | 0.7023 | 0.7285 | 0.7487
Lat → Lat | 0.7079 | 0.7364 | 0.7569

- With the gla stop list: Lat → Lat better than 1-best → 1-best at the 99.99% level
- With the smart stop list: better at the 99.95% level
Our method (Lat → Lat) gives a consistent improvement
Conclusion
Contributions
- Proposed a novel SDR method — combines the use of lattices & a statistical IR model
  - Motivated by improved IR accuracy when each technique was used individually
  - The new method performs well compared to previous methods & lattice-based BM25
- Extended the proposed method to query-by-example SDR
  - Lattice-based query by example, under the statistical IR model
  - Significant improvement over using 1-best trans.
  - Consistently better, under a variety of setups
Conclusion
Suggestions for future work
- Incorporate proximity-based search into our method
- Formulate a more principled way of deriving lattice pruning thresholds
- Examine how stop words affect SDR & query by example
- Extend the stat. lattice-based SDR framework to other speech processing tasks, e.g. spoken document classification
Thank you!