Chinese Spoken Document Retrieval and Organization

2006/08/03

Chinese Spoken Document Retrieval and Organization

Berlin Chen ( 陳柏琳 )

Associate ProfessorDepartment of Computer Science ＆ Information Engineering

National Taiwan Normal University

2

Outline

• Introduction• Information Retrieval Models• Spoken Document Summarization• Spoken Document Organization• Language Model Adaptation for Spoken Document

Transcription• Conclusions and Future Work

3

Introduction (1/2)

• Multimedia associated with speech is continuously growing and filling our computers and lives– Such as broadcast news, lectures, historical archives, voice

mails, meeting records, spoken dialogues etc.– Speech is the most semantic (or information)-bearing

• On the other hand, speech is the primary and the most convenient means of communication between people– Speech provides a better (or natural) user interface in wireless

environments and on smaller hand-held devices

• Multimedia information access using speech will be very promising in the near future

4

Introduction (2/2)

Information Extraction and Retrieval

(IE & IR)

Spoken Dialogues

Spoken DocumentUnderstanding and

Organization

MultimediaNetworkContent

NetworksUsers

˙

Named EntityExtraction

Segmentation

Topic Analysisand Organization

Summarization

Title Generation

InformationRetrieval

Two-dimensional Tree Structurefor Organized Topics

Chinese Broadcast News Archive

retrievalresults

titles,summaries

Input Query

User Instructions

Named EntityExtraction

Segmentation

Topic Analysisand Organization

Summarization

Title Generation

InformationRetrieval

Two-dimensional Tree Structurefor Organized Topics

Chinese Broadcast News Archive

retrievalresults

titles,summaries

Input Query

User Instructions

N

i

d

d

d

d

.

.

.

.2

1

K

k

T

T

T

T

.

.2

1

n

j

t

t

t

t

.

.2

1

idP ik dTP kj TtP

documents latent topics query

nj ttttQ ....21Q

N

i

d

d

d

d

.

.

.

.2

1

N

i

d

d

d

d

.

.

.

.2

1

K

k

T

T

T

T

.

.2

1

K

k

T

T

T

T

.

.2

1

n

j

t

t

t

t

.

.2

1

n

j

t

t

t

t

.

.2

1

idP ik dTP kj TtP

documents latent topics query

nj ttttQ ....21Q

Multimodal Interaction

Multimedia Content Processing

• Scenario of Multimedia Access using Speech

5

Related Research Work

• Spoken document transcription, organization and retrieval have been active focuses of much research in the recent past [R3, R4]– Informedia System at Carnegie Mellon Univ.– Multimedia Document Retrieval Project at Cambridge Univ.– Rough’n’Ready System at BBN Technologies– SpeechBot Audio/Video Search System at HP Labs– Spontaneous Speech Corpus and Processing Technology Proje

ct at Japan, etc.– ….

6

Related Research Issues (1/2)

• Transcription of Spoken Documents– Automatically convert speech signals into sequences of words or

other suitable units for further processing

• Audio Indexing and Information Retrieval– Robust representation of the spoken document

– Matching between (spoken) queries and spoken documents

• Named Entity Extraction from Spoken Documents – Personal names, organization names, location names, event

names

– Very often out-of-vocabulary (OOV) words, difficult for recognition

• Spoken Document Segmentation– Automatically segment spoken documents into short paragraphs,

each with a central topic

7

Related Research Issues (2/2)

• Information Extraction for Spoken Documents– Extraction of key information such as who, when, where, what

and how for the information described by spoken documents

• Summarization for Spoken Documents– Automatically generate a summary (in text/speech form) for each

short paragraph

• Title Generation for Multi-media/Spoken Documents– Automatically generate a title (in text/speech form) for each short

paragraph; i.e., a very concise summary indicating the topic area

• Topic Analysis and Organization for Spoken Documents– Analyze the subject topics for the short paragraphs; or organize

the subject topics of the short paragraphs into graphic structures for efficient browsing

8

An Example System for Chinese Broadcast News (1/2)

• For example, a prototype system developed at NTU for efficient spoken document retrieval and browsing [R4]

9

• Users can browse spoken documents in top-down and bottom-up manners

– http://sovideo.iis.sinica.edu.tw/SLG/ngsst/ngsst-index.html

An Example System for Chinese Broadcast News (2/2)

10

Information Retrieval Models

• Information retrieval (IR) models can be characterized by two different matching strategies– Literal term matching

• Match queries and documents in an index term space– Concept matching

• Match queries and documents in a latent semantic space

香港星島日報篇報導引述軍事觀察家的話表示，到二零零五年台灣將完全喪失空中優勢，原因是中國大陸戰機不論是數量或是性能上都將超越台灣，報導指出中國在大量引進俄羅斯先進武器的同時也得加快研發自製武器系統，目前西安飛機製造廠任職的改進型飛豹戰機即將部署尚未與蘇愷三十通道地對地攻擊住宅飛機，以督促遇到挫折的監控其戰機目前也已經取得了重大階段性的認知成果。根據日本媒體報導在台海戰爭隨時可能爆發情況之下北京方面的基本方針，使用高科技答應局部戰爭。因此，解放軍打算在二零零四年前又有包括蘇愷三十二期在內的兩百架蘇霍伊戰鬥機。

中共新一代空軍戰

力relevant ?

11

IR Models: Literal Term Matching (1/2)

• Vector Space Model (VSM)– Vector representations are used for queries and documents– Each dimension is associated with a index term and usually

represented by TF-IDF weighting– Cosine measure for query-document relevance

– VSM can be implemented with an inverted file structure for efficient document search

x

y

ni Qi

ni ji

ni qiji

j

j

j

ww

ww

QD

QD

QDsim

12,1

2,

1 ,,

||||)(cosine

),(

jD

Q

12

IR Models: Literal Term Matching (2/2)

• Hidden Markov Models (HMM) [R1]– Also known as Language Modeling (LM) approaches– Each document is a probabilistic generative model consisting of a

set of N-gram distributions for predicting the query

– Models can be optimized by the expectation-maximization (EM) or minimum classification error (MCE) training algorithms

– Such approaches do provide a potentially effective and theoretically attractive probabilistic framework for studying IR problems

Li wwwwQ ....21 3m

4m

ji DwP

CwP i

jii DwwP ,1

CwwP ii ,1

1m

2mQuery

,,

14132

21

1211

CwwPmDwwPmCwPmDwPm

CwPmDwPmDQP

iijii

L

iiji

jj

13

IR Models: Concept Matching (1/3)

• Latent Semantic Analysis (LSA) [R2]– Start with a matrix describing the intra- and Inter-document

statistics between all terms and all documents– Singular value decomposition (SVD) is then performed on the

matrix to project all term and document vectors onto a reduced latent topical space

– Matching between queries and documents can be carried out in this topical space

=

mxn

kxk

mxk

kxn

A Umxk

Σk VTkxm

w1

w2

wi

wm

D1 D2 Di Dn

w1

w2

wi

wm

D1 D2 Di Dn

111ˆ

kkkmm

Tk Uqq

dq

dqdqcoinedqsim

T

ˆˆ

ˆˆ)ˆ,ˆ(ˆ,ˆ

2

14


• Probabilistic Latent Semantic Analysis (PLSA) [R5, R6]– An probabilistic framework for the above topical approach

– Relevance measure is not obtained directly from the frequency of a respective query term occurring in a document, but has to do with the frequency of the term and document in the latent topics

– A query and a document thus may have a high relevance score even if they do not share any terms in common

kjk

kk

kjkk

j

DTPQTP

DTPQTPDQR

22),(

=

D1 D2 Di Dn

mxn

kxk

mxk

rxn

PUmxk

Σk VTkxn

ki TwP

kTPw1

w2

wi

wm

kj TDP

ij DwP ,

w1

w2

wj

wm

D1 D2 Di Dn

kjkk

kiji TDPTPTwPDwP ,

15


• PLSA also can be viewed as an HMM model or a topical mixture model (TMM)

– Explicitly interpret the document as a mixture model used to predict the query, which can be easily related to the conventional HMM modeling approaches widely studied in speech processing community (topical distributions are tied among documents)

– Thus quite a few of theoretically attractive model training algorithms can be applied in supervised or unsupervised manners

1TwP i

jDTP 2

A document model

Query

Li wwwwQ ....21

jDTP 1

jK DTP

2TwP i

Ki TwP

L

i

K

kjkkij DTPTwPDQP

1 1

16

Preliminary IR Evaluations

• Experiments were conducted on TDT2/TDT3 spoken document collections [R6]– TDT2 for parameter tuning/training, while TDT3 for evaluation– E.g., mean average precision (mAP) tested on TDT3

• HMM/PLSA/TMM are trained in a supervised manner

• Language modeling approaches (TMM/PLSA/HMM) are evidenced with significantly better results than that of conventional statistical approaches (VSM/LSA) for the above spoken document retrieval (SDR) task

TMM HMM PLSA VSM LSA

TD 0.7870 0.7174 0.6882 0.6505 0.6440

SD 0.7852 0.7156 0.6688 0.6216 0.6390

17

Spoken Document Summarization (1/3)

• Extractive Summarization– Extract a number of indicative sentences form the original spoken

document– A lightweight process as compared with abstractive summarization

• Use TMM to explore the relationships between the handcrafted titles and the documents of the contemporary newswire text collection for selecting salient sentences form spoken documents– Build a probabilistic latent topical space

1TwP i

lgSTP ,2

A sentence model

Document

gLig wwwwD ....21

lgSTP ,1

lgK DTP ,

2TwP i

Ki TwP

18


• An illustration of the TMM-based extractive summarization

• The sentence model can be obtained by probabilistic “fold-in”

H1

H2D2

HMDM

T1

T2

TK

Sg,1

Sg,2

Sg,3

Sg,L

Spoken Document Dg

Sg,l

D1

T1

T2

TK

Text Documents

Sentence

ji

ji

Dwc

Dw

K

kjkkijj HTPTwPHDP

,

1

gi

gi

Dwc

Dw

K

klgkkilgg STPTwPSDP

,

1,,

HjDj

lgn

lgi

Swlgi

Swlgikgi

lgk Swc

SwTPDwc

STP

,

,

,

,

, ,

,,

K

klgkki

lgkkilgik

STPTwP

STPTwPSwTP

1,

,,, where,

19


• Preliminary Tests on 200 radio broadcast news stories collected in Taiwan (automatic transcripts with 14.17% character error rate) – ROUGE-2 measure was used to evaluate the performance levels

of different models

• TMM yields considerable performance gains for the results with lower summarization ratios

VSM LSA-1 LSA-2 SigSen TMM Random

10% 0.2603 0.1872 0.24125 0.1976 0.2808 0.1466

20% 0.2739 0.2089 0.25231 0.2332 0.2837 0.1696

30% 0.3092 0.2288 0.26693 0.2724 0.3100 0.1956

50% 0.4107 0.3772 0.35121 0.3883 0.4266 0.3436

20

Spoken Document Organization (1/3)

• Each document is viewed as a TMM model to generate itself – Additional transitions between topical mixtures have to do with

the topological relationships between topical classes on a 2-D map

Two-dimensional Tree Structure

for Organized Topics

2

2

2

,exp

2

1,

lk

klTTdist

TTE

K

sks

klkl

TTE

TTEYTP

1,

,

jLij wwwwD ....21

1TwP i jDTP 1

jK DTP

2TwP i

Ki TwP

jDTP 2

A document model

Document

K

k

K

llnklikii TwPYTPDTPDwP

1 1

21


• Document models can be trained in an unsupervised way by maximizing the total log-likelihood of the document collection

• Initial evaluation results (conducted on the TDT2 collection)

– TMM-based approach is competitive to the conventional Self-Organization Map (SOM) approach

V

ijiji

n

jT DwPDwcL

11log,

Model Iterations distBet/distWithin

TMM

10 1.9165

20 2.0650

30 1.9477

40 1.9175

SOM 100 2.0604

22


• Each topical class can be labeled by words selected using the following criterion

• An example map for international political news

n

ijkji

n

jjkji

ki

DTPDwc

DTPDwc

TwSig

1

1

1,

,

,

23

Dynamic Language Model Adaptation (1/3)

• TMM also can be applied to dynamic language model adaptation for spoken document recognition

– In word graph rescoring, the history of each newly occurring word can be treated as a document used to predict itself

iwH

iw iw

K

kwkkiwiTMM ii

HTPTwPHwP1

24

Automatic Speech Recognition (ASR)

• A search process to uncover the word sequence that has the maximum posteriorprobability

m21 w,...,wwˆ W

XWP

WXW

X

WXW

XWW

W

W

W

PP

P

PP

Pˆ

max arg

max arg

max arg

Language Model Probability Acoustic Model Probability

Unigram:

Bigram:

Trigram:

1j

j1j1jj1kk121k21 wC

wwCwwP,wwP...wwPwPw..wwP

i

i

jjk21k21 wC

wCwP,wP...wPwPw..wwP

1j2j

j1j2j2k1kk1k2kk213121k21 wwC

wwwCwwwP,wwwP...wwwPwwPwPw..wwP

N-gramLanguage Modeling

N21i

mi21

,.....,v,vv:Vw

w,...,w,..w,w

where W

25


• For each history model, the probability distributions is trained beforehand and kept fixed during rescoring, while the probability can be dynamically estimated (folded-in)

• TMM LM can be incorporated with background N-gram LM via probability interpolation or scaling– Probability Interpolation

– Probability Scaling

ki TwP

i

iwnii

iw

Hwwnkwn

wkH

HwTPHwn

HTP

,,

ˆ

iwk HTP

K

llnwl

knwk

wnk

TwPHTP

TwPHTPHwTP

i

i

i

1

, where

12121 1 ~

iiiBGwiTMMiiiAdapt wwwPHwPwwwPi

V

llBGwlTMMiilBG

iBGwiTMMiiiBGiiiAdapt

wPHwPwwwP

wPHwPwwwPwwwP

i

i

112

12122

~

26


• Preliminary Tests on 500 radio broadcast news stories

– TMM is competitive to conventional MAP-based LM adaptation, and is superior than LSA

– Combination of TMM (topical information) and MAP (contextual information) approaches yields additional performance gains

CER (%)

PP

Baseline (Background Trigram LM) 15.22 752.49

+TMM (Interpolation) 14.47 502.50

+TMM (Scaling) 13.87 493.26

+LSA (Scaling) 15.19 676.67

+MAP In-domain Trigram 13.74 430.59

+TMM (Scaling) + MAP In-domain Trigram 13.27 306.75

27

Prototype Systems Developed at NTNU (1/3)

• Spoken Document Retrieval

http://sdr.csie.ntnu.edu.tw

28


• Speech Retrieval and Browsing of Digital Historical Archives

29


• Systems Currently Implemented with a Client-Server Architecture

InvertedFiles

Multi-ScaleIndexer

Automatically TranscribedBroadcast News Corpus

InformationRetrieval Server

MandarinLVCSR Server

PDA Client

Audio StreamingServer

Word-levelIndexingFeatures

Syllable-levelIndexingFeatures

30

Conclusions and Future Work

• Multimedia information access using speech will be very promising in the near future– Speech is the key for multimedia understanding and organization– Several task domains still remain challenging

• PLSA/TMM-based modeling approaches using probabilistic latent topical information are promising for language modeling, information retrieval, document summarization, segmentation, organization, etc.– A unified framework for spoken document recognition, organization

and retrieval was initially investigated

31

References

[R1] B. Chen et al., “A Discriminative HMM/N-Gram-Based Retrieval Approach for Mandarin Spoken Documents,” ACM Transactions on Asian Language Information Processing, Vol. 3, No. 2, June 2004

[R2] J.R. Bellegarda, “Latent Semantic Mapping,” IEEE Signal Processing Magazine, Vol. 22, No. 5, Sept. 2005

[R3] K. Koumpis and S. Renals, “Content-based access to spoken audio,” IEEE Signal Processing Magazine, Vol. 22, No. 5, Sept. 2005

[R4] L.S. Lee and B. Chen, “Spoken Document Understanding and Organization,” IEEE Signal Processing Magazine, Vol. 22, No. 5, Sept. 2005

[R5] T. Hofmann, “Unsupervised Learning by Probabilistic Latent Semantic Analysis,” Machine Learning, Vol. 42, 2001

[R6] B. Chen, “Exploring the Use of Latent Topical Information for Statistical Chinese Spoken Document Retrieval,” Pattern Recognition Letters, Vol. 27, No. 1, Jan. 2006

Thank You!