
DU, C-SIIT 1

Collecting and Transcribing Real Chinese Spontaneous Telephone Speech Corpus

Limin Du, Chair Professor

Director, Center for Speech Interactive Information Technology, Institute of Acoustics, Chinese Academy of Sciences

October 21, 2000

Background

Spontaneous speech interaction over the telephone is a very promising application. Building speech recognition systems for telephone applications requires handling the variation in acoustics and speaking styles.

There is no large-scale Chinese spontaneous telephone speech corpus available for research.
– Simulated telephone speech corpus (1997, C-SIIT, IOA, CAS): a microphone speech corpus is passed through a telephone pipeline to obtain telephone speech
– Collecting real telephone speech data seems to be a formidable task: laws, costs

The Chinese-English speech translation (CEST) project, a collaboration between CAS and AT&T (1998-2003), is a strong driver for this work.

Real Telephone Speech Collection

A "dialogue oriented" collection paradigm
– Human-human conversations
– Human-machine dialogues

[Diagram: a caller phones either a real information service center or hotel information desk through a computer-phone link, OR a caller with a dialogue card phones a simulated human or machine service agent; the calls are recorded to computer data storage.]


Speech Data Processing

Sampling
– 8 kHz sampling
– 16-bit A/D quantization

Utterance segmentation
– One utterance per speaker turn (a new utterance starts at each speaker switch)
– Utterances average 3 seconds in length
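The sampling figures above fix the raw data rate, which can be worked out with a minimal sketch. It assumes uncompressed linear PCM and a single channel; neither is stated on the slide, and the function name is illustrative only.

```python
SAMPLE_RATE_HZ = 8_000   # from the slide: 8 kHz sampling
BITS_PER_SAMPLE = 16     # from the slide: 16-bit A/D quantization
CHANNELS = 1             # assumption: single-channel recording


def pcm_bytes(duration_s: float) -> int:
    """Size in bytes of uncompressed linear PCM audio of the given duration."""
    bytes_per_second = SAMPLE_RATE_HZ * (BITS_PER_SAMPLE // 8) * CHANNELS
    return round(duration_s * bytes_per_second)


# An average 3-second utterance under these assumptions:
print(pcm_bytes(3.0))  # 48000
```

Under these assumptions one second of speech takes 16,000 bytes, so an average utterance is roughly 48 KB.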

Speech Data Transcription

– What to Label?
– How to Label?

What to Label?

Information about speakers and environments
– Speaker's dialect, mood, gender, speech quality

Transcription
– Chinese characters
– Pinyin
– Other acoustic event labels: laugh, lip smack, throat clearing, breath, cough, filled pauses, telephone adjusting, background speech, etc.

Time stamps
– Other acoustic events are bracketed with time stamps automatically when transcribing with a special software tool

Detailed Issues Concerned

Mispronunciation
– Mispronunciation often occurs in daily life. For example, a speaker may read the Chinese character "山" (whose correct pronunciation is "shan1") as "san2". In such a case, the associated speech segment is transcribed as "山(san2)" to present both the right text and the real pronunciation.

Numbers
– Arabic numerals are a natural way to write numbers, but they cannot be mapped to a single pronunciation. So transcribers are required to transcribe all numbers with Chinese characters.
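The mispronunciation convention above keeps the intended character and the spoken pinyin side by side, so both can be recovered mechanically. The following is a minimal sketch, not the project's tool: it assumes ASCII parentheses, lowercase pinyin, and a tone digit from 1 to 5, and the function name is hypothetical.

```python
import re

# Assumed notation: one Chinese character followed by the pinyin actually
# spoken, in ASCII parentheses, e.g. 山(san2). Tone digits 1-5 are assumed.
MISPRONOUNCED = re.compile(r"([\u4e00-\u9fff])\(([a-z]+[1-5])\)")


def split_text_and_pronunciation(transcript: str):
    """Return (clean_text, [(character, spoken_pinyin), ...])."""
    spoken = MISPRONOUNCED.findall(transcript)
    # Drop the parenthesised pinyin, keeping only the intended character.
    clean_text = MISPRONOUNCED.sub(r"\1", transcript)
    return clean_text, spoken


text, spoken = split_text_and_pronunciation("上山(san2)")
print(text)    # 上山
print(spoken)  # [('山', 'san2')]
```

The clean text could feed language modeling while the recorded pinyin stays available for acoustic modeling; that split is one possible use, not something the slides state.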

Other Acoustic Events

File      Recognition result   Auditory judgment
PAUSE1    AI                   [UH]
PAUSE14   AI                   [UH]
PAUSE12   A                    [UNG]
PAUSE33   KA A                 [UNG]
PAUSE20   ANG                  [UNG]
PAUSE26   ANG                  [UNG]
PAUSE19   AN                   [EN]
PAUSE4    CHA                  [AO]
PAUSE18   GAN                  [UH]
PAUSE21   HE                   [EN]
PAUSE27   NE                   [EN]
PAUSE22   YUN                  [UM]
PAUSE34   LENG                 [UH]
PAUSE15   TONG                 [UH]

Other Acoustic Events (cont.)

File      Recognition result   Auditory judgment
PAUSE31   NONG                 [EN]
PAUSE17   HEN                  [EN]
PAUSE24   EN                   [EN]

– [AA]
– [AI]
– [EN]
– [UH]
– [AO]
– [SIL] (silent segment)
– [NOISE]
– [LAUGH]
– [ANG] [BREATH] (breath)
– [HESITATION] (hesitation)

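Transcription Example

<BeginStamp 0> [FILLER] <EndStamp 257> <BeginStamp 260> [NOISE] <EndStamp 928> “北京游乐园怎么走” 东直门到哪 “北京游乐园” 北京游乐园是吗 “<BeginStamp 5933> [FILLER] <EndStamp 6250>” <BeginStamp 6228> [FILLER] <EndStamp 6386> 稍等

English gloss of the spoken text, in order: "How do I get to Beijing Amusement Park?", "Dongzhimen to where?", "Beijing Amusement Park", "Beijing Amusement Park, is it?", "One moment".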
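The time-stamped events in the example line can be pulled out with a short regular expression. This is a sketch based only on the tag shapes visible in the example; the function name is hypothetical, and the unit of the stamp values is not given in the slides, so they are kept as plain integers.

```python
import re

# One time-stamped event as it appears in the example line:
# <BeginStamp N> [LABEL] <EndStamp M>, with optional spaces between the parts.
EVENT = re.compile(r"<BeginStamp (\d+)>\s*\[(\w+)\]\s*<EndStamp (\d+)>")


def extract_events(line: str):
    """Return (label, begin, end) for every bracketed, time-stamped event."""
    return [(label, int(begin), int(end))
            for begin, label, end in EVENT.findall(line)]


line = ("<BeginStamp 0>[FILLER]<EndStamp 257> "
        "<BeginStamp 260> [NOISE] <EndStamp 928>"
        "“北京游乐园怎么走”")
print(extract_events(line))
# [('FILLER', 0, 257), ('NOISE', 260, 928)]
```

Spoken text between events is ignored here; a fuller parser would also need the speaker-turn convention, which the example only hints at through its quotation marks.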

How to Label?

Improving transcribers' efficiency and reducing the chance of errors
– A labeling tool was developed specially for this task.

Training transcribers
– Transcribers are usually our employees who have assisted speech research for more than one year and have good working records
– Part-time employees are trained by our employees before they start working

Statistical Results in General

Chinese Spontaneous Telephone Speech Corpus (CSTSC)

# of speakers                      600
# of h-h dialogues                 1000
# of h-m dialogues                 38
Average duration per dialogue      3.5 minutes
Sampling of speech                 8 kHz
Quantization of speech             16 bits
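The figures above imply a rough total corpus size. The sketch below assumes the 3.5-minute average applies to both human-human and human-machine dialogues and that the audio is stored as single-channel uncompressed PCM; neither assumption is stated in the slides, and the function name is illustrative.

```python
def corpus_size(n_dialogues: int, avg_minutes: float,
                sample_rate_hz: int = 8_000, bits_per_sample: int = 16):
    """Return (hours of audio, bytes of uncompressed mono PCM)."""
    seconds = n_dialogues * avg_minutes * 60
    n_bytes = round(seconds * sample_rate_hz * (bits_per_sample // 8))
    return seconds / 3600, n_bytes


# 1000 h-h dialogues plus 38 h-m dialogues, 3.5 minutes each on average.
hours, n_bytes = corpus_size(1000 + 38, 3.5)
print(round(hours, 2))  # 60.55
print(n_bytes)          # 3487680000
```

Under these assumptions the corpus holds about 60 hours of speech, or roughly 3.5 GB of raw audio.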

Statistical Results in Detail

180 human-human dialogues, 38 human-machine dialogues

Special event    Count   Explanation
Numbers          700     Numbers
Filled pauses    5900    Short non-silence disfluencies, such as [um], [uh], [eh], [ou]
Hesitation       300     Short silence in the context of disfluencies
Laugh            109     Laughter
Breath           98      Breath
Bksound          2300    The caller speaks in an evidently noisy environment
MutiSound        570     The caller and the service agent speak at the same time
Barge_in         68      The speaker barges in on the system's prompt
Echo             30      Echo of the machine's prompt
Noise            2000    Non-speech noise and background speech noise
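To compare how common each event type is, the counts can be normalized by the number of dialogues analysed. This is a rough sketch: it pools all 218 dialogues as the denominator, which is a simplification, since events such as Barge_in and Echo presumably arise only in human-machine dialogues.

```python
# Counts copied from the statistics above; 180 h-h + 38 h-m dialogues.
N_DIALOGUES = 180 + 38
EVENT_COUNTS = {
    "Filled pauses": 5900, "Bksound": 2300, "Noise": 2000, "Numbers": 700,
    "MutiSound": 570, "Hesitation": 300, "Laugh": 109, "Breath": 98,
    "Barge_in": 68, "Echo": 30,
}


def per_dialogue(counts: dict, n_dialogues: int) -> dict:
    """Average number of occurrences of each event per dialogue."""
    return {name: count / n_dialogues for name, count in counts.items()}


rates = per_dialogue(EVENT_COUNTS, N_DIALOGUES)
# Print the events from most to least frequent.
for name, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{name:14s}{rate:6.1f}")
```

Filled pauses dominate at about 27 per dialogue, which is consistent with the attention the slides give to filled-pause labels.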

Summary

C-SIIT, CAS started work on building telephone speech corpora three years ago under a very limited budget.

The efforts and experience in collecting a real Chinese telephone speech corpus have been introduced.

C-SIIT will continue its activity on real Chinese telephone and mobile phone speech corpora, and will try its best to release to the public most of the corpora already built, being built, or planned.

Suggestions and comments from all of you are appreciated.

Thanks!
