A quick walk through phonetic databases

• Read English
  – TIMIT
  – Boston University Radio News

• Spontaneous English
  – Switchboard ICSI transcriptions
  – Buckeye Corpus (VIC)

TIMIT

• Read phonetically balanced sentences
  – Good coverage of different phonetic environments
  – Does not exhibit more radical reductions, dysfluencies seen in spontaneous speech

• Transcribers started from forced alignments, realigned

• Roughly 5 hours of speech
  – 630 speakers, 8 dialects, 10 sentences apiece

• Uses ARPAbet symbols
  – Separate stop/closure symbols
  – Symbol for epenthetic stop

• Cost: $100 for non-1993 LDC members
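
As a quick sanity check on the TIMIT figures above, the arithmetic can be sketched as follows. This is only a back-of-the-envelope calculation from the slide's rounded numbers ("roughly 5 hours"), so the average utterance length is an estimate, and the variable names are illustrative.

```python
# Figures taken from the slide; total_hours is the rounded "roughly 5 hours".
speakers = 630
sentences_per_speaker = 10
total_hours = 5

# Total number of read sentences in the corpus.
utterances = speakers * sentences_per_speaker

# Approximate average duration per sentence, in seconds.
avg_seconds = total_hours * 3600 / utterances

print(utterances)             # 6300
print(round(avg_seconds, 2))  # 2.86
```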

BU Radio Corpus

• Radio announcers reading news
  – 4 male, 3 female; reading in both “non-studio” and “studio” voices

• Originally intended for speech synthesis work
  – Marked with prosody in addition to phonetics
  – Marked with ARPAbet (similar to TIMIT)

• > 7 hours of speech

• Cost: $400 for non-1996/1997 LDC members

Switchboard ICSI Transcriptions

• Spontaneous speech, many dialect regions
  – Transcribed “segmented turns,” some of which may be cutoffs, from 2-party conversations
  – 4 hours of speech transcribed

• 2 stages:
  – Initial 1 hour phonetically transcribed
  – Hours 2-4 phonetic markers, syllable boundaries -- back aligned with phonetic markers

• Similar phoneset to TIMIT
  – No separate closure/release
  – Voiced hesitations (pn/pv)

• Cost: possibly free, possibly $2k for non-1993/7
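
Because TIMIT labels stop closures and releases separately while the Switchboard transcriptions do not, comparing the two requires collapsing TIMIT's closure/release pairs. The following is a minimal sketch of one way to do that, not a method from the slides. It assumes segments are given as (start, end, label) tuples and uses TIMIT's closure labels (bcl, dcl, gcl, pcl, tcl, kcl); the function name `merge_closures` and the choice to map an unreleased closure to the bare stop label are illustrative assumptions.

```python
# TIMIT closure label -> release labels that may follow it.
# Affricates (ch, jh) are assumed to follow tcl/dcl, per TIMIT's convention.
CLOSURE_RELEASES = {
    "bcl": ("b",),
    "dcl": ("d", "jh"),
    "gcl": ("g",),
    "pcl": ("p",),
    "tcl": ("t", "ch"),
    "kcl": ("k",),
}


def merge_closures(segments):
    """Collapse closure + release pairs into a single segment.

    segments: list of (start, end, label) tuples in time order.
    A closure followed by a matching release becomes one segment carrying
    the release label; a closure with no matching release is relabeled as
    the plain stop (an assumption made for this sketch).
    """
    merged = []
    i = 0
    while i < len(segments):
        start, end, label = segments[i]
        releases = CLOSURE_RELEASES.get(label)
        if releases is not None:
            nxt = segments[i + 1] if i + 1 < len(segments) else None
            if nxt is not None and nxt[2] in releases:
                # Extend the closure through the end of its release.
                end, label = nxt[1], nxt[2]
                i += 1
            else:
                # Unreleased closure: fall back to the plain stop label.
                label = releases[0]
        merged.append((start, end, label))
        i += 1
    return merged


example = [
    (0, 10, "h#"),
    (10, 20, "tcl"),
    (20, 25, "t"),
    (25, 40, "iy"),
    (40, 50, "pcl"),
    (50, 60, "h#"),
]
print(merge_closures(example))
```

In the example, the `tcl` + `t` pair becomes a single `t` spanning 10 to 25, and the unreleased `pcl` is relabeled `p`. Other differences between the phonesets (for instance the pn/pv hesitation markers) would need their own mapping.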

VIC (Buckeye) Corpus

• Spontaneous interview speech
  – Age, gender balanced
  – All speakers from Ohio

• Currently in transcription
  – NIH grant involving Keith, me, and Mark Pitt
  – 10 hours completed, 30 hours total

• Based on ARPAbet with a few additions
  – Nasalized vowels, glottal stop replacing /t/,…

• Cost: free (to us) -- might need to work out licensing but shouldn’t be an issue.

Evaluating with Corpora

• Clear thing to do is to start with TIMIT
  – Facilitates comparison with other things

• However, we should really try to insert spontaneous data into research ASAP
  – Maybe move to some combination of TIMIT/SWB/VIC?

• Only talked about (American) English
  – Other languages in year 4?
  – Chin has done some work in Mandarin?

• CASS corpus: phonetically transcribed, but available?