A quick walk through phonetic databases
• Read English
  – TIMIT
  – Boston University Radio News
• Spontaneous English
  – Switchboard ICSI transcriptions
  – Buckeye Corpus (VIC)
TIMIT
• Read phonetically balanced sentences
  – Good coverage of different phonetic environments
  – Does not exhibit the more radical reductions and disfluencies seen in spontaneous speech
• Transcribers started from forced alignments, then realigned them
• Roughly 5 hours of speech
  – 630 speakers, 8 dialects, 10 sentences apiece
• Uses ARPAbet symbols
  – Separate stop/closure symbols
  – Symbol for epenthetic stop
• Cost: $100 for non-1993 LDC members
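As a quick sanity check on the size figures above, the arithmetic can be sketched as follows. The constants are the slide's own numbers; the 5-hour total is approximate, so the average duration is only a rough estimate (about 6,300 utterances at just under 3 seconds each).

```python
# Figures taken from the slide; the 5-hour total is approximate.
SPEAKERS = 630
SENTENCES_PER_SPEAKER = 10
TOTAL_HOURS = 5.0

# Total number of read sentences in the corpus.
utterances = SPEAKERS * SENTENCES_PER_SPEAKER

# Rough average utterance length in seconds.
avg_seconds = TOTAL_HOURS * 3600 / utterances

print(f"{utterances} utterances, about {avg_seconds:.1f} s each on average")
```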
BU Radio Corpus
• Radio announcers reading news
  – 4 male, 3 female; reading in both “non-studio” and “studio” voices
• Originally intended for speech synthesis work
  – Marked with prosody in addition to phonetics
  – Marked with ARPAbet (similar to TIMIT)
• > 7 hours of speech
• Cost: $400 for non-1996/1997 LDC members
Switchboard ICSI Transcriptions
• Spontaneous speech, many dialect regions
  – Transcribed “segmented turns,” some of which may be cutoffs, from 2-party conversations
  – 4 hours of speech transcribed
• 2 stages:
  – Initial 1 hour phonetically transcribed
  – Hours 2-4: phonetic markers and syllable boundaries, back-aligned with phonetic markers
• Similar phone set to TIMIT
  – No separate closure/release
  – Voiced hesitations (pn/pv)
• Cost: possibly free, possibly $2k for non-1993/7
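Because TIMIT labels stop closures and releases separately while the Switchboard transcriptions do not, comparing the two requires collapsing TIMIT's closure symbols. The following is a minimal sketch of one way that could be done, not the procedure used by either project. The function name `merge_closures` is hypothetical, and the handling of unreleased stops and of closures before affricates is an assumption made for illustration.

```python
# TIMIT closure symbols and the stop each one belongs to.
CLOSURE_TO_STOP = {
    "bcl": "b", "dcl": "d", "gcl": "g",
    "pcl": "p", "tcl": "t", "kcl": "k",
}
AFFRICATES = {"ch", "jh"}


def merge_closures(labels):
    """Collapse TIMIT closure+release pairs into single stop labels.

    Assumptions (illustrative only):
      - closure followed by its release  -> one stop label
      - closure followed by an affricate -> closure is dropped
      - closure with no release          -> treated as the stop itself
    """
    merged = []
    i = 0
    while i < len(labels):
        label = labels[i]
        stop = CLOSURE_TO_STOP.get(label)
        nxt = labels[i + 1] if i + 1 < len(labels) else None
        if stop is None:
            # Not a closure: keep the label unchanged.
            merged.append(label)
            i += 1
        elif nxt == stop:
            # Closure plus matching release: emit one stop, skip both.
            merged.append(stop)
            i += 2
        elif nxt in AFFRICATES:
            # Closure belongs to the following affricate: drop it.
            i += 1
        else:
            # Unreleased stop: keep it as the plain stop symbol.
            merged.append(stop)
            i += 1
    return merged


print(merge_closures(["pcl", "p", "aa", "tcl"]))
print(merge_closures(["tcl", "ch", "iy"]))
```

Under these assumptions the first call yields `['p', 'aa', 't']` and the second `['ch', 'iy']`. A real mapping would also need to settle how the epenthetic symbol and the voiced hesitations are treated.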
VIC (Buckeye) Corpus
• Spontaneous interview speech
  – Age, gender balanced
  – All speakers from Ohio
• Currently in transcription
  – NIH grant involving Keith, me, and Mark Pitt
  – 10 hours completed, 30 hours total
• Based on ARPAbet with a few additions
  – Nasalized vowels, glottal stop replacing /t/, …
• Cost: free (to us); licensing might need to be worked out, but that shouldn’t be an issue.
Evaluating with Corpora
• The obvious first step is to start with TIMIT
  – Facilitates comparison with other work
• However, we should really try to bring spontaneous data into the research ASAP
  – Maybe move to some combination of TIMIT/SWB/VIC?
• Only talked about (American) English
  – Other languages in year 4?
  – Chin has done some work in Mandarin?
• CASS corpus: phonetically transcribed, but available?