Language & Speech Technology. Arjan van Hessen (+, *), Franciska de Jong (*), Roeland Ordelman (*). * Computer Science, University of Twente; + Speech & Language group, TeleCats


Page 1: Language & Speech Technology

Language & Speech Technology

Arjan van Hessen (+, *), Franciska de Jong (*), Roeland Ordelman (*)

* Computer Science, University of Twente
+ Speech & Language group, TeleCats

Page 2: Language & Speech Technology

Document Retrieval Using Intelligent Disclosure (DRUID)

Page 3: Language & Speech Technology

DRUID

“Developing Tools for the Indexing & Retrieval of Multi Media Content”

• Time-coded indexing with a Dutch speech recogniser
• Television news broadcasts (benchmark in international SDR research)
• Parallel sources available (teletext, auto-cues)

Page 4: Language & Speech Technology

Druid: what

• Extract information from non-textual content
• Classify and index the information
• Give access to the information via linked time codes
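Access via linked time codes can be pictured as an inverted index from words to the moments at which they were spoken. The following is a minimal sketch, not the DRUID implementation: the function names, the `(word, start_time)` input format, and the example time codes are all assumptions made for illustration.

```python
from collections import defaultdict

def build_time_index(transcript):
    """Map each word to the time codes (in seconds) at which it was recognised.

    `transcript` is a list of (word, start_time_seconds) pairs, a hypothetical
    stand-in for time-coded recogniser output.
    """
    index = defaultdict(list)
    for word, start in transcript:
        index[word.lower()].append(start)
    return index

def lookup(index, query):
    """Return the sorted time codes at which `query` occurs (empty if unseen)."""
    return sorted(index.get(query.lower(), []))

# Invented example data: a few recognised words with made-up time codes.
transcript = [("premier", 12.4), ("Sharon", 12.9), ("kabinet", 41.0), ("premier", 95.2)]
index = build_time_index(transcript)
print(lookup(index, "Premier"))
```

A retrieval front end could use the returned time codes to jump straight to the matching point in the audio or video.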

Page 5: Language & Speech Technology

Druid: how

– Speech recognition
  • Large vocabulary, speaker independent
– Recognition of visual objects
– Story detection
– Linking to related information

Page 6: Language & Speech Technology

Large vocabulary recognition

Indexing & Retrieval

Page 7: Language & Speech Technology

Druid Speech recogniser

– ABBOT speech recogniser (Cambridge, Sheffield)
– Feature extraction
– Phone classification (NN)
– Word recognition (HMM)

Page 8: Language & Speech Technology

Broadcast news

– Pros
  • Easily available
  • Often high-quality, undisturbed speech
  • Availability of related sources (auto-cues, newspapers)
– Cons
  • Mixed languages
  • Different speech qualities (wide and narrow band) mixed together

Page 9: Language & Speech Technology

Development

– British English → Dutch
  • TNO-NRC corpus: 10h read speech (newspaper data)
– Additional phoneme training
  • Groningen corpus: 20h read speech
  • Speech Styles corpus: 16h spontaneous speech
– Final training
  • Broadcast corpus: 50 x “8 o’clock news” broadcasts (10h speech)
  • Corpus Spoken Dutch: 1000h spontaneous speech (to be done in 2002)

Page 10: Language & Speech Technology

Language modelling

• Acoustic recognition stops at a certain level
• Recognition can only improve further with:
  – Statistical language models (large vocabulary recognition)
  – Finite state grammars (small vocabulary recognition)

Page 11: Language & Speech Technology

Large vocabulary recognition

• Recognition is directed by
  – Acoustic features
  – Word frequency (= 65K most used words)
  – Bi-grams (65K² combinations)
  – Tri-grams (65K³ combinations)
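To give a feel for these numbers, the short calculation below works out how many bigrams and trigrams are possible and compares that with the size of the newspaper collection reported in the text collection table. It assumes that "65K" means exactly 65,000 words; the variable names are illustrative only.

```python
VOCAB = 65_000                 # assumption: "65K" taken as exactly 65,000 words
CORPUS_TOKENS = 250_843_334    # newspaper collection size from the text collection table

possible_bigrams = VOCAB ** 2
possible_trigrams = VOCAB ** 3

# A corpus of N tokens contains at most N - 1 bigram tokens, so this is an
# upper bound on the fraction of possible bigrams that could ever be observed.
max_bigram_fraction = (CORPUS_TOKENS - 1) / possible_bigrams

print(f"{possible_bigrams:,} possible bigrams")
print(f"{possible_trigrams:,} possible trigrams")
print(f"at most {max_bigram_fraction:.1%} of all bigrams can be observed")
```

Even under this generous upper bound, only a small fraction of the possible word pairs can occur in the training text, which is why large text collections and smoothing matter for statistical language models.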

Page 12: Language & Speech Technology

Large vocabulary recognition

• Building reliable acoustic features requires 100 hours of speech
• Building a reliable LM requires 10,000 hours of text
• Different context models (sport, finance, politics, etc.)

Page 13: Language & Speech Technology

Language modelling

Standard LM procedure

• text normalisation

Dutch diseases:

• spelling reform of the 1990s

• compounding

• foreign words

• increase of English

Page 14: Language & Speech Technology

Text collection

• Nederlandse Persdata bank
  – Electronic version of 4 major Dutch newspapers (1994-2002)
• NOS Auto-cues
  – Daily auto-cues of the 8 o’clock news and the news for children (1999-2002)
• TeleText
  – Daily recording of the teletext of the news, discussion & sport programs (1998-2002)
• WWW
  – Daily downloading of news providers & papers (2000-2002)

Page 15: Language & Speech Technology

Text collection

Year     Num of words    Num unique words
1994      25 760 248       330 249
1995      26 032 057       332 063
         (spelling reform)
1999      72 390 543       620 031
2000      92 562 356       704 805
2001      34 098 130       400 969
Total    250 843 334     1 289 865

Number of words in the newspaper collection after normalisation

Page 16: Language & Speech Technology

Phonetic transcriptions

• Phonetic dictionaries
  – Celex (300k, SAMPA)
  – VLIS database (1300k, Van Dale Data Format)
  – Rule-based decompounded-compounded dictionary (600k, SAMPA)
• G2P tool
  – Machine learning algorithm (vd Bosch)
  – 95% correct (without syllable & stress information)

Page 17: Language & Speech Technology

Text normalisation I

• Cleaning of punctuation marks
• Expansion
  – Numbers, abbreviations
• Statistical capital letter reduction
  – Rotterdam, rotterdam, ROTTERDAM → Rotterdam
  – KOK, Kok, kok → kok
• Spelling correction
  – Reduction of “doubles” caused by the spelling reform of the nineties (pannekoek → pannenkoek)
  – Removal, correction, or adding of accentuation marks
    • cafe, café, cafeé, cafë etc. → café
    • hét, hèt → het
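Two of the steps above, capital letter reduction and accent correction, can be sketched with a simple majority vote: within each group of variants, keep the form that occurs most often in the corpus. This is an illustrative assumption about what "statistical" could mean here, not the procedure actually used in DRUID, and it does not handle misspellings such as "cafeé" that differ by more than an accent.

```python
import unicodedata
from collections import Counter, defaultdict

def strip_accents(word):
    """Remove combining accent marks: 'café' -> 'cafe', 'hét' -> 'het'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def majority_variants(tokens, key):
    """For each key(token), pick the most frequent surface form in `tokens`."""
    groups = defaultdict(Counter)
    for tok in tokens:
        groups[key(tok)][tok] += 1
    return {k: counts.most_common(1)[0][0] for k, counts in groups.items()}

def normalise(tokens):
    # Pass 1: collapse case variants to the most frequent casing.
    case_map = majority_variants(tokens, key=str.lower)
    tokens = [case_map[t.lower()] for t in tokens]
    # Pass 2: collapse accent variants to the most frequent accentuation.
    accent_map = majority_variants(tokens, key=strip_accents)
    return [accent_map[strip_accents(t)] for t in tokens]

# Toy corpus with invented frequencies.
tokens = ["Rotterdam", "ROTTERDAM", "Rotterdam", "rotterdam", "Rotterdam",
          "café", "cafe", "café", "cafë", "het", "hét", "het"]
print(normalise(tokens))
```

On real text the vote would be taken over the whole collection, so that a city name keeps its capital while a common noun that is sometimes capitalised at the start of a sentence ends up in lower case.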

Page 18: Language & Speech Technology

Text normalisation II

• German and Dutch are “compound” languages
• Increased number of words
• Relatively high number of “new” words
  – (Eclipsbril = eclipse glasses)
• Lower lexical coverage → high OOV
  – LC = #words / #distinct words
  – OOV = 1 - LC
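The relation between coverage and OOV can be made concrete with a few lines of code. The sketch below assumes the definition used for the 20K and 65K coverage figures elsewhere in this deck: coverage is the fraction of running words that fall inside a vocabulary of the N most frequent words, and OOV is the remainder. The function name and toy sentence are made up, and for brevity the vocabulary is chosen and evaluated on the same text.

```python
from collections import Counter

def coverage_and_oov(tokens, vocab_size):
    """Return (coverage, oov_rate) for a vocabulary of the `vocab_size`
    most frequent words, measured on `tokens` itself."""
    counts = Counter(tokens)
    covered = sum(n for _, n in counts.most_common(vocab_size))
    coverage = covered / len(tokens)
    return coverage, 1.0 - coverage

# Toy example: 8 running words, vocabulary limited to the 2 most frequent.
tokens = "de kat en de hond en de eclipsbril".split()
cov, oov = coverage_and_oov(tokens, vocab_size=2)
print(f"coverage = {cov:.3f}, OOV = {oov:.3f}")
```

Rare compounds such as "eclipsbril" each occur only a few times, but together they push many running words outside a fixed-size vocabulary.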

Page 19: Language & Speech Technology

Text normalisation III

drugbeleid, drugbestrijding, drugbezit, drugdealer, drugdealers, drugdeals, drugdelict, drugdistributeur, druggebruik, druggebruiker, druggebruikers, drughandel, drugkartels, drugmisbruik, drugrunner, drugsaanpak, drugsactie, drugsacties, drugsactiviteiten, drugsadviseur, drugsafdeling, drugsaffaire, drugsaffaires, drugsafrekeningen, drugsattributen, drugsavonturen, drugsavontuur

drugsbaas, drugsbanden, drugsbaron, drugsbaronnen, drugsbazen, drugsbedrijf, drugsbeleid, drugsbende, drugsbendes, drugsbestaan, drugsbestellingen, drugsbestrijdend, drugsbestrijder, drugsbestrijders, drugsbestrijding, drugsbezit, drugsbezitters, drugsboef, drugsboeven, drugsbonzen, drugsbrigade, drugsbrigades, drugsbron, drugsbuisje, drugsbureau, drugsbusiness, drugsbuurt, drugscafé, drugscafés, drugscampagnes, drugscare, drugscircuit

drugsclans, drugsclip, drugscocktail, drugscocktails, drugsconferentie, drugsconflict, drugsconnecties, drugsconsument, drugsconsumptie, drugscontainers, drugscontrole, drugscontroles, drugsconventie, drugscriminaliteit, drugscrimineel, drugscriminelen, drugsdaglicht, drugsdeal, drugsdealen, drugsdealend, drugsdealende, drugsdealer, drugsdealers, drugsdeals, drugsdebat, drugsdelict, drugsdelicten, drugsdeskundige, drugsdiscussie, drugsdode, drugsdoden, drugsdollars

drugsdominee, drugsdood, drugsdossier, drugsdossiers, drugsdraaiboek, drugseconomie, drugseenheid, drugsellende, drugsexcessen, drugsexperiment, drugsexpert, drugsexperts, drugsexport, drugsfabricage, drugsfabrikanten, drugsfamilie, drugsfunctionaris, drugsgebied, drugsgebruik, drugsgebruiker, drugsgebruikers, drugsgebruikster, drugsgeld, drugsgelden, drugsgelieerde, drugsgeschiedenis, drugsgeschillen, drugsgewoonte, drugsgoeroe, drugsgroeperingen, drugsgrondstoffen, drugshaarden

drugshandel, drugshandelaar, drugshandelaars, drugshandelaarster, drugshandelaren, drugshandlangers, drugshel, drugshoertje, drugshol, drugshond, drugshonden, drugshoofdstad, drugshuizen, drugshulpverleners, drugshulpverlening, drugsimago, drugsimport, drugsindustrie, drugsinkomsten, drugsinstelling, drugsinval, drugsinvoer, drugsjacht, drugsjagende, drugsjager, drugsjaren, drugsjongens, drugskartel, drugskartels

Page 20: Language & Speech Technology

Text normalisation IV

• Decompounding
  – Low-frequency compounds are decompounded if decompounding improves the lexical coverage
  – 50% of the unique words that were not in one of the phonetic dictionaries could be successfully decompounded, although some errors were made:
    • zeeroverschatten → zeerover + schatten (correct) or zeerovers + chatten (error)
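A rough sketch of lexicon-based decompounding, including the ambiguity shown in the example above. It is not the rule-based procedure used for DRUID: the two-part restriction, the `max_freq` threshold, the minimum part length, and the toy lexicon are all assumptions chosen to keep the example short. Here ambiguous compounds are simply left unsplit.

```python
from collections import Counter

def two_part_splits(word, lexicon, min_part=3):
    """All ways to cut `word` into two parts that are both in `lexicon`."""
    return [(word[:i], word[i:])
            for i in range(min_part, len(word) - min_part + 1)
            if word[:i] in lexicon and word[i:] in lexicon]

def decompound(counts, lexicon, max_freq=50):
    """Replace low-frequency words that have exactly one candidate split by
    their parts; keep frequent, unsplittable, or ambiguous words as they are."""
    out = Counter()
    for word, n in counts.items():
        splits = two_part_splits(word, lexicon) if n <= max_freq else []
        if len(splits) == 1:
            for part in splits[0]:
                out[part] += n
        else:
            out[word] += n
    return out

# Toy lexicon and word counts (invented frequencies).
lexicon = {"drugs", "handel", "zeerover", "schatten", "zeerovers", "chatten"}
counts = Counter({"drugshandel": 10, "zeeroverschatten": 2, "de": 1000})

print(two_part_splits("zeeroverschatten", lexicon))
print(decompound(counts, lexicon))
```

The first call returns both candidate splits of "zeeroverschatten", which illustrates why a purely lexicon-driven splitter needs an extra criterion (frequency, part of speech, or manual checking) to avoid errors of this kind.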

Page 21: Language & Speech Technology

Most / least frequent words

TOP 10

• de 5532695
• van 2763280
• het 2535365
• en 2210685
• een 2146813
• in 1994480
• dat 1129136
• is 1080972
• op 957296
• te 897219

DOWN 10

• milko 39
• miljardenovername 39
• mifune's 39
• middeninkomen 39
• michelingids 39
• mexx 39
• metaalnijverheid 39
• metaaldetectoren 39
• mesquita 39
• mervyn 39

Page 22: Language & Speech Technology

Language modelling

[Bar chart: 20K and 65K lexical coverage (axis 90 to 100%) for UK, IT, FR, NL, D]

Language           UK      IT       FR         NL      D
Corpus             WSJ     Sole24   Le Monde   PDB     FR
#words             37M     27M      38M        22M     36M
#distinct words    165K    200K     280K       320K    650K
20K coverage       97.5%   96.3%    94.7%      93.0%   90.0%
65K coverage       99.6%   99.0%    98.3%      97.5%   95.1%

Page 23: Language & Speech Technology

Language modelling

Data                   # words        # unique words   Ratio
Original               146,564,949    933,296          157.04
After decompounding    149,628,378    628,114          238.22
Change                 +2.1%          -32.6%           +51.6%

Effect on the ratio after decompounding
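The ratio column is simply the number of running words divided by the number of unique words. A quick check, using the counts from the table (the function name is illustrative):

```python
def words_per_unique(n_words, n_unique):
    """Average number of occurrences per distinct word."""
    return n_words / n_unique

before = words_per_unique(146_564_949, 933_296)
after = words_per_unique(149_628_378, 628_114)
print(f"original: {before:.2f}, after decompounding: {after:.2f}")
```

A higher ratio means that each vocabulary entry is seen more often on average, which gives the language model more evidence per word.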

Page 24: Language & Speech Technology

Different language models

[Bar chart: probability (axis 0 to 45) per sub-category: Sport, Finance, Politics, Weather]

• First use the general LM to detect the sub-category
• Then use the matching topic LM (here: the politics LM) to improve the recognition results
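The two-pass idea can be sketched as follows: take the first-pass transcript produced with the general LM, score it against a small model for each sub-category, and pick the best-scoring topic for the second pass. This is a toy illustration with smoothed unigram models and invented English training words; the actual topic models and detection method are not described in the slides.

```python
import math
from collections import Counter

def train_unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram log-probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: math.log((counts[w] + alpha) / total) for w in vocab}

def pick_topic(tokens, topic_models):
    """Return the topic whose model gives the in-vocabulary tokens the
    highest log-likelihood."""
    def score(model):
        return sum(model[t] for t in tokens if t in model)
    return max(topic_models, key=lambda name: score(topic_models[name]))

# Invented toy training data per sub-category.
training = {
    "sport": "goal match team goal coach".split(),
    "finance": "stock market bank stock euro".split(),
    "politics": "minister cabinet parliament minister vote".split(),
    "weather": "rain wind sun rain cloud".split(),
}
vocab = {w for toks in training.values() for w in toks}
models = {name: train_unigram(toks, vocab) for name, toks in training.items()}

first_pass = "the minister spoke to parliament about the vote".split()
print(pick_topic(first_pass, models))
```

In a real system the selected topic would determine which n-gram LM the recogniser loads for its second pass over the same audio segment.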

Page 25: Language & Speech Technology

Segmentation I

• Full news broadcasts are too long (20 min.)

• Retrieved items may start and/or stop in the middle of phrases

• A different LM has to be assigned to each “story”

Page 26: Language & Speech Technology

Segmentation II

• Segmentation in phrases, sentences, and paragraphs
  – Prosodic information
    • F0
    • Pauses
  – Assigning different LMs
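Of the two prosodic cues, pauses are the easier one to illustrate. The sketch below cuts a stream of time-coded words wherever the silence between two words reaches a threshold. The `(word, start, end)` format, the 0.5 second threshold, and the example timings are assumptions; F0 is not used here.

```python
def segment_on_pauses(words, min_pause=0.5):
    """Group time-coded words into segments, cutting where the silence between
    consecutive words is at least `min_pause` seconds.

    `words` is a list of (word, start, end) tuples with times in seconds.
    """
    segments, current = [], []
    prev_end = None
    for word, start, end in words:
        if prev_end is not None and start - prev_end >= min_pause:
            segments.append(current)
            current = []
        current.append(word)
        prev_end = end
    if current:
        segments.append(current)
    return segments

# Invented timings: a 0.8 s pause separates the two phrases.
words = [("vanavond", 0.0, 0.6), ("is", 0.65, 0.8), ("het", 0.8, 0.95),
         ("kabinetsberaad", 1.0, 1.9), ("het", 2.7, 2.85), ("geweld", 2.9, 3.4)]
print(segment_on_pauses(words))
```

Longer thresholds would yield sentence- or paragraph-sized units, which is one way to obtain retrieval items that do not start or stop in the middle of a phrase.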

Page 27: Language & Speech Technology

Results

Description                         OOV      WER
Basic, 44K words                    5.07%    68.5%
+ forward/backward training         5.07%    62.4%
+ newspaper corpus                  5.07%    53.5%
+ newspaper corpus + FB training    5.07%    50.2%
+ 65K words                         3.54%    46.3%

Page 28: Language & Speech Technology

Results

Domain                 WER      OOV     Extra
Read speech            30%      2.5%    15 hrs training material
Broadcast news         36.9%    14%     5 hrs training material
Historical archives    90%      20%     1933
Historical archives    60%      10%     1940
Historical archives    43%      14%     1960
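For readers unfamiliar with the measure: word error rate (WER) is conventionally the number of word substitutions, deletions, and insertions needed to turn the recogniser output into the reference transcript, divided by the number of reference words. A minimal sketch of that standard calculation follows; the slides do not say which scoring tool was used, and the example fragment is shortened and lower-cased from the news item in this deck.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,             # deletion
                          curr[j - 1] + 1,         # insertion
                          prev[j - 1] + (r != h))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

reference = "de israëlische premier sharon houdt vanavond"
hypothesis = "de israëlische premier chevron houdt vanavond"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.1%}")
```

Because insertions count as errors, WER can exceed 100% when the recogniser outputs many more words than were spoken.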

Page 29: Language & Speech Technology

DRUID

“de Israëlische premier Chevron houdt vanavond en televisie toespraak zullen ingaan op de crisis die is ontstaan na de bloedige aanslagen van het weekend in Jeruzalem en hij vaak zo'n kwam vanochtend vroeg terug uit Amerika heeft gesproken met president Bush het ene op het vliegveld van Tel Aviv pasje om met ministers pers en ben een Jezus met weinig gevoel voor huizen vanavond is het kabinet kabinet beraadt geweld gaat ook vanochtend door op de westelijke Jordaan oever bijen is z'n vijven dertig jarige Palestijn door Israëlische militairen gedood die bij controle proberen te vluchten of stonden Shiva heeft pech”

“de Israëlische premier Sharon houdt vanavond ‘n televisie toespraak. Hij zal dan ingaan op de crisis die is ontstaan na de bloedige aanslagen van dit weekend in Jeruzalem en Haifa. Sharon kwam vanochtend vervroegd terug uit Amerika; daar heeft hij gesproken met president Bush. Meteen al op het vliegveld van TelAviv sprak Sharon met de ministers Peres en Ben Illiëzer en met veiligheidsfunctionarissen. Vanavond is het kabinet kabinetsberaadt. ‘t geweld gaat ook vanochtend door, op de westelijke Jordaanoever bij Jinien is 'n vijfendertig jarige Palestijn door Israëlische militairen gedood toen ie bij controle probeerden te vluchten. Correspondent: Shivra Hertzberg”

7.2 3 December 2001 12:14

Page 30: Language & Speech Technology

OOV problems

• 20% (14k) of the 65k most frequent words (MFW) are not in the phonetic dictionary
• 86% of these 14k words start with a capital letter
• 50% of these 14k words are names (family, geographic, company) that are not in the phonetic dictionary and are difficult to transcribe with G2P because they often do not follow Dutch transcription rules

Page 32: Language & Speech Technology

DRUID

• Evaluation
  – A time-consuming, boring, but necessary process!!

Page 33: Language & Speech Technology

Questions?