AUTOMATIC SUMMARIZATION OF HIGHLY SPONTANEOUS SPEECH
András Beke – György Szaszák
Research Institute for Linguistics, Hungarian Academy of Sciences
Dept. of Telecommunications and Media Informatics, Budapest University of Technology and Economics
DATA EXPLOSION: HUMAN VS MACHINE PROCESSING
[Chart: worldwide data volume, 2006–2011, growing from 180 exabytes to 1,800 exabytes]
The UNSTRUCTURED data explosion
- Growing 10x every 5 years and 100x every 10 years
- Requires a new approach
Humans are not able to read every document, listen to every audio file, or watch every movie!
STATISTICAL MACHINE LEARNING
STRUCTURED data
AUTOMATIC SUMMARIZATION

POSSIBLE APPROACHES
• information retrieval
• document clustering
• information extraction
• visualization
• question answering
• text summarization

TYPES OF SUMMARIZATION
• Indicative: describes the document and its contents
• Informative: ‘replaces’ the document
• Extractive: concatenates pieces of the existing document
• Generative: creates a new document
• Document compression
COMPARING SPEECH AND TEXT SUMMARIZATION
ALIKE:
• Identifying important information
• Some lexical, discourse features
• Extraction or generation or compression
DIFFERENT:
• Speech signal
• Prosodic features
• NLP tools?
• Segments – sentences?
• Generation?
• Errors
• Data size
SENTENCE EXTRACTION / SIMILARITY MEASURES (Salton et al. 1995)
• Extract sentences by their similarity to a topic sentence and their dissimilarity to sentences already in the summary (Maximal Marginal Relevance)
• Similarity measures:
  • Cosine measure
  • Vocabulary overlap
  • Topic word overlap
  • Content signatures overlap
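The cosine measure and the greedy Maximal Marginal Relevance selection above can be sketched in a few lines of plain Python. This is a minimal illustration over bag-of-words vectors, not the system described in the slides; the function names and the λ trade-off parameter are assumptions:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(sent_a, sent_b):
    """Cosine similarity between two sentences as bag-of-words count vectors."""
    a, b = Counter(sent_a.lower().split()), Counter(sent_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mmr_select(topic, candidates, k=3, lam=0.7):
    """Greedy MMR: reward similarity to the topic, penalize similarity
    to sentences already placed in the summary."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr(s):
            relevance = cosine_similarity(s, topic)
            redundancy = max((cosine_similarity(s, t) for t in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected
```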
• Automatic sentence segmentation (tokenization) is crucial before such a sentence-based extractive summarization (Liu and Xi 2008). The difficulty comes not only from recognition errors, but also from missing punctuation marks, which would be fundamental in syntactic parsing and POS tagging (disambiguation).
Our work
• Works on extractive summarization use two major steps:
1) the first step is ranking the sentences based on their scores, which are computed by combining features such as term frequency (TF), positional information and cue phrases;
2) the second step consists in selecting a few top-ranked sentences to prepare the summary.
Our work
• In this work we present an initial effort to develop a Hungarian speech summarization system.
• Summarization will also be compared to a baseline version using tokens available from human annotation.
• In the current work we propose a prosody-based automatic tokenizer which recovers intonational phrases (IP) and uses IPs as sentence-like units in further analysis.
• The baseline tokenization relies on acoustic (silence) and syntactic-semantic (syntactically or semantically closely belonging together) axes.
• In Hungarian, both speech recognition and text-based syntactic analysis are difficult compared to English due to the very rich morphology of the language.
MATERIAL AND SPEECH-TO-TEXT
SPEECH MATERIAL
• 4 interviews from the BEA Hungarian Spontaneous Speech database
• Participants talk about their jobs, family, and hobbies.
• Three of the speakers are male and one of them is female.
• All speakers are native Hungarian, living in Budapest (aged between 30 and 60).
• The total material is 28 minutes long (average duration was 7 minutes per participant).
The Speech-to-Text System
• We use 160 interviews from BEA, accounting for 120 hours of speech (the interviewer discarded), to train speech-to-text (S2T) acoustic models.
• Speakers involved in the 4 interviews used for summarization are held out.
• Using the Kaldi toolkit we train 3-hidden-layer DNN acoustic models with 2048 neurons per layer and tanh non-linearity on 160 interviews from BEA (Hungarian).
• Input data is 9x spliced MFCC13 + CMVN + LDA/MLLT.
• A trigram language model is trained on transcripts of the 160 interviews after text normalization, with Kneser-Ney smoothing. Dictionaries are obtained using a rule-based phonetizer (spoken Hungarian is very close to the written form).
Word Error Rate (WER) was found to be around 44% for this task. This relatively high WER is explained by the high spontaneity of the speech. Stem error rate was found to be somewhat smaller, 39%.
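For reference, WER is the word-level Levenshtein distance between the reference transcript and the recognizer output, normalized by the reference length. A minimal sketch (standard dynamic programming, not tied to the Kaldi scoring used in the slides):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)
```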
UTTERANCE SEGMENTATION (THE IP-TOKENIZER)
• This system uses phonological phrase models and aligns them to the input speech based on the prosodic features F0 and mean energy (Szaszák and Beke 2012).
• In this work we use it to obtain sentence-like tokens from speech-to-text output.
• We use the IP tokenizer at an operating point with high precision (96% on read speech) and lower recall (80% on read speech).
THE SUMMARIZATION APPROACH
BLOCK DIAGRAM
PRE-PROCESSING
• Stop-words are removed from the tokens and stemming is performed. Stop-words are collected into a list, which contains:
  • all words tagged as fillers by the S2T component (speaker noise) and
  • a predefined set of non-content words such as articles, conjunctions etc.
• The magyarlánc toolkit (Zsibrita et al. 2013) was used for the stemming and POS tagging of the Hungarian text.
• The words are filtered to keep only nouns.
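The pre-processing step above can be sketched as a simple filter. The actual pipeline uses magyarlánc (a Java tool); here the tagger output is assumed to arrive as (word, stem, POS) triples, and the tag names are illustrative:

```python
def preprocess(tagged_tokens, stop_words):
    """Keep only noun stems: drop stop-words (including S2T filler tokens)
    and every non-noun. `tagged_tokens` is assumed to be a list of
    (word, stem, pos) triples, e.g. from a POS tagger such as magyarlánc."""
    return [stem for word, stem, pos in tagged_tokens
            if pos == "NOUN" and word.lower() not in stop_words]
```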
TEXTUAL FEATURE EXTRACTION (WORD LEVEL)
• TF-IDF (Term Frequency – Inverse Document Frequency) reflects the importance of a sentence, which is generally measured by the number of keywords present in it. The importance value of a sentence is computed as the sum of the TF-IDF values of its constituent words (in this work: nouns) divided by the sum of all TF-IDF values found in the text.
• Latent Semantic Analysis (LSA) exploits context to try to find words with similar meaning. LSA is able to reflect both word and sentence importance. Singular Value Decomposition (SVD) is used to assess semantic similarity.
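Both word-level features can be illustrated compactly. Below is a minimal sketch with NumPy (sentences as lists of noun stems, each sentence treated as a "document" for IDF); the function names and the topic count `k` are assumptions, not the authors' implementation:

```python
import math
import numpy as np

def tf_idf_matrix(sentences):
    """Term-by-sentence matrix of TF-IDF weights; each sentence is a
    list of (stemmed) words and plays the role of a document for IDF."""
    vocab = sorted({w for s in sentences for w in s})
    n = len(sentences)
    idf = {w: math.log(n / sum(w in s for s in sentences)) for w in vocab}
    m = np.zeros((len(vocab), n))
    for j, sent in enumerate(sentences):
        for i, w in enumerate(vocab):
            m[i, j] = sent.count(w) * idf[w]
    return m

def lsa_sentence_scores(sentences, k=2):
    """Score sentences by their weight in the top-k latent topics:
    SVD of the TF-IDF matrix, then sum |V^T| rows scaled by the
    singular values."""
    u, s, vt = np.linalg.svd(tf_idf_matrix(sentences), full_matrices=False)
    k = min(k, len(s))
    return np.abs(vt[:k].T * s[:k]).sum(axis=1)
```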
TEXTUAL FEATURE EXTRACTION (SENTENCE LEVEL)
• Positional value: the more meaningful sentences can be found at the beginning of the document.
• This is even more true in the case of spontaneous narratives, as the interviewer asks the participant to tell something about her/his life, job, hobbies:

Pk = 1/√k

where Pk is the positional score of the k-th sentence.
• Sentence ranking: The ranking score RSk is calculated as the linear combination of the so-called thematic term-based score Sk and the positional score Pk. The final score of a sentence k is:

RSk = Sk + Pk   if α ≤ k/N ≤ β and LL ≤ Lk ≤ LU, otherwise RSk = 0

where α is the lower and β is the upper cut-off for the sentence position (0 ≤ α, β ≤ 1), and LL is the lower and LU is the upper cut-off on the sentence length Lk.
• Sentence length: Usually a short sentence is less informative than a longer one, and hence readers or listeners are more prone to select a longer sentence than a short one when asked to find good summarizing sentences in documents. If a sentence is too short or too long, it is assigned a ranking score of 0.
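A minimal sketch of this scoring rule, assuming an unweighted sum of the thematic and positional scores and illustrative cut-off values (the slides do not give the exact weights or cut-offs):

```python
import math

def ranking_score(thematic, k, n_sentences, length,
                  alpha=0.0, beta=1.0, l_lower=3, l_upper=40):
    """Ranking score of the k-th sentence (1-based): thematic score S_k
    plus positional score P_k = 1/sqrt(k), zeroed whenever the sentence
    falls outside the position or length cut-offs. Cut-off defaults are
    illustrative assumptions."""
    position = k / n_sentences
    if not (alpha <= position <= beta) or not (l_lower <= length <= l_upper):
        return 0.0
    return thematic + 1.0 / math.sqrt(k)
```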
SUMMARY GENERATION
The last step is to generate the summary. In this process the N top-ranked sentences are selected from the text (Sarkar 2012). We set N to 10, so the final text summary contains the top 10 sentences.
EVALUATION
METRICS
Compare reference summary and system summary
• OBJECTIVE
  • F1-measure: soft comparison
  • ROUGE: hard comparison
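ROUGE-1 recall, the simplest member of the ROUGE family, counts the reference unigrams that also appear in the system summary. A minimal sketch (the slides do not specify which ROUGE variant was used):

```python
from collections import Counter

def rouge_1(reference, system):
    """ROUGE-1 recall: fraction of reference unigrams covered by the
    system summary, with clipped (per-word minimum) counts."""
    ref, sys_ = Counter(reference.split()), Counter(system.split())
    overlap = sum(min(c, sys_[w]) for w, c in ref.items())
    return overlap / sum(ref.values())
```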
Create reference summary
• 10 participants were asked to select up to 10 sentences that they found to be the most informative for a given document (presented both in spoken and in written form).
• Participants used 6.8 sentences on average for their summaries. For each narrative, a set of reference sentences was created: sentences chosen by at least 1/3 of the participants were added to the reference summary.
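The voting rule for building the reference set can be sketched directly; `selections` is assumed to hold one set of chosen sentence indices per participant:

```python
from collections import Counter

def reference_summary(selections, n_participants, threshold=1/3):
    """Sentences chosen by at least `threshold` of the participants form
    the reference summary."""
    votes = Counter(i for sel in selections for i in sel)
    return sorted(i for i, v in votes.items()
                  if v >= threshold * n_participants)
```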
EXPERIMENTS
3 setups:
• OT-H: Use the original transcribed text as segmented by the human annotators into sentence-like units.
• S2T-H: Use speech-to-text conversion to obtain text, but use the human-annotated tokens.
• S2T-IP: Use speech-to-text conversion to obtain text and tokenize it based on IP boundary detection from speech.
                     Soft comparison           Hard comparison (ROUGE)
Setup    Approach   Recall  Precision  F1      Recall  Precision  F1
OT-H     TF-IDF     0.51    0.76       0.61    0.36    0.28       0.32
OT-H     LSA        0.36    0.71       0.46    0.36    0.30       0.32
S2T-H    TF-IDF     0.51    0.80       0.61    0.34    0.29       0.31
S2T-H    LSA        0.49    0.77       0.56    0.39    0.27       0.32
S2T-IP   TF-IDF     0.62    0.79       0.68    0.33    0.28       0.30
S2T-IP   LSA        0.59    0.78       0.65    0.33    0.32       0.32
CONCLUSION
• This paper addressed speech summarization for highly spontaneous Hungarian. Given this high degree of spontaneity and also the heavily agglutinating property of Hungarian, we believe the obtained results are promising, as they are comparable to results published for other languages (Campr and Ježek 2015).
• The proposed IP-detection-based tokenization was as successful as the available human one. The overall best results were 62% recall and 79% precision (F1 = 0.68).