Correcting Language Errors: Machine Translation Techniques
Shamil Chollampatt ([email protected])
Natural Language Processing Group, National University of Singapore
What are language errors?

• A "language error" is a deviation from the rules of a language.
• Due to lack of knowledge.
• Made by learners of the language.
• Language errors in writing include spelling, grammatical, word choice, and stylistic errors.
How can NLP help?

• Building automatic grammar correction tools and spell checkers.
• Rule-based systems (e.g., Microsoft Word) and advanced software that corrects different kinds of errors (e.g., Grammarly, Ginger).
• Useful tools for non-native writers.
• Evidence that corrective feedback helps language learning (Leacock et al., Automated Grammatical Error Detection for Language Learners, 2nd ed., 2014).
Grammatical Error Correction, or "GEC"

• Automatic correction of various kinds of errors in written text.

Example (from the NUS Corpus of Learner English, NUCLE):
  Input: The problems bring some effect on engineering design from two aspect, independent innovation and engineering application.
  Corrected: The problems affect engineering design in two aspects, independent innovation and engineering application.

• The most popular approach is the machine translation approach.
The Translation Approach

• Treats GEC as a translation task from "bad" English → "good" English.

Advantages:
✓ Able to learn text transformations from parallel data.
✓ Simple, and does not need language-dependent tools.
✓ Can correct interacting errors and complex error types.

• Typically uses statistical machine translation (SMT) or neural machine translation (NMT) frameworks.
History

• SMT for countability errors of mass nouns (Brockett et al., 2006)
• Japanese SMT-based GEC and the Lang-8 corpus (Mizumoto et al., 2011)
• CoNLL-2014 Shared Task: 2 of the top 3 systems use SMT
• System combination approach beats CoNLL-2014 systems (Susanto et al., 2014)
• Neural models as features (Chollampatt et al., 2016)
• Neural machine translation approach to GEC (Yuan and Briscoe, 2016)
• GEC-specific features (Junczys-Dowmunt and Grundkiewicz, 2016)
• Combining word- and character-level SMT (Chollampatt and Ng, 2017)
• Convolutional neural encoder-decoder for GEC achieves the best results (Chollampatt and Ng, AAAI 2018, to appear)
Data

For training:
• Parallel corpora:
  - Annotated learner dataset: NUCLE
  - Crawled from Lang-8
• English corpora: Wikipedia, CommonCrawl

For testing: CoNLL-2014 shared task test set (1,312 sentences)
Metric: F0.5 using the MaxMatch scorer
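For reference, $F_{0.5}$ weights precision twice as much as recall, reflecting that precision matters more than recall when correcting learner text; with precision $P$ and recall $R$ computed by the scorer over system and gold edits:

$$F_{0.5} = \frac{(1 + 0.5^2)\, P \cdot R}{0.5^2 \cdot P + R}$$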
Word and Character-level SMT for GEC
Statistical Machine Translation Approach

[Diagram: a translation model is trained on parallel text (learner text & corrected text), and a language model is trained on well-formed English text; the SMT decoder combines both to map an input sentence to an output sentence.]
Statistical Machine Translation Approach

• Uses a log-linear framework:

$$T^* = \arg\max_{T} P(T \mid S) = \arg\max_{T} \sum_{i=1}^{N} \lambda_i f_i(S, T)$$

where $T^*$ is the best output sentence, $S$ the source sentence, $T$ a candidate output sentence, $N$ the number of features, $\lambda_i$ the $i$-th feature weight, and $f_i$ the $i$-th feature function.

• Feature weights $\lambda_i$ are tuned using MERT, optimizing the F0.5 metric on a development set.
Phrase-based SMT

Input Sentence (S):
  Thus, advice from hospital plays the important role for this.

Output Sentence (T*):
  Thus, advice from the hospital plays an important role in this.
Useful GEC-specific Features

• Introduced by Junczys-Dowmunt and Grundkiewicz (CoNLL-2014 Shared Task; EMNLP 2016):
  ‣ Word Class Language Model
  ‣ Operation Sequence Model
  ‣ Edit Operations (sketched below)
  ‣ Sparse Edit Operation Features
  ‣ A Web-scale LM
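The edit-operation features count the insertions, deletions, and substitutions a hypothesis applies to the source. A minimal sketch of that idea, using Python's difflib for token alignment (the actual system computes these features inside the decoder):

```python
# Count insertions, deletions, and substitutions between source and
# hypothesis tokens, a simplified stand-in for the edit-operation features.
import difflib

def edit_operation_counts(source_tokens, hyp_tokens):
    ins = dele = sub = 0
    matcher = difflib.SequenceMatcher(a=source_tokens, b=hyp_tokens)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "insert":
            ins += j2 - j1
        elif op == "delete":
            dele += i2 - i1
        elif op == "replace":
            sub += max(i2 - i1, j2 - j1)
    return {"insertions": ins, "deletions": dele, "substitutions": sub}

src = "Thus , advice from hospital plays the important role for this .".split()
hyp = "Thus , advice from the hospital plays an important role in this .".split()
print(edit_operation_counts(src, hyp))
# -> {'insertions': 1, 'deletions': 0, 'substitutions': 2}
```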
Neural Network Joint Model

• Joint Model (JM) vs. Language Model (LM)
• Feature function:

$$f(T, S) = P(T \mid S) \approx \prod_{i=1}^{|T|} P(t_i \mid s_{j-1}, s_j, s_{j+1}, t_{i-1})$$

where $s_j$ is the source word aligned to the target word $t_i$.

Example:
  SRC: The cat sit in a mat.
  HYP: The cats sat on the mat.
  3+2-gram JM: $P(\text{sat} \mid \text{cat}, \text{sit}, \text{in}, \text{cats})$
  Bigram LM: $P(\text{sat} \mid \text{cats})$
Neural Network Joint Model

• Uses a feed-forward neural network (Devlin et al., 2014).
• 5+5-gram NNJM for GEC in Chollampatt et al. (IJCAI 2016 and BEA Workshop 2017).

[Diagram: the source context words (cat, sit, in) and the previous target word (cats) are looked up in embedding matrices $E_s$ and $E_t$, passed through hidden layers, and a softmax over the output vocabulary yields $P(\text{target word} \mid \text{context})$, e.g., $P(\text{sat} \mid \text{cat}, \text{sit}, \text{in}, \text{cats})$.]
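A minimal PyTorch sketch of such a feed-forward joint model, mirroring the 3+2-gram example above (a 3-word source window plus one previous target word); the vocabulary sizes, single hidden layer, and dimensions are illustrative assumptions, not the exact configuration of the cited papers:

```python
# Feed-forward NNJM sketch: embed a source window and the target history,
# then predict the next target word over the output vocabulary.
import torch
import torch.nn as nn

class FeedForwardNNJM(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=192, hidden=512,
                 src_window=3, tgt_history=1):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)  # E_s
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)  # E_t
        n_ctx = src_window + tgt_history
        self.hidden = nn.Linear(n_ctx * emb_dim, hidden)
        self.out = nn.Linear(hidden, tgt_vocab)          # output vocabulary layer

    def forward(self, src_ctx, tgt_ctx):
        # src_ctx: (batch, src_window) window around the aligned source word
        # tgt_ctx: (batch, tgt_history) previous target words
        x = torch.cat([self.src_emb(src_ctx).flatten(1),
                       self.tgt_emb(tgt_ctx).flatten(1)], dim=1)
        h = torch.tanh(self.hidden(x))
        return torch.log_softmax(self.out(h), dim=-1)  # log P(target word | context)

# P(sat | cat, sit, in, cats) as a toy call with made-up token ids.
model = FeedForwardNNJM(src_vocab=100, tgt_vocab=100)
log_probs = model(torch.tensor([[11, 12, 13]]), torch.tensor([[21]]))
print(log_probs.shape)  # torch.Size([1, 100])
```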
NNJM Adaptation

• Training: log-likelihood objective with self-normalization.
• Adaptation: adding a KL-divergence regularization term to the loss function (a sketch follows).
• Adaptation data:
  ✓ Higher-quality error annotations
  ✓ Higher error/sentence ratio
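A hedged sketch of what such an adaptation objective can look like: cross-entropy on the adaptation data plus a KL term that keeps the adapted model close to the baseline model's output distribution. The weighting and the exact form of the term are assumptions for illustration, not the paper's exact loss:

```python
# Adaptation loss sketch: NLL on in-domain data + KL(P_baseline || P_adapted).
# The kl_weight and the direction/placement of the KL term are assumptions.
import torch
import torch.nn.functional as F

def adaptation_loss(adapted_logits, baseline_logits, target_ids, kl_weight=0.5):
    # Standard negative log-likelihood on the adaptation data.
    nll = F.cross_entropy(adapted_logits, target_ids)
    # KL divergence anchoring the adapted model to the baseline model.
    kl = F.kl_div(F.log_softmax(adapted_logits, dim=-1),
                  F.softmax(baseline_logits.detach(), dim=-1),
                  reduction="batchmean")
    return nll + kl_weight * kl

# Toy usage with random logits over a 100-word vocabulary.
logits_adapted = torch.randn(8, 100, requires_grad=True)
logits_baseline = torch.randn(8, 100)
targets = torch.randint(0, 100, (8,))
print(adaptation_loss(logits_adapted, logits_baseline, targets))
```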
SMT for Spelling Correction

• Added as a post-processing step to the word-level SMT.
• Character-level SMT takes the unknown words left by the word-level SMT system and generates candidates (which may be non-words).
• Rescoring with a language model filters away non-word candidates and picks the best correction based on context.

Example: for the unknown word "utlises", the character-level system generates candidates such as "utilises", "utilizes", "utilise", and "utilishes".
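A minimal sketch of the rescoring step; the candidate list mirrors the example above, while the vocabulary and scores are made up, and a unigram score stands in for the n-gram LM that the actual system applies in sentence context:

```python
# Filter character-level SMT candidates through a word list (dropping
# non-words), then pick the best surviving candidate by LM score.
def correct_unknown_word(candidates, vocabulary, lm_score):
    real_words = [c for c in candidates if c in vocabulary]
    if not real_words:
        return None  # leave the word unchanged if no candidate survives
    return max(real_words, key=lm_score)

candidates = ["utilises", "utilizes", "utilise", "utilishes"]
vocabulary = {"utilises", "utilizes", "utilise"}          # made-up word list
logprob = {"utilises": -9.1, "utilizes": -8.7, "utilise": -9.8}.get
print(correct_unknown_word(candidates, vocabulary,
                           lambda w: logprob(w, float("-inf"))))
# -> "utilizes" under these made-up scores
```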
Setup

• Development data: 5,458 sentences from NUCLE with at least 1 error per sentence.
• Parallel training data for word-level SMT: Lang-8 and NUCLE (2.21M sentences, 26.77M source words).
• Data for character-level SMT: unique words on the corrected side of NUCLE and the corpora of misspellings (http://www.dcs.bbk.ac.uk/~ROGER/corpora.html).
• LM training data: Wikipedia (1.78B tokens), CommonCrawl (94B tokens).
Results (F0.5 on the CoNLL-2014 test set):

  SMT-GEC              43.16
  + GEC features       45.90
  + Web-scale LM       49.25
  + Adapted NNJM       51.70
  + SMT for spelling   53.14
  R&R (2016)           47.40
  J&G (2016)           49.52

R&R (2016): Rozovskaya and Roth (ACL 2016). J&G (2016): Junczys-Dowmunt and Grundkiewicz (EMNLP 2016).
Multilayer Convolutional Encoder and Decoder Neural Network for GEC
Encoder-Decoder Approach

[Diagram: the encoder reads the input sentence, and the decoder generates the output sentence, attending over the encoder states.]

• Prior work in GEC: recurrent neural network (RNN)-based approaches (Bahdanau et al., 2015).
• We use a fully convolutional neural network (CNN)-based approach (Gehring et al., 2017).
A Multilayer Convolutional Encoder-Decoder

Encoder: consists of seven layers.

• Convolution operation: $\mathbf{f}_i^l = \mathrm{Conv}(\mathbf{h}_{i-1}^{l-1}, \mathbf{h}_i^{l-1}, \mathbf{h}_{i+1}^{l-1})$
• Gated linear units (GLUs): $\mathrm{GLU}(\mathbf{f}_i^l) = \mathbf{f}_{i,1:d}^l \otimes \sigma(\mathbf{f}_{i,d+1:2d}^l)$
• Residual connections: $\mathbf{h}_i^l = \mathrm{GLU}(\mathbf{f}_i^l) + \mathbf{h}_i^{l-1}$
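A PyTorch sketch of one such encoder layer: a width-3 convolution producing $2d$ channels, a GLU, and a residual connection. The dimension $d = 1024$ matches the output vector size given later; other details are illustrative:

```python
# One encoder layer: Conv -> GLU -> residual, stacked seven times.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGLULayer(nn.Module):
    def __init__(self, d=1024, kernel=3):
        super().__init__()
        # 2d output channels: d for the content half, d for the gate half.
        self.conv = nn.Conv1d(d, 2 * d, kernel, padding=kernel // 2)

    def forward(self, h_prev):
        # h_prev: (batch, d, seq_len) hidden states from layer l-1.
        f = self.conv(h_prev)   # f_i^l = Conv(h_{i-1}^{l-1}, h_i^{l-1}, h_{i+1}^{l-1})
        g = F.glu(f, dim=1)     # GLU: f_{1:d} * sigmoid(f_{d+1:2d})
        return g + h_prev       # residual: h_i^l = GLU(f_i^l) + h_i^{l-1}

# Seven-layer encoder stack, as described above.
encoder = nn.Sequential(*[ConvGLULayer() for _ in range(7)])
x = torch.randn(2, 1024, 10)  # (batch, channels, source length)
print(encoder(x).shape)       # torch.Size([2, 1024, 10])
```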
A Multilayer Convolutional Encoder-Decoder

Decoder: consists of seven layers.

• Consists of convolutions and non-linearities, plus attention:

$$\alpha_{j,i}^l = \frac{\exp(\mathbf{e}_i^\top \mathbf{z}_j^l)}{\sum_{k=1}^{m} \exp(\mathbf{e}_k^\top \mathbf{z}_j^l)} \qquad \mathbf{x}_j^l = \sum_{i=1}^{m} \alpha_{j,i}^l \,(\mathbf{e}_i + \mathbf{s}_i)$$

where $\mathbf{e}_i$ are the source word representations, $\mathbf{s}_i$ the encoder outputs, $\mathbf{z}_j^l$ the decoder state at target position $j$ in layer $l$, and $m$ the source length.
Pre-training Word Embeddings

• Word embeddings are pre-trained and used for initialization.
• Trained using fastText (Bojanowski et al., 2017) on Wikipedia.
• Uses the underlying character n-gram sequences of words.

Advantages:
✓ Reliable embeddings can be constructed for rarer words.
✓ Morphology of words is taken into account.
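A hedged sketch of this pre-training step with the fastText Python bindings; "wiki.txt" is a placeholder path to a preprocessed Wikipedia dump, and the n-gram range is the library default rather than a confirmed setting:

```python
# Pre-train subword-aware embeddings with fastText.
import fasttext

model = fasttext.train_unsupervised(
    "wiki.txt",          # placeholder: plain-text Wikipedia corpus
    model="skipgram",
    dim=500,             # matches the embedding size reported later
    minn=3, maxn=6,      # character n-gram range: the source of subword info
)

# Because vectors are composed from character n-grams, rare or unseen
# words still get usable embeddings:
print(model.get_word_vector("unallowable")[:5])
```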
Ensembling and Re-scoring

• Ensembling of multiple models: the log probabilities of multiple models are averaged during the prediction of each output word.
• The final beam candidates are re-scored using features:
  - Edit operations (EO): #insertions, #deletions, #substitutions
  - Language model (LM): web-scale LM score, #words
• Feature weight tuning is done as in SMT: MERT optimizing F0.5 on the development data. (A sketch of both steps follows.)
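A minimal sketch of both steps; the model stubs, feature values, and weights below are made up (in the real system the weights come from MERT tuning):

```python
# Ensembling: average per-word log probabilities across models.
# Re-scoring: re-rank final beam candidates with a weighted feature sum.

def ensemble_log_prob(word, context, models):
    """Average the log probabilities assigned by each model."""
    return sum(m(word, context) for m in models) / len(models)

def rescore(candidates, features, weights):
    """candidates: list of (sentence, beam_log_prob) pairs."""
    def total(cand):
        sent, logp = cand
        return logp + sum(w * f(sent) for f, w in zip(features, weights))
    return max(candidates, key=total)

# Toy model stubs standing in for trained networks.
models = [lambda w, ctx: -1.2, lambda w, ctx: -0.8]
print(ensemble_log_prob("sat", ("cats",), models))  # -1.0

# Hypothetical re-scoring features: word count and a stand-in LM score.
features = [lambda s: len(s.split()), lambda s: -0.1 * len(s)]
weights = [0.2, 1.0]
beam = [("He has a car .", -3.2), ("He have a car .", -2.9)]
print(rescore(beam, features, weights))
```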
Model and Training Details

• Data: as in Chollampatt and Ng (BEA 2017), except that only annotated sentence pairs are used during training.
• Vocabulary: 30K most frequent words on the source and target sides.
• Embedding dimensions: 500
• Encoder/decoder output vector dimensions: 1024
Results (F0.5 on the CoNLL-2014 test set):

  Multilayer Conv Enc-Dec       45.36
  + Pre-training embeddings     46.38
  + Ensemble of 4 models        49.33
  + Re-scoring (EO, LM)         54.13
  Chollampatt and Ng (2017)     53.14
  Ji et al. (2017) without LM   41.53
  Ji et al. (2017)              45.15
  Schmaltz et al. (2017)        41.37
Challenges and Future Work

• Lack of good-quality parallel data.
• Going beyond the sentence level.
• Adaptation to diverse learners.
Thank You

Email: [email protected]
Homepage: shamilcm.github.io