Named Entity Recognition from a Data-Driven Perspective
Jingbo Shang
Computer Science & Engineering and Halıcıoğlu Data Science Institute
University of California, San Diego
Outline
- Background
  - What's Named Entity Recognition (NER)?
  - What's "Data-Driven"?
- Data-Driven NER Methods
What’s Named Entity Recognition?
- Wikipedia: Named-entity recognition (NER) is a subtask of information extraction (IE) that seeks to locate and classify named entities in text into pre-defined categories.
- In IE, a named entity is a real-world object.
- Example
  - Input: Jim bought 300 shares of Acme Corp. in 2006.
  - Output: [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.
Supervised Methods: Training Data
- Sequence labeling framework
- Two popular schemes
  - BIO: Begin, In, Out
  - BIOES: Begin, In, Out, End, Singleton
  - BIOES is arguably better than BIO (Ratinov and Roth, ACL'09)
- Example (a conversion sketch follows this list):
  - LABELS: [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.
  - TOKENS: Jim bought 300 shares of Acme Corp. in 2006 .
  - BIO:   B-PER O O O O B-ORG I-ORG O B-Time O
  - BIOES: S-PER O O O O B-ORG E-ORG O S-Time O
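A minimal sketch (not from the slides) of how the two schemes label the example sentence above; the token indices and the helper name `to_tags` are my own:

```python
# Convert span annotations into BIO or BIOES tag sequences.
def to_tags(tokens, spans, scheme="BIOES"):
    """spans: list of (start_token, end_token_exclusive, type)."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        if scheme == "BIOES" and end - start == 1:
            tags[start] = f"S-{etype}"       # single-token entity
            continue
        tags[start] = f"B-{etype}"           # begin
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"           # inside
        if scheme == "BIOES":
            tags[end - 1] = f"E-{etype}"     # end

    return tags

tokens = "Jim bought 300 shares of Acme Corp. in 2006 .".split()
spans = [(0, 1, "PER"), (5, 7, "ORG"), (8, 9, "Time")]
print(to_tags(tokens, spans, "BIO"))    # B-PER O O O O B-ORG I-ORG O B-Time O
print(to_tags(tokens, spans, "BIOES"))  # S-PER O O O O B-ORG E-ORG O S-Time O
```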
Supervised Methods: Neural Models
- Two pioneer models
  - LSTM-CRF (Lample et al., NAACL'16)
  - LSTM-CNN-CRF (Ma and Hovy, ACL'16)
- The first neural models to outperform models based on handcrafted features

|                 | LSTM-CRF            | LSTM-CNN-CRF        |
| Word-level      | Bidirectional LSTMs | Bidirectional LSTMs |
| Character-level | Bidirectional LSTMs | Convolutional NN    |
“Data-Driven” Philosophy
- Key
  - Enhance NER performance without introducing any additional human annotations
- Questions
  - Can massive raw texts help?
  - Can dictionaries help?
  - Are human annotations always correct?
  - Is the tokenizer always good?
  - ...
Questions
- Can massive raw texts help?
- Can dictionaries help?
- Are human annotations always correct?
- Is the tokenizer always good?
Word Embedding → Language Model (LM)

- Using language models for better representations
  - Word-level language models
    - ELMo (Peters et al., NAACL'18, best paper)
    - LD-Net (Liu et al., EMNLP'18)
  - Character-level language models
    - LM-LSTM-CRF (Liu et al., AAAI'18)
    - Flair (Akbik et al., COLING'18)
  - Hybrid language models
    - Cross-View Training (Clark et al., EMNLP'18)
    - BERT (Devlin et al., NAACL'19, best paper)
What’s (Neural) Language Model?
[Figure: the input words "Obama was born" are encoded as one-hot vectors, mapped through a word embedding matrix, and used to predict the target words "was born in".]
- Describes the generation of text: predicting the next word based on the previous context (a minimal sketch follows this list)
- Pros
  - Does not require any human annotations
  - Nearly unlimited training data!
- The resulting models can generate sentences of unexpectedly high quality
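A minimal next-word language model sketch in PyTorch; the toy vocabulary, layer sizes, and module choices are my own assumptions rather than a specific model from the slides:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy vocabulary for the running example; real LMs use huge vocabularies.
vocab = ["<unk>", "Obama", "was", "born", "in", "Hawaii"]
word2id = {w: i for i, w in enumerate(vocab)}

embed = nn.Embedding(len(vocab), 8)     # word embeddings
rnn = nn.LSTM(8, 16, batch_first=True)  # context encoder
out = nn.Linear(16, len(vocab))         # distribution over the next word

text = ["Obama", "was", "born", "in", "Hawaii"]
ids = torch.tensor([[word2id[w] for w in text]])
inputs, targets = ids[:, :-1], ids[:, 1:]   # shift by one: predict the next word

hidden, _ = rnn(embed(inputs))
loss = F.cross_entropy(out(hidden).transpose(1, 2), targets)
loss.backward()  # standard LM training step; the "labels" come from raw text itself
```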
Neural LM: Example Generations
- Character-by-character Markdown generations:

  '''See also''': [[List of ethical consent processing]]

  == See also ==
  * [[Iender dome of the ED]]
  * [[Anti-autism]]

  === [[Religion|Religion]] ===
  * [[French Writings]]
  * [[Maria]]
  * [[Revelation]]

- Valid syntax!
Neural LM: Example Generations
- "Deep Donald Trump": mimic President Trump
- Fooled many Twitter users
LM-LSTM-CRF: Co-Train Neural LM
- Proposes to use a character-level language model as a co-training objective (a minimal sketch follows)
- Why character-level?
  - More efficient & more robust to pre-processing
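A minimal sketch of the co-training idea under my own assumptions (toy sizes, and a plain token-level cross-entropy standing in for the CRF loss): one shared character-level encoder feeds both a next-character LM head and a tagging head, and the two losses are simply summed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, hidden, n_tags = 100, 50, 9
encoder = nn.LSTM(input_size=vocab, hidden_size=hidden, batch_first=True)  # shared encoder
lm_head = nn.Linear(hidden, vocab)    # predicts the next character
tag_head = nn.Linear(hidden, n_tags)  # emission scores for the tagger (CRF in the real model)

chars = F.one_hot(torch.randint(0, vocab, (2, 40)), vocab).float()  # (batch, length, vocab)
next_chars = torch.randint(0, vocab, (2, 40))
tags = torch.randint(0, n_tags, (2, 40))

states, _ = encoder(chars)
lm_loss = F.cross_entropy(lm_head(states).transpose(1, 2), next_chars)
ner_loss = F.cross_entropy(tag_head(states).transpose(1, 2), tags)
loss = ner_loss + lm_loss  # joint objective; the LM term needs no extra human annotation
loss.backward()
```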
ELMo: Pre-train Word-Level Neural LM
- Add ELMo at the input of the RNN. For some tasks (SNLI, SQuAD), including ELMo at the output brings further improvements.
- Key points (a small scalar-mix sketch follows this list):
  - Freeze the weights of the biLM
  - Regularization is necessary
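A hedged sketch of an ELMo-style "scalar mix" that matches these key points: the pre-trained biLM layers stay frozen, and the task model learns only softmax-normalized layer weights and a global scale. Class and variable names are my own, not the authors' code.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))  # per-layer weights
        self.gamma = nn.Parameter(torch.ones(1))        # global scale

    def forward(self, layers):   # layers: (L, batch, seq, dim), produced by a frozen biLM
        w = torch.softmax(self.s, dim=0)
        return self.gamma * (w.view(-1, 1, 1, 1) * layers).sum(0)

frozen_layers = torch.randn(3, 2, 10, 64)   # stand-in for 3 frozen biLM layers
elmo_repr = ScalarMix(3)(frozen_layers)     # then concatenated with word embeddings
```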
[Architecture figure: the characters of "Pierre Vinken , ..." are fed through forward and backward pre-trained LMs (fixed LSTMs), down-projected, concatenated with word embeddings, and passed to a CRF for sequence labeling.]
LD-Net: An efficient version of ELMo
- Makes contextualized representations efficient without much loss of effectiveness
- How is this even possible?
  - A pre-trained language model contains abundant information; however, for a specific task, only part of it is useful
Flair: Pre-Train Neural LM at All Levels
- Even for character-level language models, pre-training is very important
- The structure is the same as LM-LSTM-CRF; the difference is that pre-training is conducted on an additional corpus
BERT: Introduce Transformer
- Introduces Transformers; uses masked language modeling + next-sentence prediction
- Conducts fine-tuning on each task after pre-training (necessary for sentence-level tasks; NER is a word-level task). A hedged fine-tuning sketch follows.
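A hedged sketch of fine-tuning BERT for NER as token classification. The Hugging Face `transformers` library is my choice of tooling (the original work used Google's released code), and the label count and placeholder labels are illustrative only; wordpiece-to-word label alignment is glossed over.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=9)          # e.g. a CoNLL03-sized BIO label set

enc = tokenizer("Jim bought shares of Acme Corp.", return_tensors="pt")
labels = torch.zeros_like(enc["input_ids"])   # placeholder: one label id per wordpiece
loss = model(**enc, labels=labels).loss       # fine-tune the whole network on this loss
loss.backward()
```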
New State-of-the-Art Results (F1 on CoNLL03)

- Word-level language models
  - ELMo (Peters et al., NAACL'18, best paper): 92.2
  - LD-Net (Liu et al., EMNLP'18): 92.0, ~5x faster
- Character-level language models
  - LM-LSTM-CRF (Liu et al., AAAI'18): 91.4
  - Flair (Akbik et al., COLING'18): 93.1
- Hybrid language models
  - Cross-View Training (Clark et al., EMNLP'18): 92.6
  - BERT (Devlin et al., NAACL'19, best paper): 92.4 / 92.8
Questions
- Can massive raw texts help? → Neural language models
- Can dictionaries help?
- Are human annotations always correct?
- Is the tokenizer always good?
Distantly Supervised NER
- Input
  - Unlabeled raw texts
  - An entity dictionary: entity type, canonical name, [synonym_1, synonym_2, ..., synonym_k] (a hypothetical entry is sketched after this list)
- Output
  - An NER model that recognizes entities of the types appearing in the given dictionary
  - Note that the entities to be recognized can be unseen entities
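A hypothetical dictionary entry in the format described above; the field names are my own, not a prescribed schema:

```python
# One entry of an entity dictionary: type, canonical name, and synonyms.
entry = {
    "type": "Disease",
    "canonical_name": "type 2 diabetes mellitus",
    "synonyms": ["type 2 diabetes", "T2DM", "adult-onset diabetes"],
}
```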
Distantly Supervised NER Methods
- String-match / rule-based distant supervision generation
  - AutoEntity, SwellShark, ClusType, ...
  - Leave entity span detection to experts
  - POS-tag rules (e.g., regular expressions)
- Distant-LSTM-CRF
  - Leverages AutoPhrase to extract "aspect terms"
- AutoNER
  - A novel "Tie-or-Break" labeling scheme + a tailored neural model
SwellShark: Distantly Supervised Typing
- Data programming for typing
- Entity span detection: regular expressions based on part-of-speech (POS) tags
  - Requires expert effort
  - Candidate generators
Distant-LSTM-CRF: Use Phrase Mining as Supervision + LSTM-CRF
- AutoPhrase + LSTM-CRF
  - AutoPhrase generates labels
    - Heuristically set thresholds
  - LSTM-CRF builds models
    - Both word & character information are used
- Problem
  - High thresholds are needed for clean positive labels → many false-negative labels
AutoNER: Dual Dictionaries
- A core dictionary
  - Leads to high-precision but low-recall matches
- A "full" dictionary
  - Leads to high-recall but low-precision matches
  - Introduces out-of-dictionary high-quality phrases as new entities
  - Their types are "unknown": each could be any IOBES tag + any type
AutoNER: Fuzzy-LSTM-CRF Baseline
AutoNER: “Tie or Break”
- Instead of labeling each token, we tag the connection between two adjacent tokens.
- For every two adjacent tokens, the connection between them is labeled as:
  - (1) Tie, when the two tokens are matched to the same entity;
  - (2) Unknown, if at least one of the tokens belongs to an unknown-typed high-quality phrase;
  - (3) Break, otherwise.
  (A small reconstruction of this labeling is sketched after this list.)
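A reconstruction of the labeling rules above as a small function; the span representation and the example sentence are my own assumptions:

```python
def tie_or_break(tokens, entity_spans, unknown_spans):
    """Spans are (start, end_exclusive) token index pairs; returns one label
    per connection between adjacent tokens i and i+1."""
    labels = []
    for i in range(len(tokens) - 1):
        if any(s <= i and i + 1 < e for s, e in entity_spans):
            labels.append("Tie")        # both tokens matched to the same entity
        elif any(s <= i < e or s <= i + 1 < e for s, e in unknown_spans):
            labels.append("Unknown")    # touches an unknown-typed high-quality phrase
        else:
            labels.append("Break")
    return labels

tokens = "ceritinib is a ALK inhibitor".split()
print(tie_or_break(tokens, entity_spans=[(3, 5)], unknown_spans=[(0, 1)]))
# ['Unknown', 'Break', 'Break', 'Tie']
```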
AutoNER: Tailored Neural Model
Comparison – Biomedical Domain
Questions
- Can massive raw texts help? → Neural language models
- Can dictionaries help? → Distantly supervised setting
- Are human annotations always correct?
- Is the tokenizer always good?
Typical Annotation Mistakes in CoNLL03
- The state-of-the-art F1 score on this test set is already around 93%
- ~5.38% of the test sentences have annotation mistakes
  - A significant amount!
Evaluation on Corrected Test Set
- Higher F1 scores with smaller variance
- Better reflects the real performance
- This corrected test set should be adopted in future research
CrossWeigh: Handle Noisy Training Set
Key Problem: How to Partition k-Folds?
- Random partitioning may be ineffective
  - Neural NER models will overfit the annotation mistakes observed during training
- Entity-disjoint filtering (a reconstruction is sketched after this list)
  - In each fold, if a "training" sentence contains any entity that appears in the held-out "testing" fold, it is discarded during "training"
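A minimal sketch of a k-fold partition with entity-disjoint filtering, reconstructed from the description above; function and variable names are my own, not the CrossWeigh implementation:

```python
import random

def entity_disjoint_folds(sentences, entities_of, k=10, seed=0):
    """sentences: iterable of sentence ids; entities_of[i]: set of entity strings
    in sentence i. Yields (train_ids, held_out_ids) per fold."""
    ids = list(sentences)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for held_out in folds:
        held_entities = set().union(*(entities_of[i] for i in held_out)) if held_out else set()
        # Drop any "training" sentence sharing an entity with the held-out fold.
        train = [i for i in ids
                 if i not in held_out and not (entities_of[i] & held_entities)]
        yield train, held_out

entities_of = {0: {"EU"}, 1: {"Germany"}, 2: {"EU", "Peter Blackburn"}, 3: set()}
for train, held_out in entity_disjoint_folds(range(4), entities_of, k=2):
    print(train, held_out)
```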
CrossWeigh: Evaluation
- CrossWeigh is effective with many NER models
- Entity-disjoint filtering is important
- Twitter & low-resource settings
CrossWeigh: Identify Annotation Mistakes
- Use the CoNLL03 train, dev & test sets together as a super training set
- Apply CrossWeigh to identify annotation mistakes in the test set
- Evaluate against 186 human corrections
  - Almost 80% of the mistakes can be detected
Questions
- Can massive raw texts help? → Neural language models
- Can dictionaries help? → Distantly supervised setting
- Are human annotations always correct? → Auto-correction
- Is the tokenizer always good?
Typical NER Pipeline System
- Pre-processing tools are applied first
An Interesting Observation
- Broad Twitter Corpus (BTC): a Twitter NER dataset
- spaCy: a popular Python NLP library
- spaCy tokenization + the BTC dataset
  - → Word boundaries of more than 45% of named entities are incorrectly identified! (A small boundary check is sketched after this list.)
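A small illustration (my own, not from the slides) of checking whether an entity's character span lines up with spaCy's token boundaries; the tweet and the character offsets are made up for the example, and whether it aligns depends on the tokenizer's rules.

```python
import spacy

nlp = spacy.blank("en")   # blank English pipeline; only the tokenizer is used

def boundaries_ok(text, start, end):
    """True iff characters [start, end) begin and end exactly at token boundaries."""
    doc = nlp(text)
    starts = {t.idx for t in doc}
    ends = {t.idx + len(t.text) for t in doc}
    return start in starts and end in ends

tweet = "shoutout to NewYork-Presbyterian!!its the best"
print(boundaries_ok(tweet, 12, 32))   # span of "NewYork-Presbyterian" in the tweet
```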
Neural-Char-CRF: Raw-to-End
- We propose to conduct NER training in a raw-to-end manner
  - Raw text as the input & predictions at the character level
Neural-Char-CRF: String Match
- Prefer to match words with a higher Inverse Document Frequency (IDF); a sketch of the idea follows.
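A sketch of IDF-preferring string matching, under my own assumptions about the scoring and threshold; it illustrates the idea rather than the Neural-Char-CRF implementation.

```python
import math
import re

def idf(term, documents):
    """Higher IDF means the surface form is rarer and thus more informative."""
    df = sum(1 for doc in documents if term.lower() in doc.lower())
    return math.log((1 + len(documents)) / (1 + df))

def match_entities(text, dictionary, documents, min_idf=1.0):
    """dictionary: list of (surface_form, type). Keep only matches whose IDF is
    high enough; very common words are too ambiguous to trust as entities."""
    matches = []
    for surface, etype in dictionary:
        for m in re.finditer(re.escape(surface), text):
            if idf(surface, documents) >= min_idf:
                matches.append((m.start(), m.end(), etype))
    return matches
```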
Neural-Char-CRF: Character-Level LM
- A character-level neural language model is leveraged
  - Pre-training + contextualized representations
Comparison – Twitter NER Datasets
- The tokenizer matters
  - NLTK is the best on both datasets
- Raw-to-End wins
  - String Match is even better
Summary & Q&A
- Using neural language models, massive raw texts can help!
- High-quality dictionaries can help!
- Human annotations are NOT always correct!
- The tokenizer is not that important and can sometimes even hurt!