Named Entity Recognition from a Data-Driven Perspective
Jingbo Shang
Computer Science & Engineering and Halıcıoğlu Data Science Institute
University of California, San Diego
Outline
- Background
  - What's Named Entity Recognition (NER)?
  - What's "Data-Driven"?
- Data-Driven NER Methods
What’s Named Entity Recognition?
- Wikipedia: Named-entity recognition (NER) is a subtask of information extraction (IE) that seeks to locate and classify named entities in text into pre-defined categories.
- In IE, a named entity is a real-world object.
- Example
  - Input: Jim bought 300 shares of Acme Corp. in 2006.
  - Output: [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.
Supervised Methods: Training Data
- Sequence labeling framework
- Two popular schemes
  - BIO: Begin, In, Out
  - BIOES: Begin, In, Out, End, Singleton
  - BIOES is arguably better than BIO (Ratinov and Roth, ACL'09)
- Example (a conversion sketch follows this list):
  - LABELS: [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.
  - TOKENS: Jim bought 300 shares of Acme Corp. in 2006 .
  - BIO:   B-PER O O O O B-ORG I-ORG O B-Time O
  - BIOES: S-PER O O O O B-ORG E-ORG O S-Time O
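A minimal sketch (not from the slides) of how the two schemes label the example sentence above; the token indices and the helper name `to_tags` are my own:

```python
# Convert span annotations into BIO or BIOES tag sequences.
def to_tags(tokens, spans, scheme="BIOES"):
    """spans: list of (start_token, end_token_exclusive, type)."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        if scheme == "BIOES" and end - start == 1:
            tags[start] = f"S-{etype}"       # single-token entity
            continue
        tags[start] = f"B-{etype}"           # begin
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"           # inside
        if scheme == "BIOES":
            tags[end - 1] = f"E-{etype}"     # end

    return tags

tokens = "Jim bought 300 shares of Acme Corp. in 2006 .".split()
spans = [(0, 1, "PER"), (5, 7, "ORG"), (8, 9, "Time")]
print(to_tags(tokens, spans, "BIO"))    # B-PER O O O O B-ORG I-ORG O B-Time O
print(to_tags(tokens, spans, "BIOES"))  # S-PER O O O O B-ORG E-ORG O S-Time O
```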
Supervised Methods: Neural Models
- Two pioneer models
  - LSTM-CRF (Lample et al., NAACL'16)
  - LSTM-CNN-CRF (Ma and Hovy, ACL'16)
- The first neural models to outperform models based on handcrafted features

|                 | LSTM-CRF            | LSTM-CNN-CRF        |
| Word-level      | Bidirectional LSTMs | Bidirectional LSTMs |
| Character-level | Bidirectional LSTMs | Convolutional NN    |
“Data-Driven” Philosophy
- Key
  - Enhance NER performance without introducing any additional human annotations
- Questions
  - Can massive raw texts help?
  - Can dictionaries help?
  - Are human annotations always correct?
  - Is the tokenizer always good?
  - ...
Questions
- Can massive raw texts help?
- Can dictionaries help?
- Are human annotations always correct?
- Is the tokenizer always good?
Word Embedding → Language Model (LM)

- Using language models for better representations
  - Word-level language models
    - ELMo (Peters et al., NAACL'18, best paper)
    - LD-Net (Liu et al., EMNLP'18)
  - Character-level language models
    - LM-LSTM-CRF (Liu et al., AAAI'18)
    - Flair (Akbik et al., COLING'18)
  - Hybrid language models
    - Cross-View Training (Clark et al., EMNLP'18)
    - BERT (Devlin et al., NAACL'19, best paper)
What’s (Neural) Language Model?
[Figure: the input words "Obama was born" are encoded as one-hot vectors, mapped through a word embedding matrix, and used to predict the target words "was born in".]
- Describes the generation of text: predicting the next word based on the previous context (a minimal sketch follows this list)
- Pros
  - Does not require any human annotations
  - Nearly unlimited training data!
- The resulting models can generate sentences of unexpectedly high quality
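A minimal next-word language model sketch in PyTorch; the toy vocabulary, layer sizes, and module choices are my own assumptions rather than a specific model from the slides:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy vocabulary for the running example; real LMs use huge vocabularies.
vocab = ["<unk>", "Obama", "was", "born", "in", "Hawaii"]
word2id = {w: i for i, w in enumerate(vocab)}

embed = nn.Embedding(len(vocab), 8)     # word embeddings
rnn = nn.LSTM(8, 16, batch_first=True)  # context encoder
out = nn.Linear(16, len(vocab))         # distribution over the next word

text = ["Obama", "was", "born", "in", "Hawaii"]
ids = torch.tensor([[word2id[w] for w in text]])
inputs, targets = ids[:, :-1], ids[:, 1:]   # shift by one: predict the next word

hidden, _ = rnn(embed(inputs))
loss = F.cross_entropy(out(hidden).transpose(1, 2), targets)
loss.backward()  # standard LM training step; the "labels" come from raw text itself
```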
Neural LM: Example Generations
- Character-by-character Markdown generations:

  '''See also''': [[List of ethical consent processing]]

  == See also ==
  * [[Iender dome of the ED]]
  * [[Anti-autism]]

  === [[Religion|Religion]] ===
  * [[French Writings]]
  * [[Maria]]
  * [[Revelation]]

- Valid syntax!
Neural LM: Example Generations
- "Deep Donald Trump": mimic President Trump
- Fooled many Twitter users
LM-LSTM-CRF: Co-Train Neural LM
- Proposes to use a character-level language model as a co-training objective (a minimal sketch follows)
- Why character-level?
  - More efficient & more robust to pre-processing
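A minimal sketch of the co-training idea under my own assumptions (toy sizes, and a plain token-level cross-entropy standing in for the CRF loss): one shared character-level encoder feeds both a next-character LM head and a tagging head, and the two losses are simply summed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, hidden, n_tags = 100, 50, 9
encoder = nn.LSTM(input_size=vocab, hidden_size=hidden, batch_first=True)  # shared encoder
lm_head = nn.Linear(hidden, vocab)    # predicts the next character
tag_head = nn.Linear(hidden, n_tags)  # emission scores for the tagger (CRF in the real model)

chars = F.one_hot(torch.randint(0, vocab, (2, 40)), vocab).float()  # (batch, length, vocab)
next_chars = torch.randint(0, vocab, (2, 40))
tags = torch.randint(0, n_tags, (2, 40))

states, _ = encoder(chars)
lm_loss = F.cross_entropy(lm_head(states).transpose(1, 2), next_chars)
ner_loss = F.cross_entropy(tag_head(states).transpose(1, 2), tags)
loss = ner_loss + lm_loss  # joint objective; the LM term needs no extra human annotation
loss.backward()
```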
ELMo: Pre-train Word-Level Neural LM
- Add ELMo at the input of the RNN. For some tasks (SNLI, SQuAD), including ELMo at the output brings further improvements.
- Key points (a small scalar-mix sketch follows this list):
  - Freeze the weights of the biLM
  - Regularization is necessary
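A hedged sketch of an ELMo-style "scalar mix" that matches these key points: the pre-trained biLM layers stay frozen, and the task model learns only softmax-normalized layer weights and a global scale. Class and variable names are my own, not the authors' code.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))  # per-layer weights
        self.gamma = nn.Parameter(torch.ones(1))        # global scale

    def forward(self, layers):   # layers: (L, batch, seq, dim), produced by a frozen biLM
        w = torch.softmax(self.s, dim=0)
        return self.gamma * (w.view(-1, 1, 1, 1) * layers).sum(0)

frozen_layers = torch.randn(3, 2, 10, 64)   # stand-in for 3 frozen biLM layers
elmo_repr = ScalarMix(3)(frozen_layers)     # then concatenated with word embeddings
```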
[Architecture figure: the characters of "Pierre Vinken , ..." are fed through forward and backward pre-trained LMs (fixed LSTMs), down-projected, concatenated with word embeddings, and passed to a CRF for sequence labeling.]
LD-Net: An efficient version of ELMo
- Makes contextualized representations efficient without much loss of effectiveness
- How is this even possible?
  - A pre-trained language model contains abundant information; however, for a specific task, only part of it is useful
Flair: Pre-Train Neural LM at All Levels
- Even for character-level language models, pre-training is very important
- The structure is the same as LM-LSTM-CRF; the difference is that pre-training is conducted on an additional corpus
BERT: Introduce Transformer
- Introduces Transformers; uses masked language modeling + next-sentence prediction
- Conducts fine-tuning on each task after pre-training (necessary for sentence-level tasks; NER is a word-level task). A hedged fine-tuning sketch follows.
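A hedged sketch of fine-tuning BERT for NER as token classification. The Hugging Face `transformers` library is my choice of tooling (the original work used Google's released code), and the label count and placeholder labels are illustrative only; wordpiece-to-word label alignment is glossed over.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=9)          # e.g. a CoNLL03-sized BIO label set

enc = tokenizer("Jim bought shares of Acme Corp.", return_tensors="pt")
labels = torch.zeros_like(enc["input_ids"])   # placeholder: one label id per wordpiece
loss = model(**enc, labels=labels).loss       # fine-tune the whole network on this loss
loss.backward()
```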
New State-of-the-Art Results (F1 on CoNLL03)

- Word-level language models
  - ELMo (Peters et al., NAACL'18, best paper): 92.2
  - LD-Net (Liu et al., EMNLP'18): 92.0, ~5x faster
- Character-level language models
  - LM-LSTM-CRF (Liu et al., AAAI'18): 91.4
  - Flair (Akbik et al., COLING'18): 93.1
- Hybrid language models
  - Cross-View Training (Clark et al., EMNLP'18): 92.6
  - BERT (Devlin et al., NAACL'19, best paper): 92.4 / 92.8
Questions
- Can massive raw texts help? → Neural language models
- Can dictionaries help?
- Are human annotations always correct?
- Is the tokenizer always good?
Distantly Supervised NER
- Input
  - Unlabeled raw texts
  - An entity dictionary: entity type, canonical name, [synonym_1, synonym_2, ..., synonym_k] (a hypothetical entry is sketched after this list)
- Output
  - An NER model that recognizes entities of the types appearing in the given dictionary
  - Note that the entities to be recognized can be unseen entities
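A hypothetical dictionary entry in the format described above; the field names are my own, not a prescribed schema:

```python
# One entry of an entity dictionary: type, canonical name, and synonyms.
entry = {
    "type": "Disease",
    "canonical_name": "type 2 diabetes mellitus",
    "synonyms": ["type 2 diabetes", "T2DM", "adult-onset diabetes"],
}
```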
Distantly Supervised NER Methods
- String-match / rule-based distant supervision generation
  - AutoEntity, SwellShark, ClusType, ...
  - Leave entity span detection to experts
  - POS-tag rules (e.g., regular expressions)
- Distant-LSTM-CRF
  - Leverages AutoPhrase to extract "aspect terms"
- AutoNER
  - A novel "Tie-or-Break" labeling scheme + a tailored neural model
SwellShark: Distantly Supervised Typing
- Data programming for typing
- Entity span detection: regular expressions based on part-of-speech (POS) tags
  - Requires expert effort
  - Candidate generators
Distant-LSTM-CRF: Use Phrase Mining as Supervision + LSTM-CRF
- AutoPhrase + LSTM-CRF
  - AutoPhrase generates labels
    - Heuristically set thresholds
  - LSTM-CRF builds models
    - Both word & character information are used
- Problem
  - High thresholds are needed for clean positive labels → many false-negative labels
AutoNER: Dual Dictionaries
- A core dictionary
  - Leads to high-precision but low-recall matches
- A "full" dictionary
  - Leads to high-recall but low-precision matches
  - Introduces out-of-dictionary high-quality phrases as new entities
  - Their types are "unknown": each could be any IOBES tag + any type
AutoNER: Fuzzy-LSTM-CRF Baseline
AutoNER: “Tie or Break”
- Instead of labeling each token, we tag the connection between two adjacent tokens.
- For every two adjacent tokens, the connection between them is labeled as:
  - (1) Tie, when the two tokens are matched to the same entity;
  - (2) Unknown, if at least one of the tokens belongs to an unknown-typed high-quality phrase;
  - (3) Break, otherwise.
  (A small reconstruction of this labeling is sketched after this list.)
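A reconstruction of the labeling rules above as a small function; the span representation and the example sentence are my own assumptions:

```python
def tie_or_break(tokens, entity_spans, unknown_spans):
    """Spans are (start, end_exclusive) token index pairs; returns one label
    per connection between adjacent tokens i and i+1."""
    labels = []
    for i in range(len(tokens) - 1):
        if any(s <= i and i + 1 < e for s, e in entity_spans):
            labels.append("Tie")        # both tokens matched to the same entity
        elif any(s <= i < e or s <= i + 1 < e for s, e in unknown_spans):
            labels.append("Unknown")    # touches an unknown-typed high-quality phrase
        else:
            labels.append("Break")
    return labels

tokens = "ceritinib is a ALK inhibitor".split()
print(tie_or_break(tokens, entity_spans=[(3, 5)], unknown_spans=[(0, 1)]))
# ['Unknown', 'Break', 'Break', 'Tie']
```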
AutoNER: Tailored Neural Model
Comparison – Biomedical Domain
Questions
- Can massive raw texts help? → Neural language models
- Can dictionaries help? → Distantly supervised setting
- Are human annotations always correct?
- Is the tokenizer always good?
Typical Annotation Mistakes in CoNLL03
- The state-of-the-art F1 score on this test set is already around 93%
- ~5.38% of the test sentences have annotation mistakes
  - A significant amount!
Evaluation on Corrected Test Set
- Higher F1 scores with smaller variance
- Better reflects the real performance
- This corrected test set should be adopted in future research
CrossWeigh: Handle Noisy Training Set
Key Problem: How to Partition k-Folds?
- Random partitioning may be ineffective
  - Neural NER models will overfit the annotation mistakes observed during training
- Entity-disjoint filtering (a reconstruction is sketched after this list)
  - In each fold, if a "training" sentence contains any entity that appears in the held-out "testing" fold, it is discarded during "training"
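A minimal sketch of a k-fold partition with entity-disjoint filtering, reconstructed from the description above; function and variable names are my own, not the CrossWeigh implementation:

```python
import random

def entity_disjoint_folds(sentences, entities_of, k=10, seed=0):
    """sentences: iterable of sentence ids; entities_of[i]: set of entity strings
    in sentence i. Yields (train_ids, held_out_ids) per fold."""
    ids = list(sentences)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for held_out in folds:
        held_entities = set().union(*(entities_of[i] for i in held_out)) if held_out else set()
        # Drop any "training" sentence sharing an entity with the held-out fold.
        train = [i for i in ids
                 if i not in held_out and not (entities_of[i] & held_entities)]
        yield train, held_out

entities_of = {0: {"EU"}, 1: {"Germany"}, 2: {"EU", "Peter Blackburn"}, 3: set()}
for train, held_out in entity_disjoint_folds(range(4), entities_of, k=2):
    print(train, held_out)
```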
CrossWeigh: Evaluation
- CrossWeigh is effective with many NER models
- Entity-disjoint filtering is important
- Twitter & low-resource settings
CrossWeigh: Identify Annotation Mistakes
- Use the CoNLL03 train, dev & test sets together as a super training set
- Apply CrossWeigh to identify annotation mistakes in the test set
- Evaluate against 186 human corrections
  - Almost 80% of the mistakes can be detected
Questions
- Can massive raw texts help? → Neural language models
- Can dictionaries help? → Distantly supervised setting
- Are human annotations always correct? → Auto-correction
- Is the tokenizer always good?
Typical NER Pipeline System
- Pre-processing tools are applied first
An Interesting Observation
- Broad Twitter Corpus (BTC): a Twitter NER dataset
- spaCy: a popular Python NLP library
- spaCy tokenization + the BTC dataset
  - → Word boundaries of more than 45% of named entities are incorrectly identified! (A small boundary check is sketched after this list.)
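A small illustration (my own, not from the slides) of checking whether an entity's character span lines up with spaCy's token boundaries; the tweet and the character offsets are made up for the example, and whether it aligns depends on the tokenizer's rules.

```python
import spacy

nlp = spacy.blank("en")   # blank English pipeline; only the tokenizer is used

def boundaries_ok(text, start, end):
    """True iff characters [start, end) begin and end exactly at token boundaries."""
    doc = nlp(text)
    starts = {t.idx for t in doc}
    ends = {t.idx + len(t.text) for t in doc}
    return start in starts and end in ends

tweet = "shoutout to NewYork-Presbyterian!!its the best"
print(boundaries_ok(tweet, 12, 32))   # span of "NewYork-Presbyterian" in the tweet
```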
Neural-Char-CRF: Raw-to-End
- We propose to conduct NER training in a raw-to-end manner
  - Raw text as the input & predictions at the character level
Neural-Char-CRF: String Match
- Prefer to match words with a higher Inverse Document Frequency (IDF); a sketch of the idea follows.
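A sketch of IDF-preferring string matching, under my own assumptions about the scoring and threshold; it illustrates the idea rather than the Neural-Char-CRF implementation.

```python
import math
import re

def idf(term, documents):
    """Higher IDF means the surface form is rarer and thus more informative."""
    df = sum(1 for doc in documents if term.lower() in doc.lower())
    return math.log((1 + len(documents)) / (1 + df))

def match_entities(text, dictionary, documents, min_idf=1.0):
    """dictionary: list of (surface_form, type). Keep only matches whose IDF is
    high enough; very common words are too ambiguous to trust as entities."""
    matches = []
    for surface, etype in dictionary:
        for m in re.finditer(re.escape(surface), text):
            if idf(surface, documents) >= min_idf:
                matches.append((m.start(), m.end(), etype))
    return matches
```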
Neural-Char-CRF: Character-Level LM
- A character-level neural language model is leveraged
  - Pre-training + contextualized representations
Comparison – Twitter NER Datasets
- The tokenizer matters
  - NLTK is the best on both datasets
- Raw-to-End wins
  - String Match is even better
Summary & Q&A
- Using neural language models, massive raw texts can help!
- High-quality dictionaries can help!
- Human annotations are NOT always correct!
- The tokenizer is not that important and can sometimes even hurt!