Detection of MultiWord Expression and Name Entity Recognition Multilingual Multiword Expression by Dhirendra Pratap Singh & Rudra Murthy & Dr. Pushpak Bhattacharyya CSE Dept. Indian Institute of Technology Bombay

  • Detection of MultiWord Expression and Name Entity Recognition

    Multilingual Multiword Expression

    by Dhirendra Pratap Singh & Rudra Murthy & Dr. Pushpak Bhattacharyya

    CSE Dept.

    Indian Institute of Technology Bombay

  • Outline

    Introduction

    Research Problem

    Motivation

    Classification and Characteristics of MWEs

    MWEs Detection Approach

    Experiment and Results

    References

  • Introduction

    MultiWord Expressions (MWEs):

    • A group of two or more words that come together and act as a single semantic unit

    • Can be characterized from various linguistic perspectives: lexical, syntactic, semantic, or purely statistical

    Some MWEs in English:

    • Cloud nine

    • Kick the bucket

    • Swimming pool

    Some MWEs in Hindi:

    • धन दौलत (dhana daulat, wealth)

    • चाय पानी (chai paani, snacks)

    • नौ दो ग्यारा होना (nau do gyara hona, run away)

  • Research Problems

    Identification

    - How can we locate the tokens that correspond to MWEs?

    - Unfortunately, he X the bucket

    - X = located: not an MWE

    - X = kicked: an MWE

    Disambiguation

    - Is it really an MWE in the current context?

    - India and Pakistan broke bridges over the Mumbai blast issue

    - India and Pakistan broke bridges over the Wagah border

  • Motivation

    Many NLP applications face problems due to MWEs:

    Machine Translation

    Information Retrieval

  • Motivation contd..

    Machine Translation

    मैंने धोखा खाया [Hindi → English]

    Google: I cheat eat

    Correct: I was cheated

    She kicked the bucket [English → Hindi]

    Google: वह बाल्टी लात मारी

    Correct: वह मर गयी

  • Motivation contd..

    Information Retrieval

    Query: “burned bridges”

    Google: Incidents of burning bridges

    Actual: Incidents of broken ties

    Detection of MultiWord Expression using Word Embeddings and WordNet-based Features

  • MWEs Characteristics

    Compositionality

    • Partially compositional: the meaning of the expression can be partly derived from the meanings of its constituent words. E.g. तरण ताल (swimming pool), धन लक्ष्मी (wealth), चाय पानी (snacks), etc.

    • Non-compositional: the meaning of the expression cannot be completely determined from the meanings of its constituent words. E.g. काली जुबान, दम तोड़ना (pass away), Cloud Nine, etc.

    Idioms

    • Decomposable: Spill the beans

    • Non-decomposable: Kick the bucket

    Collocation

    Collocations are fixed expressions that appear very frequently in running text. E.g. कड़क चाय (strong tea), काला धन (black money), etc.

    Non-Substitutability

    The constituent words cannot be substituted by their synonyms. E.g. अंक पत्र, क्षय-तिथि, स्वेच्छा-मृत्यु

  • MWEs Classification (Sag et al., 2002)

  • MWE Detection Approach

    Hindi WordNet-based Feature approach

    Word Embeddings approach

    Using WordNet and Word Embeddings with Exact match

  • Hindi WordNet-based approach

    Hindi WordNet is the most useful lexical resource for Indian languages

    It is a lexical structure composed of synsets and of semantic and lexical relations

    It can be used in various NLP applications

    http://www.cfilt.iitb.ac.in/wordnet/hwn/

  • Hindi WordNet-based approach contd..

    Using WordNet-based Features:

    Consider a word pair w1 w2

    • $BOW(w_1) = \{ W' \mid W' \in \mathrm{WordNetBasedFeatures}(w_1) \}$

    • $BOW(w_2) = \{ W' \mid W' \in \mathrm{WordNetBasedFeatures}(w_2) \}$

    where WordNetBasedFeatures(wi) contains all content words from the synonyms, gloss, example(s), hypernyms, hyponyms, meronyms, and antonyms of the word wi. We consider only one level of the hierarchy when extracting these semantic features.

    If $w_1 \in BOW(w_2)$, then w1 w2 is an MWE

    If $w_2 \in BOW(w_1)$, then w1 w2 is an MWE
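The membership rule above can be sketched in a few lines of Python. This is only an illustration: the `FEATURES` dictionary is a hand-made stand-in for one-level Hindi WordNet lookups, its entries are invented, the words are transliterated for readability, and the helper names `bow` and `is_mwe` are hypothetical.

```python
# Toy stand-in for the one-level WordNet features of each word.
# A real system would collect these from Hindi WordNet (synonyms, gloss,
# examples, hypernyms, ...); the entries below are invented for illustration.
FEATURES = {
    "dhana": {"daulat", "sampatti", "paisa"},
    "daulat": {"dhana", "sampatti"},
    "chai": {"peya", "patti"},
    "kursi": {"aasan", "furniture"},
}


def bow(word):
    """Return BOW(word): the bag of WordNet-based feature words."""
    return FEATURES.get(word, set())


def is_mwe(w1, w2):
    """Flag w1 w2 as an MWE if either word occurs in the other's BOW."""
    return w1 in bow(w2) or w2 in bow(w1)


print(is_mwe("dhana", "daulat"))  # daulat is in BOW(dhana) in the toy data
print(is_mwe("chai", "kursi"))    # neither word is in the other's BOW
```

Words missing from the resource get an empty bag, so a pair is never flagged on the basis of an unknown word.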

  • Detection of CjVs using IndoWordNet-based features approach

    We detect Light Verb Constructions using ontological features from IndoWordNet:

    Consider a word pair w1w2

    Detection of MultiWord Expression using WordNet-based Features

  • Experiment and result

    Compound Nouns (CNs)

    Table 1: Result of Compound Noun Detection

    Languages   Total Pairs (N+N)   F-score
    Hindi       1000                0.58
    Marathi     1000                0.72
    Bengali     1000                0.53
    Punjabi     1000                0.43
    Konkani     1000                0.53
    Odia        1000                0.38
    Assamese    1000                0.40

  • Experiment and result

    Conjunct Verb (CjVs)

    Table 1: Result of Conjunct Verb Detection

    Languages   Total pairs (N+V)   F-score   Total pairs (Adj+V)   F-score
    Hindi       457                 0.87      577                   0.89
    Marathi     404                 0.86      502                   0.88
    Bengali     797                 0.87      303                   0.92
    Punjabi     1017                0.80      307                   0.90
    Konkani     879                 0.84      269                   0.95
    Odia        832                 0.85      368                   0.91
    Assamese    703                 0.84      259                   0.94

    Table 2: Result of Compound Verb Detection

    Languages   Total Pairs (V+V)   F-score
    Hindi       399                 0.99
    Marathi     504                 0.88

  • Word Embedding approach

    Word embeddings are based on the Distributional Hypothesis, which assumes that similar words occur in similar contexts (Harris, 1954)

    They represent each word as a low-dimensional real-valued vector, with similar words lying closer together in that space

    The word2vec tool is used to obtain the word embeddings

    It captures many linguistic regularities among words, for example,

    Vector(‘king’) - Vector(‘man’) + Vector(‘woman’) ≈ Vector(‘queen’)

    Word Embeddings for Hindi: trained on Bojar’s (2014) corpus (44 M sentences) with the Skip-gram model, 200 dimensions, and a window size of 7

    Word Embeddings: Linguistic regularities among words
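The king/queen regularity can be illustrated with a minimal sketch. The vectors below are hand-made assumptions, chosen so the arithmetic is easy to follow. They are not word2vec output, which would be learned from a corpus and have many more dimensions. The names `VECTORS`, `cosine` and `analogy` are hypothetical.

```python
import math

# Hand-made 4-dimensional toy vectors (royalty, male, female, fruit).
# These values are assumptions for illustration, not trained embeddings.
VECTORS = {
    "king": [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word closest to Vector(a) - Vector(b) + Vector(c)."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    # The three query words are excluded from the candidates.
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))


print(analogy("king", "man", "woman"))  # queen, for these toy vectors
```

With trained embeddings the match is approximate rather than exact, which is why the nearest neighbour of the resulting vector is reported instead of testing for equality.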

  • Word Embedding approach

    The closest words to the word फल in the corpus, obtained using the word2vec tool:

    Word             Cosine Distance
    फ़ल              0.840545
    केला             0.705185
    सीताफल           0.685993
    पपीता            0.682171
    सौन्दर्यवर्धक       0.677420
    कन्दमूल           0.672466
    अननास           0.655930
    भाजियाँ           0.650811
    आडू              0.650100

  • Word Embedding approach contd..

    Using Word embeddings:

    Consider a word pair w1 w2

    • $BOW(w_1) = \{ W' \mid W' \in \mathrm{IsANeighbour}(w_1) \}$

    • $BOW(w_2) = \{ W' \mid W' \in \mathrm{IsANeighbour}(w_2) \}$

    where IsANeighbour(wi) returns the top 20 neighbours of wi (according to the cosine similarity of the corresponding vectors).

    If $w_1 \in BOW(w_2)$, then w1 w2 is an MWE

    If $w_2 \in BOW(w_1)$, then w1 w2 is an MWE
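A minimal sketch of the neighbour rule, assuming tiny hand-made 2-dimensional embeddings in place of trained word2vec vectors. The words, the values and the helper names `neighbours` and `is_mwe` are hypothetical. The slide uses the top 20 neighbours; the demo uses k = 1 because a five-word toy vocabulary would otherwise make every word a neighbour of every other.

```python
import math

# Toy 2-dimensional embeddings; real vectors would come from a trained
# word2vec model. Words and values are illustrative assumptions.
EMBEDDINGS = {
    "dhana": [0.9, 0.1],
    "daulat": [0.8, 0.2],
    "chai": [0.1, 0.9],
    "paani": [0.2, 0.8],
    "kursi": [-0.7, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def neighbours(word, k):
    """Top-k words by cosine similarity to `word` (the word itself excluded)."""
    others = [w for w in EMBEDDINGS if w != word]
    others.sort(key=lambda w: cosine(EMBEDDINGS[word], EMBEDDINGS[w]), reverse=True)
    return set(others[:k])


def is_mwe(w1, w2, k=20):
    """Flag w1 w2 as an MWE if either word is among the other's top-k neighbours."""
    return w1 in neighbours(w2, k) or w2 in neighbours(w1, k)


print(is_mwe("dhana", "daulat", k=1))  # nearest neighbours of each other here
print(is_mwe("dhana", "kursi", k=1))   # not neighbours in the toy data
```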

  • Using WordNet and Word Embeddings with Exact match

    • $WNBow_1 = \{ W' \mid W' \in \mathrm{WordNetBasedFeatures}(w_1) \}$

    • $WEBow_2 = \{ W' \mid W' \in \mathrm{IsANeighbour}(w_2) \}$

    • $WNBow_1 \cap WEBow_2 \neq \phi \Rightarrow w_1 w_2$ is an MWE
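The exact-match test reduces to a non-empty set intersection. The sketch below assumes the two bags have already been built. The example bags, the transliterated words and the function name `is_mwe_exact_match` are invented for illustration.

```python
def is_mwe_exact_match(wn_bow_w1, we_bow_w2):
    """Flag w1 w2 as an MWE when the two bags share at least one word."""
    return len(wn_bow_w1 & we_bow_w2) > 0


# Hypothetical bags for the pair (dhana, daulat).
wn_bow_dhana = {"sampatti", "paisa", "daulat"}    # WordNet-based features of w1
we_bow_daulat = {"sampatti", "dhana", "khazana"}  # embedding neighbours of w2

print(is_mwe_exact_match(wn_bow_dhana, we_bow_daulat))  # shares "sampatti"
print(is_mwe_exact_match({"peya"}, {"furniture"}))      # empty intersection
```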

  • Experiment and result

    Evaluation of quality of word embeddings:

    • Word embeddings trained on the Bojar corpus are evaluated on the word-pair similarity dataset

    • Agreement among the human annotators was approx. 0.73

    • Agreement between the word embeddings (word2vec tool) and the human annotators was approx. 0.61

    Table 1: Agreement of different entities on the translated similarity dataset for Hindi

    Entities          Agreement
    Human1/Human2     0.74
    Human1/Human3     0.68
    Human2/Human3     0.77
    Word2vec/Human1   0.65
    Word2vec/Human2   0.54
    Word2vec/Human3   0.63
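The two approximate figures are consistent with simple averages of the pairwise scores in the agreement table. The check below assumes an arithmetic mean was used, which the slides do not state. The variable names are illustrative.

```python
from statistics import mean

# Pairwise agreement scores copied from the agreement table.
human_pairs = {"Human1/Human2": 0.74, "Human1/Human3": 0.68, "Human2/Human3": 0.77}
w2v_pairs = {"Word2vec/Human1": 0.65, "Word2vec/Human2": 0.54, "Word2vec/Human3": 0.63}

# Assumption: the reported "approx." values are plain arithmetic means.
human_avg = round(mean(human_pairs.values()), 2)
w2v_avg = round(mean(w2v_pairs.values()), 2)

print(human_avg)  # 0.73
print(w2v_avg)    # 0.61
```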

  • Experiment and result

    Evaluation of our approaches for MWE detection:

    • As is evident from Table 2 and Table 3, WordNet-based approaches perform comparatively better

    • The Word2vec approach performs relatively close

    Table 2: Results of Noun+Noun compounds on Hindi Dataset

    Techniques   Resources used      P      R      F-score
    Approach 1   WordNet             0.79   0.77   0.78
    Approach 2   Word2Vec            0.75   0.64   0.69
    Approach 3   Word2Vec+WordNet    0.76   0.68   0.72

    Table 3: Results of Noun+Verb compounds on Hindi Dataset

    Techniques   Resources used      P      R      F-score
    Approach 1   WordNet             0.75   0.82   0.78
    Approach 2   Word2Vec            0.56   0.75   0.64
    Approach 3   Word2Vec+WordNet    0.57   0.58   0.58
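The F-score column can be recomputed from P and R, assuming it is the balanced F-score (F1), i.e. the harmonic mean of precision and recall; the slides do not name the variant. The function name `f_score` is illustrative. Five of the six rows reproduce the reported value; the last row recomputes to 0.57 rather than 0.58, presumably because the reported P and R are themselves rounded.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F-score) rows copied from the two result tables.
rows = [
    (0.79, 0.77, 0.78), (0.75, 0.64, 0.69), (0.76, 0.68, 0.72),
    (0.75, 0.82, 0.78), (0.56, 0.75, 0.64), (0.57, 0.58, 0.58),
]

for p, r, reported in rows:
    # Print the reported value next to the value recomputed from P and R.
    print(p, r, reported, round(f_score(p, r), 2))
```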

  • Summary and Future Work

    • Our work suggests that wordnets help in the identification of MWEs.

    • Our investigation also shows that word embedding based approaches perform well. This is especially helpful for languages whose wordnets are incomplete.

    Future work:

    • Survey the behavior of MWEs across languages

    • Study linguistic features that can assist the identification of MWEs

  • Publications

    • Dhirendra P. Singh, Sudha Bhingardive and Pushpak Bhattacharyya, “Detection of Light Verb Constructions Using WordNet”, Global WordNet Conference (GWC 2016), Romania, 2016.

    • Sudha Bhingardive, Hanumant Redkar, Prateek Sappadla, Dhirendra P. Singh and Pushpak Bhattacharyya, “IndoWordNet-based Semantic Similarity Measurement”, Global WordNet Conference (GWC 2016), Romania, 2016.

    • Dhirendra P. Singh, Sudha Bhingardive and Pushpak Bhattacharyya, “Detection of MultiWord Expression Using Word Embeddings and WordNet based features”, International Conference on Natural Language Processing (ICON 2015), India.

    • Sudha Bhingardive, Dhirendra P. Singh, Rudramurthy V and Pushpak Bhattacharyya, “Using Word Embeddings for Bilingual Unsupervised WSD”, International Conference on Natural Language Processing (ICON 2015), India.

    • Sudha Bhingardive, Dhirendra P. Singh, Rudramurthy V, Hanumant Redkar and Pushpak Bhattacharyya, “Unsupervised Most Frequent Sense Detection using Word Embeddings”, North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015), Denver, Colorado, USA.

  • Publications contd..

    • Sudha Bhingardive, Ratish Puduppully, Dhirendra P. Singh and Pushpak Bhattacharyya, “Merging Senses of Hindi WordNet using Word Embeddings”, International Conference on Natural Language Processing (ICON 2014), Goa, India.

    • Dhirendra P. Singh, Sudha Bhingardive, Kevin Patel and Pushpak Bhattacharyya, “Using Word Embeddings and WordNet features for MultiWord Expression Extraction”, Linguistic Society of India (LSI 2015), JNU, Delhi, India.

    • Dhirendra P. Singh, “Linguistics Behavior of ‘Hindi Verb Collator’ in the context of Machine Translation” (Research Journal), 2013.

  • References

    Christiane Fellbaum, “WordNet: An Electronic Lexical Database”, Cambridge, MA: MIT Press, 1998.

    Pushpak Bhattacharyya, “IndoWordNet”, LREC, 2010.

    Darren Pearce, “Using conceptual similarity for collocation extraction”, Proceedings of the Fourth annual CLUK colloquium, 2001.

    Frank Smadja, “Retrieving collocations from text: Xtract”, Computational Linguistics, 1993.

    Tanmoy Chakraborty and Sivaji Bandyopadhyay, “Identification of Reduplication in Bengali Corpus and their Semantic Analysis: A Rule Based Approach”, Proceedings of the Multiword Expressions: From Theory to Applications, 2010.

    Carlos Ramisch, Aline Villavicencio, and Christian Boitet, “mwetoolkit: a Framework for Multiword Expression Identification”, In Proc. of the Seventh LREC (LREC 2010).

    Carlos Ramisch, Aline Villavicencio, and Christian Boitet, “Multiword Expressions in the wild? mwetoolkit comes in handy”, COLING, 2010.

    Veronika Vincze, Istvan Nagy T., and Gabor Berend, “Detecting noun compounds and light verb constructions: a contrastive study”, Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World (MWE 2011).

  • Introduction to Named Entity Recognition (NER)

    Rudra Murthy, Dhirendra and Dr. Pushpak Bhattacharyya

    CSE dept.

    Indian Institute of Technology Bombay

    6/22/2016 Indian Institute of Technology Bombay 27

  • Goal of this talk

    • Focus on NER

    • Give the fundamental pointers you need to develop your own NER system using various approaches

    • Understand how an NER system fits into different NLP applications such as

    - Question Answering

    - Information Retrieval

    - Information Extraction

    - Machine Translation

    - ……

  • Roadmap

    • Introduction

    • Motivation

    • Modelling

    • Conclusion

  • Introduction

    Named Entity Recognition (NER) ?

    • The task: identify lexical and phrasal information in text that expresses references to named entities (NEs)

    • A very important sub-task: find and classify names in running text

    • Used NLP layers

    • Requires a large amount of labeled training data, which is costly and time-consuming to produce

    • Upshot: for many widely spoken languages, e.g. Indian languages, no NER systems are freely available

  • NER classes?

    The named entity hierarchy is divided into four major classes: Name, Time, Numerical expression, and other miscellaneous entities in a given running text

    NER classes            Sample Categories
    Name                   Person, Organization, Location, Facilities, Artifact, Entertainment, Organisms, Plants, Diseases, Cuisines, Locomotives
    Time                   Time, Year, Month, Date, Day, Period and Special day are considered Time expressions
    Numerical expression   Numerical expressions are categorized as Distance, Money, Quantity and Count
    Miscellaneous

  • NER is the task of finding names in a given running text and classifying them into person, location, organization, or other miscellaneous entities.

    Examples are:

    - Sachin Tendulkar is the star batsman for India

    - Mohammed Amir granted UK visa

    - I am a student of Indian Institute of Technology

  • Why NER ?

    • Question Answering (QA)

    • Machine Translation (MT)

    • Information Retrieval (IR)

    • ……

  • NER in QA system ?

    • What is a QA system?

    • A QA system automatically answers questions posed by humans in natural language.

    • QA system:

    - Contains NER as a core component.

    - With NER, the task of finding some of the answers is simplified considerably.

  • Figure: a QA System backed by Knowledge Bases

    Question: Where did Sachin Tendulkar play his first test match?

    Answer: Pakistan

  • NER in MT system ?

  • Motivation

    • Dictionary Lookup

  • Dictionary Lookup ?

    Have a dictionary of all person names, location names, organization names, and miscellaneous entities such as sports teams, political parties, etc.

    Given a sentence, search the dictionary to see whether any of its phrases appear there

    Example:

    Greenland witnesses hottest June on record
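A minimal sketch of dictionary lookup, assuming a tiny hand-made gazetteer and whitespace tokenization. The entries, the labels and the names `GAZETTEER` and `dictionary_lookup` are illustrative assumptions, and the longest-match strategy is one possible choice rather than something stated in the slides.

```python
# Tiny hand-made gazetteer; entries and labels are illustrative only.
GAZETTEER = {
    ("greenland",): "LOCATION",
    ("mamata", "banerjee"): "PERSON",
    ("indian", "institute", "of", "technology"): "ORGANIZATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def dictionary_lookup(sentence):
    """Return (phrase, label) pairs found by greedy longest-match lookup."""
    tokens = sentence.split()
    found = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                match = (" ".join(tokens[i:i + n]), GAZETTEER[key])
                i += n  # skip past the matched phrase
                break
        if match:
            found.append(match)
        else:
            i += 1
    return found


print(dictionary_lookup("Greenland witnesses hottest June on record"))
```

The sketch does no punctuation handling, and each surface form maps to exactly one label, so it cannot choose between labels based on context.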

  • Dictionary Lookup ?

    Example:

    Mamata Banerjee eyes Tata booster shot despite Singur fight

    I was prosecuted to shield Tata: 2G accused Balwa

    What should be the entity label for Tata?

  • Dictionary Lookup ?

    The same word/phrase can carry different entity labels

    Example:

    Mamata Banerjee eyes Tata booster shot despite Singur fight

    I was prosecuted to shield Tata: 2G accused Balwa

    It is difficult to collect the list of all named entities, as new named entities keep appearing

  • Modelling for NER

  • Most Frequent tag

  • Most Frequent tag…..

  • Greedy Inference

  • Summary

    • We began with an introduction to NER

    • Brief overview of the Maximum Entropy approach for NER

    • Currently, much of the community is looking towards Deep Learning based approaches

  • Thank you

    Questions ?