Salient named entity identification system


SALIENT NAMED ENTITIES IDENTIFICATION FROM TWEET

INFORMATION RETRIEVAL AND EXTRACTION APRIL 2015

Course Instructor : Dr. Vasudeva Varma

Mentor : Priya R priya.r@research.iiit.ac.in

Ganesh J ganesh.j@research.iiit.ac.in

Snehit Chiluveru snehit.chiluveru@students.iiit.ac.in

Sindhura Y. R. sindhura.yr@students.iiit.ac.in

Agenda

Problem Definition
Approach
Dataset Creation
Inter-annotator Agreement
Prediction algorithm
Prediction performance
Future directions
Conclusion

What are named entities ?

Named entities (NE) are phrases that clearly identify one item from a set of other items that have similar attributes*.

They generally fall into 4 categories – Name, Place, Organization and Event.

* - Definition from http://searchbusinessanalytics.techtarget.com/definition/named-entity

Examples of NE’s

NE type        Example
Name           Virat Kohli, Sundar Pichai
Place          Delhi, California
Organization   Indian cricket team, Google
Event          2015 Cricket World Cup

What are Salient NE’s ?

Salient NE’s (SNE) are the NE’s that are more central to the text, capture the author’s intention, and are relatively important.

In short, SNE’s are the keywords of a tweet that are also named entities.

Examples of SNE’s

Vicky went to house-warming ceremony of his professor Raj Reddy at Shankerpally, Telangana.

Person A1 can mark [‘Vicky’, ‘Raj Reddy’] as SNE’s.

Person A2 can mark [‘Raj Reddy’] as SNE’s.

Person A3 can mark [‘Vicky’, ‘Raj Reddy’, ‘Shankerpally’, ‘Telangana’] as SNE’s.

Of course, all are valid annotations.

Thus, SNE annotation is highly subjective in nature.

Problem Definition

Given a tweet, identify SNE’s present in it, if any.

Novel problem.

Applications:

User modeling – Understand what the user is talking about.

Predicting current trends – So far this has been done based on hashtags only.

Approach

Create a dataset with manually annotated SNE’s.

Show the inter-annotator agreement in annotating SNE’s.

Pose the problem as a sequence learning problem.

Build the prediction algorithm with a good set of features.

Dataset creation

Consider the CWC15* tweet,

Possible SNE combinations

1. Sangakkara and Sachin Tendulkar

2. Sangakkara

3. Sachin Tendulkar

* - http://en.wikipedia.org/wiki/2015_Cricket_World_Cup

Dataset creation (contd.)

Now consider the entire tweet (which includes image)

The person in the picture is Sangakkara*

* - http://en.wikipedia.org/wiki/Kumar_Sangakkara

Intuition – The tweeter’s picture actually captures their focus.

Dataset creation (contd.)

Idea – If a NE in the tweet is in the image, then it’s a SNE.

Pipeline:

Collect tweets → Identify NE’s → Annotate SNE

Dataset creation (contd.)

Collect tweets

Use the Twitter4j API to get all the tweets with the hashtags* corresponding to the Cricket World Cup 2015 quarter-finals.

Keep only the tweets which are:
in the English language.
not re-tweets.
accompanied by at least one image.

* - #NZvWI,#WIvNZ,#PakvsAus,#AUSvPAK,#INDvsBAN,#BANvsIND,#SAvSL, #SLvSA
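The three selection criteria can be sketched as a filter over tweet records. This is a minimal sketch, not the Twitter4j code used for the project: it assumes each tweet is a dict shaped like a Twitter REST API v1.1 JSON object (`lang`, `retweeted_status`, `entities.media`), and `keep_tweet` and the sample records are hypothetical.

```python
def keep_tweet(tweet):
    """Return True if the tweet is in English, is not a re-tweet, and has an image."""
    is_english = tweet.get("lang") == "en"
    # Assumption: re-tweets carry a "retweeted_status" field, as in API v1.1 JSON.
    is_retweet = "retweeted_status" in tweet
    media = tweet.get("entities", {}).get("media", [])
    has_image = any(m.get("type") == "photo" for m in media)
    return is_english and not is_retweet and has_image


# Hypothetical records: only the first one satisfies all three criteria.
tweets = [
    {"lang": "en", "entities": {"media": [{"type": "photo"}]}},
    {"lang": "en", "retweeted_status": {}, "entities": {"media": [{"type": "photo"}]}},
    {"lang": "hi", "entities": {"media": [{"type": "photo"}]}},
    {"lang": "en", "entities": {}},
]
kept = [t for t in tweets if keep_tweet(t)]
```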


Dataset creation (contd.)

Identify NE’s

Used 3 state-of-the-art* Named Entity Recognizers (NER) for tweets “as is”.

Combined all their results to improve recall.

* - Analysis of Named Entity Recognition and Linking for Tweets - Information Processing & Management 51 (2), 32-49, 2014
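Combining the recognizers' results to improve recall amounts to taking the union of the entities each tool finds. The sketch below assumes each NER's output has already been converted to a set of (start, end) token spans with an exclusive end. The slides do not say how overlapping or conflicting spans were resolved, so this version simply keeps every span. The function name and the sample spans are hypothetical.

```python
def combine_ner_outputs(*span_sets):
    """Union of entity spans found by any recognizer (favours recall over precision)."""
    combined = set()
    for spans in span_sets:
        combined |= set(spans)
    return sorted(combined)


# Hypothetical tweet tokens: ["Sangakkara", "meets", "Sachin", "Tendulkar"]
ner_a_spans = {(0, 1)}            # found only "Sangakkara"
ner_b_spans = {(2, 4)}            # found only "Sachin Tendulkar"
ner_c_spans = {(0, 1), (2, 4)}    # found both
all_spans = combine_ner_outputs(ner_a_spans, ner_b_spans, ner_c_spans)
```

An entity missed by one recognizer is still kept as long as any other recognizer reports it, which is what raises recall.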


Dataset creation (contd.)

Identify NE’s

NER’s used:

University of Washington – Alan Ritter (https://github.com/aritter/twitter_nlp)

Carnegie Mellon University – ArkTweet (http://www.ark.cs.cmu.edu/TweetNLP/)

Stanford (http://nlp.stanford.edu/software/CRF-NER.shtml)


Dataset creation (contd.)

Annotate SNE’s

Created a web application using GWT* to do the manual annotation with ease.

Removed tweets with criteria (manually):
Pointless
Sarcasm
Duplicate (same text but not a re-tweet)
Images with non-English text.
Advertisements.

* - http://www.gwtproject.org/


Dataset creation (contd.)

Manual annotation interface

Inter-annotator agreement

Compute the inter-annotator agreement score for the annotated dataset.

Created a new dataset for three different domains using keywords - AppleWatch, SAvsNZ and NationalAwards.

Randomly sampled 20 tweets from each domain (60 in total).

Asked 3 annotators to annotate the 60-tweet corpus.

Inter-annotator agreement (contd.)

Measure / Domain        All    Apple Watch   SA vs NZ   National Awards
Agreement Percentage    0.78   0.73          0.85       0.75
Cohen Kappa*            0.68   0.62          0.57       0.76
Fleiss Kappa*           0.60   0.75          0.40       0.65

* - http://en.wikipedia.org/wiki/Cohen%27s_kappa and http://en.wikipedia.org/wiki/Fleiss'_kappa
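For reference, both kappa measures can be computed from the standard definitions as sketched below. Cohen's kappa is defined for two annotators; the slides do not say how it was applied to three annotators (averaging over annotator pairs is one common choice, but that is only an assumption). The label sequences and rating counts are made-up illustrations, not the project's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = number of annotators giving item i category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_item) / n_items
    totals = [sum(col) for col in zip(*counts)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical token-level annotations from two annotators.
a1 = ["SNE", "SNE", "O", "O", "SNE", "O", "O", "O"]
a2 = ["SNE", "O", "O", "O", "SNE", "O", "SNE", "O"]
agreement = sum(x == y for x, y in zip(a1, a2)) / len(a1)
kappa_two = cohen_kappa(a1, a2)

# Hypothetical counts for 4 items, 3 annotators, categories [SNE, O].
kappa_three = fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]])
```

Both functions divide by zero when chance agreement is exactly 1 (every annotator always uses the same single label), so that case needs separate handling in real use.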

Status

Problem Definition
Approach
Dataset Creation
Inter-annotator Agreement
SNE Prediction algorithm
SNE Prediction performance
Future directions
Conclusion

Sequence Labeling Algorithm

Variant of the classification problem.

Given a sequence (in NLP, words), assign an appropriate label to each word.

For a token, the target label is also dependent on the features of adjacent tokens.

Example, partial parsing (aka chunking):

The    cat    sat    on     the    mat
B-NP   I-NP   B-VP   B-PP   B-NP   I-NP

SNE Prediction algorithm

Modeled the SNE identification problem as a sequence learning problem.

Used Conditional Random Fields (CRF)* algorithm.

Used Alan Ritter’s Twitter NLP toolkit* to tokenize the tweets so as to extract features.

* - http://python-crfsuite.readthedocs.org/en/latest/ and https://github.com/aritter/twitter_nlp

SNE Prediction algorithm

Linguistic Features (from Alan Ritter)
Word – Actual token
POS tag – NNP, VBP, PRP, ...
Chunk POS tag – B-NP, I-NP, B-VP, I-VP, …
Entity tag – B-ENTITY, I-ENTITY and O

Target Label – B-SNE, I-SNE and O-SNE

Example

Word        POS Tag   Chunk POS Tag   Entity tag
Sachin      NNP       B-NP            B-ENTITY
Tendulkar   NNP       I-NP            I-ENTITY
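Once every token carries a B-SNE, I-SNE or O-SNE label, the salient entities are read off by grouping each B-SNE with the I-SNE labels that follow it. A minimal sketch of that decoding step, with a hypothetical helper name and a made-up token sequence:

```python
def decode_sne_spans(tokens, labels):
    """Group B-SNE/I-SNE labels into salient-entity phrases."""
    entities, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B-SNE":
            # A new entity starts; close the previous one if still open.
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif label == "I-SNE" and current:
            current.append(token)
        else:
            # O-SNE (or an I-SNE with no preceding B-SNE) ends any open entity.
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


tokens = ["Sachin", "Tendulkar", "praises", "Sangakkara"]
labels = ["B-SNE", "I-SNE", "O-SNE", "B-SNE"]
snes = decode_sne_spans(tokens, labels)
```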

SNE Prediction algorithm

CRF Features

Type   Feature                                                           Weight
Word   Lower : Change the case of word to lower case.                    3
Word   Upper : Change the case of word to upper case.                    1
Word   isTitle : Python’s built-in str.istitle method.                   1
Word   isUpper : Is the word in upper case.                              2
Word   isFirstCharHash : True if first character is ‘#’                  3
Word   isFirstCharHashOrAt : True if first character is ‘#’ or ‘@’       4
Word   isFirstCharCaps : True if first character is in uppercase.        3

SNE Prediction algorithm

CRF Features

Type     Feature                                                            Weight
POS      Postag : POS tag returned by Ritter                                4
POS      isStartsWithNN : True if pos tag starts with ‘NN’                  2
POS      isStartsWithNNorPR : True if pos tag starts with ‘NN’ or ‘PR’      1
Chunk    Chunk : Chunk POS tag returned by Ritter                           1
Chunk    isChunkNP : True if chunk pos tag is ‘B-NP’ or ‘I-NP’              3
Entity   Entity : Entity tag returned by Ritter                             4
Entity   isEntity : True if entity is B-ENTITY or I-ENTITY                  1
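The word, POS, chunk and entity features can be sketched as one function that turns a token into a {feature name: weight} dict, using the weights listed in the two feature tables. python-crfsuite accepts per-token dicts that map feature names to float weights, but whether the weights were passed to the CRF this way is an assumption, as are the function name and the feature-name strings.

```python
def token_features(word, pos, chunk, entity):
    """Build a {feature_name: weight} dict for one token."""
    # String-valued features: the value is folded into the feature name.
    features = {
        "lower=" + word.lower(): 3.0,
        "upper=" + word.upper(): 1.0,
        "postag=" + pos: 4.0,
        "chunk=" + chunk: 1.0,
        "entity=" + entity: 4.0,
    }
    # Boolean features: (name, condition, weight); emitted only when true.
    flags = [
        ("isTitle", word.istitle(), 1.0),
        ("isUpper", word.isupper(), 2.0),
        ("isFirstCharHash", word.startswith("#"), 3.0),
        ("isFirstCharHashOrAt", word.startswith(("#", "@")), 4.0),
        ("isFirstCharCaps", word[:1].isupper(), 3.0),
        ("isStartsWithNN", pos.startswith("NN"), 2.0),
        ("isStartsWithNNorPR", pos.startswith(("NN", "PR")), 1.0),
        ("isChunkNP", chunk in ("B-NP", "I-NP"), 3.0),
        ("isEntity", entity in ("B-ENTITY", "I-ENTITY"), 1.0),
    ]
    for name, is_true, weight in flags:
        if is_true:
            features[name] = weight
    return features


# Token taken from the example row: Sachin NNP B-NP B-ENTITY
features = token_features("Sachin", "NNP", "B-NP", "B-ENTITY")
```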

SNE Prediction algorithm

CRF Features - Word2Vec*

Algorithm that captures the context of words in the form of a vector.

Trained using all the tweets related to World Cup 2015 (about 250,000 in size).

Example vector for the word ‘Sangakkara’: [-0.014, -0.135, … , -0.068]

* - http://deeplearning4j.org/word2vec.html

SNE Prediction performance

Dataset size = 1100
5-fold cross validation
Window size = 5

          Precision   Recall   F1-Score
B-SNE     0.67        0.48     0.56
I-SNE     0.57        0.35     0.43
O-SNE     0.93        0.97     0.95
Overall   0.90        0.91     0.90
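The per-label F1 scores follow from the reported precision and recall as their harmonic mean, F1 = 2PR / (P + R). The short check below recomputes them from the table above. The Overall row is left out because it is presumably an average over labels rather than the F1 of the overall precision and recall; the slides do not say which.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per label, copied from the table.
reported = {
    "B-SNE": (0.67, 0.48, 0.56),
    "I-SNE": (0.57, 0.35, 0.43),
    "O-SNE": (0.93, 0.97, 0.95),
}
recomputed = {
    label: round(f1_score(p, r), 2) for label, (p, r, _) in reported.items()
}
```

The recomputed values match the reported ones. The gap between B-SNE/I-SNE and O-SNE shows that the high overall score is driven mostly by the frequent O-SNE label.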

Future directions

Improve CRF algorithm with better features.

Implement other classifiers (HMM, neural networks, …).

Named Entity Linking

Map the SNE to a Knowledge base (KB) entry. For example, an SNE like ‘Kumar Sangakkara’ must be mapped to http://en.wikipedia.org/wiki/Kumar_Sangakkara

Conclusion

SNE identification problem is a new problem, with many potential applications (especially social network analysis).

Thank you

Resources

Code – https://github.com/ganeshaspiring/ire-seimp

PPT – To be uploaded to SlideShare.
