Meltwater Budapest, April 2016 The importance of entities Babak Rasolzadeh, Director of Data Science Research

Babak Rasolzadeh: The importance of entities


Meltwater Budapest, April 2016

The importance of entities

Babak Rasolzadeh, Director of Data Science Research


1. Company background
2. Data Science @ Meltwater
3. Challenges with NLP at large scale
4. Entities, entities, entities
   a. Social NER
   b. ELS
   c. Knowledge Graph


What is Meltwater?

● A business intelligence company → providing insights from data outside the firewall (news, blogs, social media, etc.)
● Born in Oslo in 2001.
● Founder and CEO: Jørn Lyseggen
● www.meltwater.com
● 30K+ clients all over the world.
● 1000+ employees
● 60+ offices around the world, mostly sales.
● Tech offices: USA, Germany, Sweden, Hungary, India.


Why?

● own brand
● competitors
● leads
● partners
● product reviews
● own industry


What?

● Uses Meltwater to find out about new instances of vandalism and break-ins. Often, the victim is in need of services.
● Uses Meltwater to help determine how public perception of certain ingredient chemicals will influence adoption & sales.
● Uses Meltwater to be alerted when certain patents will expire in target markets.
● Uses Meltwater to monitor the performance and popularity of news anchors and programs.
● Uses Meltwater social listening to estimate and prevent infrastructure attacks.


How?


NLP & Data Science at Meltwater

[Architecture diagram; labels: Unstructured Document Stream, Pipeline, Enrichments, Search/Storage, Enriched Documents, High Performance Indexes, Processing Services, API Layer, Apps, Backup Storage, Raw Documents]

15 supported languages in pipeline (EN, DE, SV, NO, FI, ZH, JP, FR, ES, DA, NL, PT, AR, IT, HI)

Typical enrichments
○ Sentiment analysis
○ Thematic analysis
○ Categorization
○ Keyphrase extraction
○ Named Entity Recognition
○ Named Entity Disambiguation


What other than NLP?

● Recommendation engines
  [Figure: documents (DOC3, DOC8, ...) feeding a realtime recommender engine]
● Correlation and predictive pattern recognition
● Word2vec techniques
  [Figure: concept 1, concept 2, concept 3]

"British American Tobacco" or "British American Tobbaco" or (BAT near tobacco) or "英美煙草" or (("Lucky Strike" or "Dunhill" or "Pall Mall") near/15 cigarette*)


Machine Learning Terminology


Challenges with Data Science (NLP) at scale

• High DPS (~2000) and a lot to do (tokenization, lemmatization, stemming, POS tagging, categorization, sentiment, NER, ...), with race conditions!
  [Figure: pipeline enrichments per language (SV, EN, DE): POS, NER]
• Training data labelling is costly! x15
• Contextual information is expensive (computationally).
• Noise, missing data, variation (e.g. slang), data types, ...


Knowledge Base Strategy

Entities, entities, entities


What are Named Entities (NE)?

● Non-linguistic definition
  ○ Referable entities
  ○ Usually proper names
  ○ Single or multi-word

→ I know this man. He might be Charles.
→ He lives in Stockholm. He is Swedish.


What is Named Entity Recognition (NER)?

1. Extracting NEs from a text.
2. Categorizing NEs into a set of predefined categories.

John lives in Stockholm. He works at Ericsson.

Categories of {PER, LOC, ORG, MISC, PROD}
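To make the two steps concrete, here is a minimal sketch of what an NER system would output for the example sentence, written in the common BIO tagging scheme. The entity spans are hand-labelled for illustration (not model output), and the helper name `to_bio` is hypothetical.

```python
# Hand-labelled entity spans for the example sentence (an assumption, not model output).
TOKENS = ["John", "lives", "in", "Stockholm", ".", "He", "works", "at", "Ericsson", "."]
# Each span is (start_token, end_token_exclusive, category).
ENTITIES = [(0, 1, "PER"), (3, 4, "LOC"), (8, 9, "ORG")]


def to_bio(tokens, entities):
    """Convert entity spans into per-token BIO tags (Begin, Inside, Outside)."""
    tags = ["O"] * len(tokens)
    for start, end, category in entities:
        tags[start] = "B-" + category
        for i in range(start + 1, end):
            tags[i] = "I-" + category
    return tags


BIO_TAGS = to_bio(TOKENS, ENTITIES)
for token, tag in zip(TOKENS, BIO_TAGS):
    print(f"{token}\t{tag}")
```

Step 1 (extraction) corresponds to finding the spans, step 2 (categorization) to choosing the label attached to each span.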


What is NER not?

● NER is not event recognition.
● NER recognises entities in text, and classifies them in some way, but it does not create templates, nor does it perform co-reference or entity linking.
● NER is not just matching text strings with pre-defined lists of names. It only recognises entities which are being used as entities in a given context (i.e. not easy!).


Why NER?

● Key part of an Information Extraction system
● Robust handling of proper names is essential for many applications
● Pre-processing for different classification levels
● Information filtering
● Information linking
● Entity level sentiment
● Knowledge graph



Why NER?

Pepsi spooks Coke with this Halloween themed ad.

Entity-specific sentiment analysis, a.k.a. ELS


So what about Social…?


How to do NER? (state-of-the-art)

Supervised Learning

❏ Hidden Markov Model (HMM): Freitag and McCallum, 1999; Leek, 1997.
❏ Conditional Markov Model (CMM): Borthwick, 1999; McCallum et al., 2000.
✓ Conditional Random Field (CRF): Lafferty et al., 2001; Ratinov and Roth, 2009.


Training data

● Ground truth data collection for NER is very expensive
● Solutions:
  ○ Automatic NER annotation using Wikipedia data
  ○ Applying Latent Dirichlet Allocation (LDA) based NER detection using gazetteer data.


NER pipeline


Gazetteers help

Extensive lists of names for a specific category
● PER
  ○ First names (male-female) and surnames, their frequency
● LOC
  ○ Cities, countries
  ○ Population
● ORG
  ○ Names of companies from Yellow Pages.

Disadvantages
○ Difficult to create and maintain (or expensive if commercial)
○ Usefulness varies depending on category
○ Ambiguity
○ Words occur in several lists of different types (PER, LOC, FAC, ...)
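A minimal sketch of a gazetteer lookup used as a feature, showing the ambiguity problem from the list above. The tiny name lists and the function name `gazetteer_categories` are illustrative assumptions, not real gazetteer data.

```python
# Toy gazetteers (assumed for illustration); real lists would hold many thousands of names.
GAZETTEERS = {
    "PER": {"washington", "john", "paris"},
    "LOC": {"washington", "stockholm", "paris"},
    "ORG": {"ericsson"},
}


def gazetteer_categories(token):
    """Return every category whose list contains the token (case-insensitive)."""
    key = token.lower()
    return sorted(cat for cat, names in GAZETTEERS.items() if key in names)


for word in ["Ericsson", "Washington", "lives"]:
    print(word, gazetteer_categories(word))
```

A token that appears in more than one list cannot be categorized by lookup alone, which is why the lookup result is better treated as one feature among several than as the final decision.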


Brown clustering - motivation

Let's say we want to estimate the likelihood of the bi-gram "to Shanghai", without having seen this in a training set.

The system can obtain a good estimate if it can cluster "Shanghai" with other city names (like "London", "Beijing"), then make its estimate based on the likelihood of phrases such as "to London", "to Beijing" and "to Denver".
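The motivation can be sketched with a class-based bigram estimate, P(w2 | w1) ≈ P(c2 | c1) · P(w2 | c2). In this sketch the toy corpus is made up and the CITY class is assigned by hand; Brown clustering would learn such classes from raw text instead.

```python
from collections import Counter

# Toy corpus (assumed); "to Shanghai" never occurs in it.
CORPUS = ("we fly to London . they fly to Beijing . "
          "she drove to Denver . he lives in Shanghai .").split()
# Hand-assigned word class (assumption); a clustering step would normally produce this.
WORD_CLASS = {"London": "CITY", "Beijing": "CITY", "Denver": "CITY", "Shanghai": "CITY"}


def cls(word):
    """Words without an assigned class form their own singleton class."""
    return WORD_CLASS.get(word, word)


unigrams = Counter(CORPUS)
bigrams = Counter(zip(CORPUS, CORPUS[1:]))
class_seq = [cls(w) for w in CORPUS]
class_unigrams = Counter(class_seq)
class_bigrams = Counter(zip(class_seq, class_seq[1:]))


def word_bigram_prob(w1, w2):
    """Plain maximum-likelihood bigram estimate: zero for unseen bigrams."""
    return bigrams[(w1, w2)] / unigrams[w1]


def class_bigram_prob(w1, w2):
    """Class-based estimate: P(class2 | class1) * P(w2 | class2)."""
    c1, c2 = cls(w1), cls(w2)
    p_class = class_bigrams[(c1, c2)] / class_unigrams[c1]
    p_word_given_class = unigrams[w2] / class_unigrams[c2]
    return p_class * p_word_given_class


print(word_bigram_prob("to", "Shanghai"))   # unseen bigram
print(class_bigram_prob("to", "Shanghai"))  # borrows evidence from other cities
```

The word-level estimate gives the unseen bigram no probability at all, while the class-level estimate shares the evidence from "to London", "to Beijing" and "to Denver".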


Brown clustering (1)

● Proposed by Brown et al. (1992) (a.k.a. “IBM clustering”)
● Hierarchical class-based labeling method.
● Bottom-up
● Unsupervised learning
  ○ Doesn't need labeled data, but rather a large set of raw text.
● Greedy technique to maximize bigram mutual information (MI).
● Merge words by contextual similarity.


Brown clustering (2)

● Large amount of data
  ○ Similar words appear in similar contexts.
  ○ More precisely: similar words have similar distributions of words to their immediate left and right.
● Example: “the” and “a” are both determiners.
  ○ Frequency of immediate words on their left and right:
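The kind of left/right neighbour counts referred to here could be gathered as in the sketch below. The toy text is made up, and cosine similarity is just one convenient way to compare the distributions; Brown clustering itself merges clusters by mutual information, not cosine.

```python
import math
from collections import Counter, defaultdict

# Toy text (assumed for illustration).
TEXT = "the dog saw a cat . a dog chased the cat . the bird saw a dog ."
tokens = TEXT.split()

left = defaultdict(Counter)   # left[w][v]: how often v appears immediately before w
right = defaultdict(Counter)  # right[w][v]: how often v appears immediately after w
for a, b in zip(tokens, tokens[1:]):
    right[a][b] += 1
    left[b][a] += 1


def cosine(c1, c2):
    """Cosine similarity between two count distributions."""
    dot = sum(c1[k] * c2[k] for k in c1)
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0


print("right of 'the':", dict(right["the"]))
print("right of 'a':  ", dict(right["a"]))
print("the vs a:  ", round(cosine(right["the"], right["a"]), 3))
print("the vs saw:", round(cosine(right["the"], right["saw"]), 3))
```

Even on this tiny text, the two determiners share right-hand neighbours (nouns), whereas a determiner and a verb do not.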


Brown clustering (3)


Hmm...easy?

● What are the challenges in real applications?
● What about moving to other languages?
● What about moving to the social domain?


Disambiguation

What is the entity category of “Washington”?


Different languages

● Tokenization
  ○ Chinese & Japanese: words not separated
● Part of speech
  ○ Nouns
    ■ English: only number inflection
    ■ German: number, gender and case inflection
  ○ Verbs
    ■ English: regular verbs have 4 forms, irregular verbs up to 8 distinct forms
    ■ Finnish: more than 10,000 forms
● NER: Shape feature
  ○ English: only proper nouns capitalized
  ○ German: all nouns capitalized
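A minimal sketch of a word-shape feature, illustrating why capitalization is a weaker cue in German than in English. The function name `word_shape` and the shape alphabet (X, x, d) are assumptions; NER toolkits use various shape encodings.

```python
def word_shape(token):
    """Map a token to a collapsed shape: X = uppercase, x = lowercase, d = digit."""
    shape = []
    for ch in token:
        if ch.isupper():
            symbol = "X"
        elif ch.islower():
            symbol = "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch  # keep punctuation as-is
        if not shape or shape[-1] != symbol:  # collapse runs of the same symbol
            shape.append(symbol)
    return "".join(shape)


ENGLISH = "The house is in Berlin".split()
GERMAN = "Das Haus ist in Berlin".split()
print([word_shape(t) for t in ENGLISH])
print([word_shape(t) for t in GERMAN])
```

In the English sentence only the sentence-initial word and the proper noun get the shape `Xx`; in the German sentence the common noun "Haus" gets it too, so the shape feature alone no longer separates names from ordinary nouns.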


Different languages

Studying the linguistic properties of a language is important!


Editorial vs. Social


Challenges in Social NER

● The performance of “off-the-shelf” NER methods degrades severely when applied to Twitter data
● Tweets
  ○ are short: 140 character limit.
  ○ cover a wide range of topics.
  ○ are written in grammatically broken language.
  ○ are written fast and posted from anywhere: a lot of misspellings.

→ a solution which considers the social characteristics of text


Challenges in Social NER

Examples of noisy data
● Jaguar's gonna like this episode of #MadMen even less than last week's, I bet.
● Dane Bowers is in Asda I cant believe.it luckiest girl in the world omf i cant believe it omg
● A feel good story RT @DailyBreezeNews: Santa Claus arrives by helicopter at LAX to greet local school


Solution (1)

Adapting existing features to social properties
(The POS tagger of editorial NER performs really poorly when it comes to social documents.)


Solution (2)

Weight (importance) of each CRF feature


Results

● Training data
  ○ ~76K tweets labeled by a human annotator.
● Inter-annotator agreement of two annotators.
● Test data
  ○ ~9.1K tweets labeled by a human annotator.
● Improvement compared to the state-of-the-art method:
  Ritter, A. et al. Named entity recognition in tweets: An experimental study. EMNLP ’11, pages 1524–1534.


What about sentiment…?


Document Level Sentiment - how it works

Inter-annotator agreement ~80%*

* http://bit.ly/human-sentiment
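As a rough illustration of how an agreement figure like ~80% can be computed, the sketch below measures raw agreement between two annotators, plus Cohen's kappa as a chance-corrected variant. The labels are invented, and the slide does not say which agreement measure its figure uses.

```python
from collections import Counter

# Invented sentiment labels from two annotators on the same ten documents.
ANNOTATOR_A = ["pos", "pos", "pos", "neg", "neg", "neg", "neu", "neu", "neu", "neu"]
ANNOTATOR_B = ["pos", "pos", "neu", "neg", "neg", "neg", "neu", "neu", "neu", "pos"]


def percent_agreement(a, b):
    """Fraction of items on which both annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


print(percent_agreement(ANNOTATOR_A, ANNOTATOR_B))
print(round(cohens_kappa(ANNOTATOR_A, ANNOTATOR_B), 3))
```

Whatever the measure, human agreement sets a practical ceiling: a classifier scored against one annotator's labels can hardly be expected to agree with them more often than another human would.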


Document Level Sentiment - how it works

Machine Learning Magic

Supervised learning
● Naive Bayes - BernoulliNB, GaussianNB, MultinomialNB
● Support Vector Machines - LinearSVM, RbfSVM
● Maximum Entropy Model - GIS, IIS, MEGAM, TADM
● MLP - RecurrentNN
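To give a feel for the supervised approach, here is a minimal multinomial Naive Bayes document classifier in plain Python. It is a sketch of one of the model families listed above trained on four invented documents, not the production model described in the talk.

```python
import math
from collections import Counter, defaultdict

# Invented training documents with document-level sentiment labels.
TRAIN = [
    ("great product love it", "positive"),
    ("excellent service great staff", "positive"),
    ("terrible product hate it", "negative"),
    ("awful service terrible staff", "negative"),
]


def train_multinomial_nb(examples):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab


def predict(text, model):
    """Return the class with the highest log-posterior for the document."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)  # log prior
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


MODEL = train_multinomial_nb(TRAIN)
print(predict("great staff love", MODEL))
print(predict("terrible awful product", MODEL))
```

Note that the classifier assigns one label to the whole document, regardless of which company or person the words refer to.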



Document Level Sentiment - current status

~60-70% (depending on language)

Not too terrible, considering that human performance is at best ~80%...

...but why is it so hard?


Document Level Sentiment - how it’s used



Document Level Sentiment - the problem


Negative

Neutral


“Those numbers underline a growing gap between McDonald's and today's fast-food customers. It will only get wider with another year's worth of the same uninspired fare that has made McDonald's customers easy pickings for Panera Bread, Chick-fil-A, Chipotle Mexican Grill and others.”

Negative

Positive

Does not make sense for our industry!


Entity Level Sentiment (ELS)


Entity Level Sentiment - motivation

● DLS is imprecise and wrong for our customers
● Entities are of main importance for our customers
● We already have NER (Named Entity Recognition) technology

Idea:

Identify the sentiment towards each particular entity in a text!
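The idea can be sketched with a deliberately crude heuristic: look at the words in a small window around each entity mention and count lexicon hits. The lexicon, the sentence, the entity positions (which an NER step would normally supply) and the function name `entity_sentiment` are all assumptions; this is not the method used in the talk.

```python
# Tiny hand-made sentiment lexicon (assumption).
POSITIVE = {"impressed", "great", "strong"}
NEGATIVE = {"disappointed", "weak", "recall"}


def entity_sentiment(tokens, entity_positions, window=2):
    """Label each entity by the lexicon words found within `window` tokens of it."""
    result = {}
    for entity, idx in entity_positions.items():
        lo = max(0, idx - window)
        hi = min(len(tokens), idx + window + 1)
        context = [t.lower() for t in tokens[lo:hi]]
        score = sum(t in POSITIVE for t in context) - sum(t in NEGATIVE for t in context)
        if score > 0:
            result[entity] = "Positive"
        elif score < 0:
            result[entity] = "Negative"
        else:
            result[entity] = "Neutral"
    return result


TOKENS = ("BMW impressed reviewers while Toyota disappointed buyers , "
          "and Mercedes launched a sedan").split()
# Token index of each entity mention; assumed to come from an NER step.
ENTITY_POSITIONS = {"BMW": 0, "Toyota": 4, "Mercedes": 9}

print(entity_sentiment(TOKENS, ENTITY_POSITIONS))
```

A single document-level label would have to average these three opinions away; scoring per entity keeps them apart.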


Entity Level Sentiment - how it works

NER

BMW: Positive
Mercedes: Neutral
Toyota: Negative
…


Entity1: Positive
Entity2: Neutral
Entity3: Negative
…

[Figure: three panels, each labeled E1: Positive, E2: Neutral, E3: Negative]



Entity Level Sentiment - use case


Entity Level Sentiment - current status

● ELS is considered a very tough problem in NLP/ML
● The accuracy of state-of-the-art ELS is currently very low (~45%)


The holy grail: The Graph Knowledge Base


Entities + Relationships

As the types of entities and their relationships grow, so does the capacity to infer insights that depend on connectivity, and eventually one can answer questions that would otherwise not be possible with only separate datasets!


KB Architecture

[Architecture diagram; labels: Unstructured Document Stream, Pipeline, Enrichments, Graph Search, Enriched Documents, High Performance Indexes, Processing Services, API Layer, Knowledge Base (Graph), I/O, External Data Providers, Updates/subscriptions, Lookups, Apps, Backup Storage, Raw Documents]


Why is it hard?


Composing the KB


Data Acquisition trade-offs

[Diagram: trade-off triangle with corners High volume, High quality, Cheap. Options shown: Manual data acquisition; Special crawlers, smart algorithms; Acquisitions, partnerships. Drawbacks shown: low quality; expensive; low volume]


Composing the KB - Scalability


Scalability Requirements - next steps

Companies ~ 100 million worldwide
People ~ 500 million (including media influencers)
Products ~ 500 million

~1 billion entities, plus all the connections between them

→ billions of nodes, trillions of edges!
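A back-of-the-envelope check of these numbers. The entity counts are taken from the slide; the target of one trillion edges and the undirected-graph formula (edges = nodes × average degree / 2) are assumptions used only to get a sense of scale.

```python
# Entity counts from the slide.
ENTITY_COUNTS = {
    "companies": 100_000_000,
    "people": 500_000_000,
    "products": 500_000_000,
}
total_entities = sum(ENTITY_COUNTS.values())

# Assumed target: one trillion edges in an undirected graph,
# where edges = nodes * average_degree / 2.
TARGET_EDGES = 1_000_000_000_000
avg_degree_needed = 2 * TARGET_EDGES / total_entities

print(f"total entities: {total_entities:,}")
print(f"average connections per entity for 1e12 edges: {avg_degree_needed:.0f}")
```

Under these assumptions, each entity would need on the order of a couple of thousand connections on average for the graph to reach a trillion edges.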


Composing the KB - New features


Improve entity search - company NED


Improve entity search - person NED

Robert Gates: 22nd Secretary of Defense

William Henry Gates III: former CEO & cofounder of Microsoft

“Who is Mr. Gates?”
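One simple way to approach this question is to compare the words around the mention with keywords stored for each candidate in the knowledge base. The sketch below does that with a plain word-overlap score; the keyword sets and the function name `disambiguate` are illustrative assumptions, not the approach presented in the talk.

```python
# Candidate entities with assumed descriptive keywords (illustrative only).
CANDIDATES = {
    "Robert Gates": {"secretary", "defense", "pentagon", "military"},
    "William Henry Gates III": {"microsoft", "ceo", "cofounder", "software"},
}


def disambiguate(context, candidates=CANDIDATES):
    """Pick the candidate whose keywords overlap most with the context words."""
    words = set(context.lower().split())
    scores = {name: len(words & keywords) for name, keywords in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None: context gives no evidence


print(disambiguate("Mr. Gates spoke at the Pentagon about defense spending"))
print(disambiguate("Mr. Gates founded Microsoft"))
print(disambiguate("Who is Mr. Gates?"))
```

Without any context the question cannot be answered, which is the point of the slide: the surrounding text (or the graph around each candidate) has to supply the evidence.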


Emerging competition


Map influencer network

influencer score ~ e.g. PageRank
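A minimal sketch of a PageRank-style influencer score, computed by power iteration on a tiny made-up "follows" graph. The graph, the damping factor of 0.85 and the iteration count are assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; graph maps each node to the nodes it points to.

    Every target must also appear as a key of the graph.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Made-up graph: an edge A -> B means "A follows B", so score flows to B.
FOLLOWS = {
    "alice": ["carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["carol", "alice"],
}
SCORES = pagerank(FOLLOWS)
for name, score in sorted(SCORES.items(), key=lambda item: -item[1]):
    print(f"{name}\t{score:.3f}")
```

The score rewards not just having many followers but being followed by accounts that are themselves influential.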


Suggested read

● Ratinov 2009 (challenges in NER): paper.
● ArkCMU (social): paper, code.
● Ritter et al. (social): paper, code.
● Stanford NLP NER (editorial): paper, code.
● Brown clustering
  ○ Brown clustering: video
  ○ Word Representations and N-grams: video
● Transforming Wikipedia into Named Entity Training Data: paper.