Transcript
Page 1: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Perfect Text Analytics

Seth Redmore

VP, Product Management

Page 2: Lexalytics Text Analytics Workshop: Perfect Text Analytics

All rights reserved © 2010 Lexalytics Inc.

Perfect

per·fect

    [adj., n. pur-fikt; v. per-fekt]

1. conforming absolutely to the description or definition of an ideal type: a perfect sphere; a perfect gentleman.

2. excellent or complete beyond practical or theoretical improvement: There is no perfect legal code. The proportions of this temple are almost perfect.

Page 3: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Text Analytics

The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources. (Wikipedia)

In other words, enhancing the value of text content by extracting entities, features, context, relationships and emotion.

Page 4: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Perfect is Fast

Average human reading speed: 250 wpm

Conservative computer reading speed: 6000 wpm/core (our speed on a moderate single core)

Each core is equivalent to the reading bandwidth of 12 people.

Modern machines have 8 cores. That’s just about 100 people in a box. Nice.

Page 5: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Perfect is Useable

“I don’t like the results” is not the same as “the results are incorrect”

Understanding the behavior is key to usefulness

Can you make better decisions?

Can you make more money or save money?

What is the most controversial area of text analytics?

Thomson Reuters trading w/Sentiment Analysis increased Alpha (profit over market) by 80 basis points

Page 6: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Useable: How much can you differ?

“In my shop, that up until now has relied exclusively on human coding, we consider anything below 90% to be unacceptably inaccurate…. There is no doubt that automated sentiment is getting much much better, but to suggest that people should be okay with 20% of their data being wrong is just absurd.” Katie Delahaye Paine

Why is 10% “wrong” so much less absurd than 20% “wrong”?

[Figure: 20% error vs. 10% error]

Page 7: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Perfect is Consistent

Same results for same content, every time

University of Pittsburgh “Multi-Perspective Question Answering” Corpus: 535 documents, 11k+ sentences.

40 hours of training for each rater

~80% inter-rater agreement
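One way to read the ~80% figure is as simple percent agreement: the share of sentences on which two trained raters pick the same label. The sketch below assumes that definition (the slide does not say which agreement measure was used) and runs on made-up labels, not on the corpus itself.

```python
def percent_agreement(rater_a, rater_b):
    """Share of items on which two raters assigned the same label."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must label the same items")
    if not rater_a:
        raise ValueError("need at least one labeled item")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


# Hypothetical labels for ten sentences (illustrative only).
rater_a = ["pos", "neg", "neu", "neu", "pos", "neg", "neu", "pos", "neu", "neg"]
rater_b = ["pos", "neg", "neu", "pos", "pos", "neg", "neu", "neu", "neu", "neg"]

agreement = percent_agreement(rater_a, rater_b)
print(f"{agreement:.0%}")
```

A program run twice over the same content agrees with itself on every item, which is the point of the slide.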

Page 8: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Perfect is (new) Knowledge

Discover the stuff you don’t know

Text Analytics is really, really great at telling you the who, the what, and the where. Sometimes the “how”

You have to supply the “why” – but that question is way easier to answer when you know the other “w’s and the h”

Page 9: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Perfect Includes Everything

Running our top of the line software flat out across one year will cost you about $.002/document analyzed (news article sized content) (assuming 3 docs/core-second, 8 core machine)

The more data the better, and the greater the worth of your TA
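The throughput assumptions on this slide can be multiplied out to see the volume involved. The sketch below uses only the slide's numbers (3 docs/core-second, 8 cores, one year flat out, $.002/document). The annual total is simply what those numbers imply when multiplied together, not a quoted price.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year


def documents_per_year(docs_per_core_second, cores):
    """Documents processed in one year of uninterrupted running."""
    return docs_per_core_second * cores * SECONDS_PER_YEAR


docs = documents_per_year(docs_per_core_second=3, cores=8)

# Implied by the slide's per-document figure; an inference, not a price list.
implied_annual_cost = docs * 0.002

print(f"{docs:,} documents/year")
print(f"${implied_annual_cost:,.0f} implied annual cost")
```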

Page 10: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Perfect is Trainable

Can you solve YOUR business problem with it?

Can you optimize to suit different kinds of content and roll those results up into a single reporting system?

Page 11: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Perfect Text Analytics (that is)

Fast

Useable

Consistent

Knowledge

Inclusive

Trainable

Page 12: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Customer Snapshots (or, “rubber, meet road”)

Page 13: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Reputation Management

Page 14: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Politics

Page 15: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Market Intelligence

[Architecture diagram. Labels: Client Employee; Client Company; Web 2.0 Collaboration; Firewall; crawl, FTP or CD; SinglePoint; Integrated Index; External Content Providers; MI Analyst Text Analytics; Single Sign-on; Trashcan; Internal research; Optional Document Repository; Search Results; NL Search Engine; User Authentication; Custom Web Crawls & Gov. Databases; Secondary Research Suppliers; News & Journals; Financial analyst reports; Content Processing; Internal Document Repository]

Page 16: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Hospitality

Page 17: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Financial Services

Turns news into numbers for automatic trading systems

Company stocks + Commodities

Resilient server product

Ultimate customers are financial institutions

QED (Quantitative and Event-Driven Trading)

Banks, hedge funds.

JPMorgan, SocGen, Alpha Equities…and others

[Diagram labels: Financial data; RNSE Server; Indicators; Algorithmic Trading (QED firm); Buy/Sell]

Page 18: Lexalytics Text Analytics Workshop: Perfect Text Analytics

ROI – Retrieving Organized Information

RTI CONSULTING SERVICES

REPEATABLE EVOLVING DESIGNS

BALANCED METHODOLOGY
Business Assessment
User Interviews
Taxonomy Design and Recommendation
Content Governance / Analysis

DEPLOYMENT / SUPPORT
Solution Alternatives
Integration & Deployment
Testing, Tuning, and Evaluation

THOUGHT LEADERSHIP
Strategy Consultation
Roadmaps – Evolution and Growth

PROF. TED SULLIVAN

Page 19: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Pharma

Page 20: Lexalytics Text Analytics Workshop: Perfect Text Analytics

The Next Year…

Page 21: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Opinion Mining

Who said what about whom?

Clinton: N. Korea must face consequences over sinking

U.S. Secretary of State Hillary Clinton warned Friday that North Korea must face consequences over the alleged sinking of a South Korean warship which has stoked tensions in the divided peninsular.

A South Korean military report published this week claimed that the sinking of the Cheonan was caused by a North Korean torpedo attack.

Pyongyang denies that claim and said Friday that it could back out of a nonaggression pact between the neighbors if Seoul attempted to punish it over the sinking.

North Korea and South Korea have remained officially at war since an armistice in 1953 brought their three-year Cold War conflict to an end.

"I think it's important to send a clear message to North Korea that provocative actions have consequences," Clinton said Friday as she began a week-long Asian tour in Tokyo, Japan.

She said she was consulting with international allies to find the appropriate reaction.

Speaker          Topic                Sentiment
Pyongyang        Seoul                0
                 nonaggression pact   0
Mike Mullen      North Korea          0
                 present situation    0.021728
                 normal state         0
                 South Korea          -0.478279
Hillary Clinton  North Korea          0
                 provocative actions  0
                 Hillary Clinton      0
                 clear message        0.6
                 North Korea          0.6

Page 22: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Sarcasm, Twitter

Model trained to detect sarcasm

Once detected, you can decide what to do with it – because actually determining the sentiment is going to be unreliable

New model trained on Twitter content

Moving towards a concept of text analytics driven by business logic

Page 23: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Thesaurus-based Theme Rollup

Machine generated conceptual taxonomy

Gas/Electric Hybrid and EV might roll up to EV

Fewer themes, but very useful to detect patterns across content
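The rollup idea can be sketched as a lookup from specific themes to a broader concept, followed by a count. This is only an illustration: the hand-written `THESAURUS` dictionary stands in for the machine-generated taxonomy, and apart from the Gas/Electric Hybrid and EV pairing named on the slide, every entry and theme below is hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for the machine-generated taxonomy.
# Only "gas/electric hybrid" -> "EV" and "ev" -> "EV" reflect the slide.
THESAURUS = {
    "gas/electric hybrid": "EV",
    "ev": "EV",
    "electric vehicle": "EV",
}


def roll_up(themes):
    """Count themes after mapping each to its broader concept, if it has one."""
    return Counter(THESAURUS.get(theme.lower(), theme) for theme in themes)


themes = ["Gas/Electric Hybrid", "EV", "battery life", "EV"]
rolled = roll_up(themes)
print(rolled)
```

Three separate theme mentions collapse into a single "EV" count, which is what makes patterns across documents easier to spot.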

Page 24: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Foreign Language Support

French is first, followed by other Romance languages

New stemmer

New summarization algorithm

New part-of-speech tagger

Automatic language detection

New sentiment/entity extraction algorithms

Also applicable to vertical specific content

Confidence scoring by algorithm

Use business logic to meld the results

Page 25: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Trainable Entity Sentiment

New technique for entity sentiment

Initial results from testing in English extremely promising

Average human scoring overlap of >> 90% for scored sentences

Initially used only for French

P(Human | Computer) (precision)

                 Human Tagged
Computer Tagged  Negative  Neutral  Positive  Grand Total
Negative         100.00%   0.00%    0.00%     100.00%
Neutral          0.64%     98.29%   1.07%     100.00%
Positive         0.00%     6.67%    93.33%    100.00%
Grand Total      5.70%     88.2%    6.27%     100.00%
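A table of P(Human | Computer) is a confusion matrix with each computer-tagged row divided by its row total; the diagonal entries are then the per-label precision. The sketch below shows that normalization on hypothetical counts (the slide gives only percentages, so the counts here are invented for illustration).

```python
LABELS = ["Negative", "Neutral", "Positive"]


def row_normalize(counts):
    """Turn counts[computer_label][human_label] into P(human | computer)."""
    table = {}
    for computer_label, row in counts.items():
        total = sum(row.values())
        table[computer_label] = {
            human_label: (row.get(human_label, 0) / total if total else 0.0)
            for human_label in LABELS
        }
    return table


# Hypothetical counts: rows are computer tags, columns are human tags.
counts = {
    "Negative": {"Negative": 20, "Neutral": 0, "Positive": 0},
    "Neutral": {"Negative": 2, "Neutral": 190, "Positive": 8},
    "Positive": {"Negative": 0, "Neutral": 2, "Positive": 28},
}

conditional = row_normalize(counts)
# Precision per label is the diagonal of the row-normalized table.
precision = {label: conditional[label][label] for label in LABELS}

for label in LABELS:
    print(f"{label}: {precision[label]:.2%}")
```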

Page 26: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Tool Enhancements

Entity Management Toolkit
Part of Speech Tagset training
Using to train Salience on French

Sentiment Toolkit
Build your own entity sentiment models:
French (first)
Eventually use on English content:
Twitter
Customer Satisfaction
Others…

New EMT helps us build a new French PoS tagger

New Sentiment Toolkit + Maximum Entropy model builder allows new Entity and Sentiment modules

[Diagram labels: Doc; POS Tagger; Fully Tagged Document; Themes & Summaries; Entity Extraction & Sentiment Models]

Page 27: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Business Logic + TA Algorithms

[Diagram labels: Content; A, B, C, D; Finance$; Sports; Source; Search; Business Logic; Other TA System; Sarcasm; Entity: Cisco; Route On; Unknown?; Probability Scores; Cisco : Positive]

Probability scores shown (four sets, in slide order):

POS 25, NEG 25, NEU 25, MIX 25

POS 60, NEG 10, NEU 20, MIX 10

POS 80, NEG 05, NEU 05, MIX 10

POS 50, NEG 20, NEU 30, MIX 0
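One simple way business logic could meld several sets of probability scores into a single verdict is a weighted average per label. The sketch below is an assumption, not the method shown on the slide: the slide gives four score sets and the outcome "Cisco : Positive" but not the combining rule, and the `meld` function and its `weights` parameter are hypothetical.

```python
# The four score sets from the slide (percentages), in slide order.
SCORE_SETS = [
    {"POS": 25, "NEG": 25, "NEU": 25, "MIX": 25},
    {"POS": 60, "NEG": 10, "NEU": 20, "MIX": 10},
    {"POS": 80, "NEG": 5, "NEU": 5, "MIX": 10},
    {"POS": 50, "NEG": 20, "NEU": 30, "MIX": 0},
]


def meld(score_sets, weights=None):
    """Weighted average of per-label scores; equal weights by default."""
    if weights is None:
        weights = [1.0] * len(score_sets)
    if len(weights) != len(score_sets):
        raise ValueError("need exactly one weight per score set")
    total_weight = sum(weights)
    labels = score_sets[0].keys()
    return {
        label: sum(w * s[label] for w, s in zip(weights, score_sets)) / total_weight
        for label in labels
    }


combined = meld(SCORE_SETS)
verdict = max(combined, key=combined.get)
print(combined)
print("Cisco :", verdict)
```

With equal weights the combined scores favor POS, which matches the slide's result. Real business logic might instead weight each score set by content type or source, or discard a set when sarcasm is detected.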

Page 28: Lexalytics Text Analytics Workshop: Perfect Text Analytics

Summary

Lots of people making money with text analytics

In lots of different verticals

Next 12 months brings online a whole host of features to make our software even more flexible

Check out tas.lexalytics.com

Check out www.lexalytics.com/lexascope