www.decideo.fr/bruley
Natural Language Processing
June 2013
Michel Bruley
Natural Language Processing (NLP)
NLP is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language
NLP is considered a subfield of artificial intelligence and overlaps significantly with computational linguistics. It is concerned with the interactions between computers and human (natural) languages.
– Natural language generation systems convert information from computer databases into readable human language
– Natural language understanding systems convert human language into representations that are easier for computer programs to manipulate.
NLP encompasses both text and speech, but work on speech processing has evolved into a separate field
Where does it fit in the CS* taxonomy?
Computers
  Databases
  Artificial Intelligence
    Robotics
    Natural Language Processing
      Information Retrieval
      Machine Translation
      Language Analysis
        Semantics
        Parsing
    Search
  Algorithms
  Networking
* CS = Computer Science
Why Natural Language Processing?
Applications for processing large amounts of texts require NLP expertise
– Classify text into categories, index and search large texts: classify documents by topic, language, or author; spam filtering; information retrieval (relevant, not relevant); sentiment classification (positive, negative)
– Extracting data from text: converting unstructured text into structured data
– Information extraction: discover names of people and events they participate in, from a document, …
– Automatic summarization: condense 1 book into 1 page, …
– Speech processing, artificial voice: get flight information or book a hotel over the phone, …
– Question answering: find answers to natural language questions in a text collection or database
– Spelling & grammar correction
– Plagiarism detection
– Automatic translation
– Etc.
The problem
When people see text, they understand its meaning (by and large)
According to research, it deosn’t mttaer in what oredr the ltteers in a wrod are, the olny iprmoetnt tihng is that the frist and lsat ltteer are in the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by islelf but the wrod as a wlohe.
When computers see text, they get only character strings (and perhaps HTML tags)
We'd like computer agents to see meanings and be able to intelligently process text
These desires have led to many proposals for structured, semantically marked up formats
But often human beings still resolutely make use of text in human languages
This problem isn’t likely to just go away
Example: Natural language understanding
Raw speech signal
• Speech recognition
Sequence of words spoken
• Syntactic analysis using knowledge of the grammar
Structure of the sentence
• Semantic analysis using information about the meaning of words
Partial representation of meaning of sentence
• Pragmatic analysis using information about context
Final representation of meaning of sentence
Natural language understanding process – Prof. Carolina Ruiz
Example detail: Syntactic Analysis
• Syntactic analysis organizes the words and phrases of a sentence into a hierarchical structure so that its constituents can be studied.
• For example, the sentence “The big cat is drinking milk” can be broken up into the following constituents:
Noun Phrase
  Determiner: The
  Adjective Phrase: big
  Noun: cat
Verb Phrase
  Auxiliary: is
  Verb: drinking
  Noun Phrase: milk
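The constituent structure of this sentence can also be written down as a nested data structure. The following is a minimal Python sketch, assuming a simple (label, content) tuple encoding and a "Sentence" root label; both are illustrative choices, not something the slides prescribe.

```python
# Constituency tree for "The big cat is drinking milk".
# Each node is (label, content): content is either a word (leaf)
# or a list of child nodes. The encoding and the "Sentence" root
# label are assumptions made for this illustration.
TREE = (
    "Sentence",
    [
        ("Noun Phrase", [
            ("Determiner", "The"),
            ("Adjective Phrase", "big"),
            ("Noun", "cat"),
        ]),
        ("Verb Phrase", [
            ("Auxiliary", "is"),
            ("Verb", "drinking"),
            ("Noun Phrase", "milk"),
        ]),
    ],
)


def leaves(node):
    """Return the words of the tree in left-to-right order."""
    _label, content = node
    if isinstance(content, str):
        return [content]
    words = []
    for child in content:
        words.extend(leaves(child))
    return words


def show(node, depth=0):
    """Return the tree as indented text lines, one constituent per line."""
    label, content = node
    indent = "  " * depth
    if isinstance(content, str):
        return [f"{indent}{label}: {content}"]
    lines = [f"{indent}{label}"]
    for child in content:
        lines.extend(show(child, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(show(TREE)))
    print(" ".join(leaves(TREE)))
```

Reading the leaves from left to right gives back the original sentence, which is a quick sanity check that the hierarchy covers every word exactly once.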
Why NLP is difficult
Language is flexible
– New words, new meanings
– Different meanings in different contexts
Language is subtle
– He arrived at the lecture
– He chuckled at the lecture
– He chuckled his way through the lecture
– **He arrived his way through the lecture
Language is complex!
Why NLP is difficult
MANY hidden variables
– Knowledge about the world
– Knowledge about the context
– Knowledge about human communication techniques
• Can you tell me the time?
Problem of scale
– Many (infinite?) possible words, meanings, context
Problem of sparsity
– Statistical analysis is very difficult because most things (words, concepts) have never been seen before
Long range correlations
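The sparsity problem can be made concrete by counting how many distinct words in a text occur only once. The sketch below is a minimal illustration, assuming a naive tokenizer (lowercase, letters only) and using two sentences from "The problem" slide as sample text; the function names are hypothetical.

```python
import re
from collections import Counter

# Sample text taken from "The problem" slide of this deck.
TEXT = (
    "When people see text, they understand its meaning (by and large). "
    "When computers see text, they get only character strings "
    "(and perhaps HTML tags)."
)


def word_counts(text):
    """Count lowercase alphabetic tokens (naive tokenizer, an assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def singleton_share(counts):
    """Return the fraction of distinct words that occur exactly once."""
    once = sum(1 for count in counts.values() if count == 1)
    return once / len(counts)


if __name__ == "__main__":
    counts = word_counts(TEXT)
    print(f"{len(counts)} distinct words, "
          f"{singleton_share(counts):.0%} of them seen only once")
```

Even in this tiny sample most distinct words appear a single time, so any statistic estimated per word rests on very little evidence.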
Why NLP is difficult
Key problems:
– Representation of meaning
– Language presupposes knowledge about the world
– Language only reflects the surface of meaning
– Language presupposes communication between people
Patented Natural Language Processing (NLP) “Reads” Every Communication
Each data feed is parsed through one or more of the 7 NLP engines
…it is then deconstructed to provide context, subject, and other information regarding the customer (gender, name etc.)
Finally each identified customer is matched back to the Discovery platform data to gain a full view
Natural language processing (NLP) is the study of the interactions between computers and natural languages (e.g., English, Polish). The crucial challenge that NLP addresses is deriving meaning from human (natural) language input and allowing consumers to analyze parsed meanings in large volumes.
For Example…
I bought an iPad2 for my mom last week. She loves the weight, but doesn’t like the color. She wishes it came in blue. She says if it came in blue, then she’d buy one for all her friends.
– Entities (brands, people, locations, times, products…)
– Events and relationships (purchasing event, my mom…)
– Sentiment (product specifications)
– Suggestions (feature specifications)
– Intent (to purchase, to leave)
– Geo/Temporal
QUESTION: Why is this a big deal?
NLP takes a simple English statement, parses it into the categories above (and more), and VOILA… we get STRUCTURED DATA
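To make "structured data" tangible, here is a toy Python sketch that turns the iPad2 statement into a dictionary keyed by some of the categories above. It is only an illustration: the hand-written regular expressions and category names in `PATTERNS` are hypothetical and tailored to this one statement, and they are not how the NLP engines described in these slides work.

```python
import re

# The example statement, with straight apostrophes for simplicity.
TEXT = (
    "I bought an iPad2 for my mom last week. She loves the weight, "
    "but doesn't like the color. She wishes it came in blue. She says if it "
    "came in blue, then she'd buy one for all her friends."
)

# Hypothetical hand-written patterns, one per category. A real NLP
# engine relies on linguistic analysis instead of hard-coded phrases.
PATTERNS = {
    "entities": r"\b(iPad2|my mom)\b",
    "events": r"\b(bought)\b",
    "temporal": r"\b(last week)\b",
    "positive_sentiment": r"\bloves the (\w+)",
    "negative_sentiment": r"\bdoesn't like the (\w+)",
    "suggestions": r"\bwishes it (came in \w+)",
    "intent": r"\bshe'd (buy)\b",
}


def extract(text):
    """Return, for each category, the list of matched text fragments."""
    return {name: re.findall(pattern, text)
            for name, pattern in PATTERNS.items()}


if __name__ == "__main__":
    for category, values in extract(TEXT).items():
        print(f"{category}: {values}")
```

The resulting dictionary can be stored as a row or a set of rows in a database, which is the sense in which free text becomes structured data.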
Architecture
[Diagram: ASTER DISCOVERY PLATFORM architecture. Components labeled in the diagram:]
– Attensity Pipeline: real-time annotated social media data feed, 150+ million social and online sources
– Other unstructured data: emails, surveys, CRM notes, …
– Pipeline Connector
– ASAS Wrapper (SQL MR)
– NLP
– ETL
– “Now-structured” data
– Customers / Sales / Other data
– Churn Score (SQL MR)
– Predictive
– Visualization (e.g., Tableau, MSTR)
Aster + Attensity = Competitive Advantage
Attensity’s Semantic Annotation Server (ASAS) capabilities
– Entity Extraction: automatic detection and extraction of more than 35 entities such as Name, Place
– Uses Attensity Triples to create context on entities and identify verbs, relationships, actions
– Auto Classification: uses custom classification rules to classify articles by content, sort by relevance, and discover repeated information
– Exhaustive Extraction: application of linguistic principles to extract context, entities, and relationships similar to how the human mind would
– Voice Tags: identify types of statements and auto-classify them (Question, Intent, Conditional)
This integration provides types, subtypes, super types (“Savings”, “Checking”, “Investment”)
Anaphora resolution: connecting a pronoun (“He”, “Him”) to its subject (George Harrison) without repeating the full name
Includes other languages besides English
Creates a unique identifier for each entity for cross reference
Structuring Unstructured Data: Process Flow
The flight was delayed and flight attendant would not give us any new information.
How Triples are Extracted & Structured
Database Record from a Customer Survey
date      region  rec?  source
10-02-06  0006    4     telephone
Why would you recommend/not recommend? The flight was delayed and flight attendant would not give us any new information.
Extract: extract relational facts & triples from the Notes field
Who/What  Behavior  Fact/Triple
flight    delay     flight : delay
Then Fuse: populate a new table with attribute values and fuse it with the structured data
New Table: Customer Reactions (same record with relational facts extracted from the Notes field)
date     region  source     rec?  who-what     Behavior     Fact/Triple
10-2-12  0006    telephone  4     flight       delay        flight : delay
10-2-12  0006    telephone  4     information  give [not]   information : give [not]
1-1-13   0007    e-mail     8     i            happy [not]  i : happy [not]
1-1-13   0007    e-mail     8     rep          rude         rep : rude
1-1-13   0007    e-mail     8     flight       cancel       flight : cancel
Original Structured Data: date, region, source, rec?
Newly Structured Data: who-what, Behavior, Fact/Triple
Provided by Attensity
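The "fuse" step can be sketched in a few lines of Python. This minimal example assumes the triples have already been extracted from the notes field (the extraction itself is not shown) and copies its values from the first two rows of the Customer Reactions table. The class and function names (`SurveyRecord`, `Triple`, `fuse`) are hypothetical and are not part of any Aster or Attensity API.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SurveyRecord:
    """Original structured fields of one survey response."""
    date: str
    region: str
    source: str
    rec: int


@dataclass(frozen=True)
class Triple:
    """One relational fact extracted from the free-text notes field."""
    who_what: str
    behavior: str

    @property
    def fact(self):
        # Rendered as in the slides: "who-what : behavior".
        return f"{self.who_what} : {self.behavior}"


def fuse(record, triples):
    """Return one row per triple, each carrying the record's structured fields."""
    rows = []
    for triple in triples:
        row = asdict(record)
        row.update(who_what=triple.who_what,
                   behavior=triple.behavior,
                   fact=triple.fact)
        rows.append(row)
    return rows


# Values copied from the first two rows of the Customer Reactions table.
RECORD = SurveyRecord(date="10-2-12", region="0006", source="telephone", rec=4)
TRIPLES = [Triple("flight", "delay"), Triple("information", "give [not]")]

if __name__ == "__main__":
    for row in fuse(RECORD, TRIPLES):
        print(row)
```

Each survey record fans out into as many rows as it has extracted facts, which is why the same date, region, source, and rec? values repeat in the table.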