A Primer on Natural Language Processing
Mohammad Taher Pilehvar
TeIAS Summer School on Data Science
26 August 2019
Artificial Intelligence
Design algorithms that make computers behave intelligently
But, what is intelligent behavior?
Image from threatpost.com
Artificial Intelligence
Scenario 1: Vision
To a non-intelligent computer, photos are nothing but sets of colored pixels
Artificial Intelligence
Scenario 1: Vision (face detection/recognition)
Artificial Intelligence
Scenario 1: Vision (autonomous cars)
Artificial Intelligence
Scenario 2: Motion/Manipulation (Robotics)
Artificial Intelligence
Scenario 3: Learning/Planning
Artificial Intelligence
Scenario 4: Natural language!
??!!
What’s the capital of Iran?
Artificial Intelligence
Khatam
01001011 01101000 01100001 01110100 01100001 01101101
K h a t a m
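To a computer, a word such as Khatam is only a sequence of character codes. The following is a minimal sketch of that mapping, assuming plain ASCII encoding; the helper name `to_bits` is illustrative and not part of the slides.

```python
def to_bits(text: str) -> str:
    """Return the 8-bit ASCII code of each character, separated by spaces."""
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))


# Each letter becomes one byte; nothing in the bits says "city" or "university".
print(to_bits("Khatam"))
```

The bit pattern carries no meaning by itself, which is why answering a question about the word requires more than storing it.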
Scenario 4: Natural language!
Artificial Intelligence
Scenario 4: Natural language!
Make computers
understand and
generate natural
language
Natural Language Processing (Computational Linguistics)
NLP ML AI
• Natural Language Understanding
• Natural Language Generation
Difficulties of Language Understanding
Difficulty of Language Understanding
Common sense knowledge
• The trophy would not fit in the brown suitcase because it is too big.
• What is too big?
• The town councilors refused to give the demonstrators a permit because they feared (advocated) violence.
• Who feared (advocated) violence?
Difficulty of Language Understanding
Context
“It is raining outside. This is the reason why I won't go out”.
• What is the reason to not go outside?
• This?
Coreference resolution:
• I did not vote for Donald Trump because I think he is a liar!
Anaphora resolution:
• I bought a new ThinkPad, I have an old MacBook. I am going to give it away!
Difficulty of Language Understanding
Slang, idioms and sarcasm
• Those shoes are goat; She is busted; He is rather a frenemy
• In a nutshell; piece of cake; think outside the box; bad apple; get the picture
• That’s just what I needed today! (When something bad happens)
Difficulty of Language Understanding
Ambiguity
Illustration from IBM Watson
Difficulty of Language Understanding
Amazon fire!
Difficulty of Language Understanding
Metonymic Ambiguity
• London voted to stay in the EU
• The White House admits Trump is lying to manipulate his voters
• The kettle is boiling
• Iran beat Cuba after dropping first two sets
Difficulty of Language Understanding
Syntactic Ambiguity
I heard his cell phone ring in my office
WiC (Word-in-Context) dataset (Pilehvar and Collados, 2019, nominated for IJCAI’s research excellence award)
Label | Target | Context-1 | Context-2
False | bed | There's a lot of trash on the bed of the river | I keep a glass of water next to my bed when I sleep
False | land | The pilot managed to land the airplane safely | The enemy landed several of our aircrafts
True | air | Air pollution | Open a window and let in some air
True | window | The expanded window will give us time to catch the thieves | You have a two-hour window of clear weather to finish working on the lawn
WiC (Word-in-Context) dataset
Team | System | Accuracy
Google | BERT++ | 69.9
Facebook AI | RoBERTa | 69.6
Stanford Hazy Research | Snorkel | 72.1
Performance upper bound | -- | 80.0
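Accuracy on WiC is the percentage of instances whose True/False label is predicted correctly. The sketch below shows one way to represent the four example instances from the slides and score a trivial baseline. The class name `WiCInstance`, the field names, and the `always_true` baseline are illustrative assumptions, not the official data format or any leaderboard system.

```python
from dataclasses import dataclass


@dataclass
class WiCInstance:
    target: str
    context_1: str
    context_2: str
    label: bool  # True if the target has the same meaning in both contexts


# The four example instances shown on the slides.
EXAMPLES = [
    WiCInstance("bed", "There's a lot of trash on the bed of the river",
                "I keep a glass of water next to my bed when I sleep", False),
    WiCInstance("land", "The pilot managed to land the airplane safely",
                "The enemy landed several of our aircrafts", False),
    WiCInstance("air", "Air pollution",
                "Open a window and let in some air", True),
    WiCInstance("window", "The expanded window will give us time to catch the thieves",
                "You have a two-hour window of clear weather to finish working on the lawn", True),
]


def always_true(instance: WiCInstance) -> bool:
    """Trivial baseline (hypothetical): always predict 'same meaning'."""
    return True


def accuracy(predict, instances) -> float:
    """Percentage of instances whose predicted label matches the gold label."""
    correct = sum(predict(x) == x.label for x in instances)
    return 100.0 * correct / len(instances)


print(accuracy(always_true, EXAMPLES))  # 50.0 on these four examples
```

Because half of these examples are True and half are False, a constant prediction scores 50, which gives a sense of scale for the accuracies in the table.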
Difficulty of Language Generation
Massive vocabulary size
Dynamic word order
Syntax and grammar
Fluency
Natural Language Processing (Computational Linguistics)
Applications of NLP
Machine Translation
Information Retrieval
Document Summarisation
Question Answering
Plagiarism Detection
Document Classification
Spam Detection
Fake News Detection
Chatbots
Social Media Analysis
Sentiment Analysis
Tip of the Tongue (ToT)
Reverse dictionary
NLP and Deep Learning
Source: XenonStack
NLP and Deep Learning
Word Sense Disambiguation
Conventional approach
Extract (hand-crafted) features:
• Surrounding words
• Part of speech tags
• Collocations
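The hand-crafted features listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `wsd_features`, the feature naming scheme, and the window size are hypothetical, and part-of-speech tags are left out because they would require an external tagger.

```python
import re


def tokenize(sentence: str) -> list[str]:
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def wsd_features(sentence: str, target: str, window: int = 2) -> dict[str, int]:
    """Binary features describing the context of the first occurrence of target."""
    tokens = tokenize(sentence)
    i = tokens.index(target)  # raises ValueError if the target is absent
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]

    # Surrounding words: every token inside the window, position ignored.
    features = {f"word={w}": 1 for w in left + right}

    # Collocations: the target together with its immediate neighbour.
    if left:
        features[f"colloc_left={left[-1]}_{target}"] = 1
    if right:
        features[f"colloc_right={target}_{right[0]}"] = 1
    return features


feats = wsd_features("There's a lot of trash on the bed of the river", "bed")
print(sorted(feats))
```

A conventional system would feed such feature dictionaries to a separate classifier; the features themselves encode the designer's guess about what matters.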
NLP and Deep Learning
Word Sense Disambiguation
DL-based approach
• End-to-end model
• Input words, output classes
• No features involved
Figure from Kågebäck and Salomonsson (2016)
NLP and Deep Learning
Sentence Similarity Measurement
Figure from Google AI blog
NLP and Deep Learning
Sentence Similarity Measurement
Conventional approach
Extract features:
• String-based: if their words look similar (phone vs. telephone)
• Semantic: if their words have similar meanings (dozens of individual techniques)
• Style: ratio of function words, if they have overlapping numbers
• Phonetic: if they sound similar
• …
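A few of the feature types above can be sketched with the Python standard library alone. This is an illustrative sketch, not a complete system: the function names and the tiny `FUNCTION_WORDS` list are assumptions, and the semantic and phonetic features are omitted because they need external resources.

```python
import re
from difflib import SequenceMatcher

# Tiny illustrative list; a real system would use a much fuller inventory.
FUNCTION_WORDS = {"a", "an", "the", "of", "in", "on", "to", "is", "and"}


def tokens(sentence: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", sentence.lower())


def function_word_ratio(toks: list[str]) -> float:
    return sum(t in FUNCTION_WORDS for t in toks) / len(toks) if toks else 0.0


def similarity_features(s1: str, s2: str) -> dict[str, float]:
    t1, t2 = tokens(s1), tokens(s2)
    set1, set2 = set(t1), set(t2)
    union = set1 | set2
    nums1 = {t for t in t1 if t.isdigit()}
    nums2 = {t for t in t2 if t.isdigit()}
    return {
        # String-based: character-level similarity of the raw strings.
        "string_ratio": SequenceMatcher(None, s1.lower(), s2.lower()).ratio(),
        # String-based: share of word types the two sentences have in common.
        "word_jaccard": len(set1 & set2) / len(union) if union else 0.0,
        # Style: difference in the ratio of function words.
        "function_word_ratio_diff": abs(function_word_ratio(t1) - function_word_ratio(t2)),
        # Style: whether the sentences mention at least one common number.
        "shared_numbers": float(bool(nums1 & nums2)),
    }


print(similarity_features("I have 2 phones", "She owns 2 telephones"))
```

Each feature captures one narrow signal, so conventional systems typically combine many of them in a trained regressor or classifier.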
NLP and Deep Learning
Sentence Similarity Measurement
DL-based approach
Figure from Mueller and Thyagarajan (2016)
NLP and Deep Learning
Stance detection
Gibraltar source says the Iranian tanker Grace-1 will be allowed to leave
Agree: Iran says Britain might release seized Grace 1 oil tanker soon
Disagree: Iranian tanker continues to be detained by Gibraltar
NLP and Deep Learning
Stance detection
Conventional approach
Extract (hand-crafted) features:
• Word overlaps
• Word frequencies
• Count features
• …
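The overlap and count features above can be sketched on the tanker example from the earlier slide. The function name `stance_features` and the exact feature definitions are illustrative assumptions; real systems differ in how they define each feature.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def stance_features(claim: str, headline: str) -> dict[str, float]:
    c, h = Counter(tokenize(claim)), Counter(tokenize(headline))
    shared = set(c) & set(h)
    union = set(c) | set(h)
    return {
        # Word overlap: Jaccard similarity of the two vocabularies.
        "word_overlap": len(shared) / len(union) if union else 0.0,
        # Word frequencies: how often the shared words occur in both texts.
        "shared_word_frequency": float(sum(min(c[w], h[w]) for w in shared)),
        # Count features: token counts of each text.
        "claim_length": float(sum(c.values())),
        "headline_length": float(sum(h.values())),
    }


claim = "Gibraltar source says the Iranian tanker Grace-1 will be allowed to leave"
agree = "Iran says Britain might release seized Grace 1 oil tanker soon"
disagree = "Iranian tanker continues to be detained by Gibraltar"

print(stance_features(claim, agree)["word_overlap"])
print(stance_features(claim, disagree)["word_overlap"])
```

On this pair the disagreeing headline overlaps more with the claim than the agreeing one does, which shows why surface overlap alone cannot decide stance and has to be combined with other features in a classifier.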
NLP and Deep Learning
Stance detection
DL-based approach
• End-to-end
NLP and Deep Learning
Word embeddings (2013)
Khatam pizza
desk
rain
NLP and Deep Learning
Word embeddings (2013)
train, rail, station, passenger, railway, bus, terminal, transit
flower, fruit, tree, seed, leaf
university, education, library, studies
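Word embeddings place related words close together in a vector space, and closeness is commonly measured with cosine similarity. The sketch below uses hand-made three-dimensional toy vectors purely for illustration; they are assumptions chosen to mimic the clusters above, not trained embeddings, and the helper names are hypothetical.

```python
import math

# Hand-made toy vectors (NOT trained embeddings): one rough axis per cluster.
TOY_VECTORS = {
    "train":   [0.9, 0.1, 0.0],
    "railway": [0.8, 0.2, 0.1],
    "flower":  [0.1, 0.9, 0.0],
    "leaf":    [0.0, 0.8, 0.2],
    "library": [0.1, 0.0, 0.9],
}


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word: str) -> str:
    """Most similar other word in the toy vocabulary."""
    others = (w for w in TOY_VECTORS if w != word)
    return max(others, key=lambda w: cosine(TOY_VECTORS[word], TOY_VECTORS[w]))


print(nearest("train"))   # railway
print(nearest("flower"))  # leaf
```

With real embeddings the vectors have hundreds of dimensions and are learned from text, but the nearest-neighbour lookup works the same way.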
NLP and deep learning
Contextualised Models (since 2018)
A new turning point in NLP
Evolving very rapidly
2013: Word2vec
2014: GloVe
2018: ELMo, ULMFiT, GPT, BERT
2019: GPT-2, XLNet, RoBERTa
NLP and deep learning
Contextualised Models
One system for all tasks!
Natural Language Processing
Main Current Research Challenges
Existing challenges in NLP
Natural Language Understanding
• Learning language from the ground up
• Innate biases vs. learning from scratch
• Linguistics, cognitive and neuroscience aspects
• Reasoning
Existing challenges in NLP
NLP for low-resource languages
• Lack of data, for training and for evaluation
• Incentives
• Universal language models
• Cross-lingual representations
Existing challenges in NLP
Reasoning at scale
Current NLP is unable to analyze large or multiple documents
A challenging task:
• NarrativeQA: questions about entire movie scripts and books
Existing challenges in NLP
Evaluation
Current evaluation benchmarks and performance metrics often themselves need re-evaluation!
• Machine Translation
• Dialogue
• Language Generation
Language Modeling
Language Model
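A language model estimates how likely each next symbol is, given the symbols that came before, and can therefore be sampled to generate text. The sketch below is a count-based character n-gram model, used here as a simple stand-in assumption rather than a neural network; the function names and the one-sentence corpus are illustrative.

```python
import random
from collections import Counter, defaultdict


def train_char_lm(text: str, order: int = 3) -> dict[str, Counter]:
    """Count which character follows each context of `order` characters."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for i in range(len(text) - order):
        context, next_char = text[i:i + order], text[i + order]
        counts[context][next_char] += 1
    return counts


def generate(counts: dict[str, Counter], seed: str, length: int, rng: random.Random) -> str:
    """Extend the seed by sampling each next character in proportion to its count."""
    order = len(seed)
    out = seed
    for _ in range(length):
        options = counts.get(out[-order:])
        if not options:  # unseen context: stop early
            break
        chars, weights = zip(*options.items())
        out += rng.choices(chars, weights=weights)[0]
    return out


corpus = "natural language processing makes computers understand and generate natural language. "
model = train_char_lm(corpus, order=3)
sample = generate(model, "nat", 40, random.Random(0))
print(sample)
```

Trained on a large corpus with a longer context, or replaced by a neural network, the same predict-and-sample loop produces text that imitates the style of the training data.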
Language Modeling
Persian poetry
https://www.darbare.com/Post/30084
Rumi's Masnavi, Ferdowsi's Shahnameh
Language Modeling
Wikipedia articles
http://karpathy.github.io
Language Modeling
XML
http://karpathy.github.io
Language Modeling
Scientific article
Generative models
Sunspring
Thanks
Up next:
Michael Zock on the tip-of-the-tongue problem!
MZ is not a computer scientist but a psycholinguist who has worked on language production for decades.
His goal is to build computational tools that help people speak and write, be it in their mother tongue or in a foreign language. To achieve this, he relies on knowledge from psychology (psycholinguistics and neuroscience) and on engineering skills (NLP).
Those who are interested in more details may take a look at his website:
http://pageperso.lif.univ-mrs.fr/~michael.zock/