Recent advances in Deep Learning for NLP
François Yvon
LISN — CNRS and Université Paris Saclay
Data Science “Summer” School
Palaiseau, January 7th, 2021
Part I
What is Natural Language Processing?
Natural - Language - Processing
Processing is what machines do
Natural Language Processing is a scientific domain at the crossroads of Linguistics, Computer Science, and, increasingly, Machine Learning
Natural - Language - Processing
Contemporary NLP ([Jurafsky and Martin, 2000]) The HAL 9000 computer in Stanley Kubrick’s film “2001: A Space Odyssey” is one of the most recognizable characters in twentieth-century cinema. HAL is an artificial agent capable of such advanced language-processing behavior as speaking and understanding English, and at a crucial moment in the plot, even reading lips. (...)
What would it take to create at least the language-related parts of HAL? Minimally, such an agent would have to be capable of interacting with humans via language, which includes understanding humans via speech recognition and natural language understanding (and of course lip-reading), and of communicating with humans via natural language generation and speech synthesis. HAL would also need to be able to do information retrieval (finding out where needed textual resources reside), information extraction (extracting pertinent facts from those textual resources), and inference (drawing conclusions based on known facts).
Although these problems are far from completely solved (...) solving these problems, and others like them, is the main concern of the fields known as natural language processing, computational linguistics and speech recognition and synthesis, which together we call Speech and Language Processing
Dissecting NLP
Dissecting NLP: language analysis (and generation) From the speech input
l@kuzEdøpols@pikEd@bjEbjEkonEtR@savil
cousin de Paul se piquait de bien connaître sa ville
Dissecting NLP
l@kuzEdøpols@pikEd@bjEø:d@bjEkonEtR@savil
cousin de Paul se piquait de bien connaître sa ville
Dissecting NLP
Dissecting NLP: language analysis (and generation) Automatic Speech recognition - lexical decoding
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
cousin de Paul se piquait de bien connaître sa ville
Dissecting NLP
Dissecting NLP: language analysis (and generation) Automatic Speech recognition - orthographic decoding
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
le cousin de Paul se piquait de bien connaître sa ville
Dissecting NLP
Dissecting NLP: language analysis (and generation) Automatic Speech recognition - orthographic decoding
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville . (roughly: “Paul’s cousin prided himself on knowing his town well”)
Dissecting NLP
Dissecting NLP: language analysis (and generation) Automatic Speech recognition - text normalisation
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville .
DET N PRP N PRO VRB PRP ADV VRB POS N
Dissecting NLP
Dissecting NLP: language analysis (and generation) Shallow syntax - part-of-speech tagging
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville .
DET N PRP N PRO VRB PRP ADV VRB POS N
ms ms - m - ii3s - - s— fs fs
Dissecting NLP
Dissecting NLP: language analysis (and generation) Shallow syntax - morphological features induction
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville .
Dissecting NLP
Dissecting NLP: language analysis (and generation) Syntax - dependency parsing
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville .
[Dependency tree over the sentence, with arcs labelled: racine, det, de-obj, prep, nsubj, mod, prep, vobj, nobj, det]
Dissecting NLP
Dissecting NLP: language analysis (and generation) Semantic parsing
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville .
[Dependency arcs: racine, de-obj, nsubj, vobj, nobj]
( sp / se-piquer-de-1 )
Dissecting NLP
Dissecting NLP: language analysis (and generation) Semantic parsing
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville .
[Dependency arcs: de-obj, nsubj, vobj, nobj]
( sp / se-piquer-de-1
   :ARG0 (c / cousin)
   :ARG1 (k / connaitre-1) )
Dissecting NLP
Dissecting NLP: language analysis (and generation) Semantics: coreference resolution
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville .
[Dependency arcs: de-obj, vobj, nobj]
( sp / se-piquer-de-1
   :ARG0 (c / cousin)
   :ARG1 (k / connaitre-1
      :ARG0 (c)
      :ARG1 (v / ville)) )
Dissecting NLP
Dissecting NLP: language analysis (and generation) Semantics: word sense disambiguation
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville .
( sp / se-piquer-de-1
   :ARG0 (c / cousin)
   :ARG1 (k / connaitre-2
      :ARG0 (c)
      :ARG1 (v / ville)) )
cousin = ?
Dissecting NLP
Dissecting NLP: language analysis (and generation) From concepts to sound: text generation
Even harder: going all the way back
Dissecting NLP
Z Stanford NLP https://nlp.stanford.edu/projects/coref.shtml
Entity detection and linking
Dissecting NLP
Z Image from https://opendatahub.io/assets/img/posts/2019-08-21-sentiment-analysis-blog/sentiment_analysis_example.png
Relations between sentences: paraphrases, implications and contradictions
Dissecting NLP
Z Image from https://github.com/amir-zeldes/rstWeb/blob/master/gh-site/rstweb_structurer.png
Language analysis as compilation The pipeline model
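To make the pipeline view concrete, here is a minimal sketch that runs an off-the-shelf analysis chain on the running example; it assumes spaCy and its small French model fr_core_news_sm are installed, and only illustrates the successive annotation layers (tokens, parts of speech, morphology, dependencies, entities), not the specific systems discussed in these slides.

```python
# Minimal pipeline sketch (assumes: pip install spacy && python -m spacy download fr_core_news_sm)
import spacy

nlp = spacy.load("fr_core_news_sm")   # tokenizer -> tagger -> parser -> NER, applied in sequence
doc = nlp("Le cousin de Paul se piquait de bien connaître sa ville.")

for token in doc:
    # surface form, part of speech, morphological features, dependency label and governor
    print(f"{token.text:12} {token.pos_:6} {str(token.morph):30} {token.dep_:10} -> {token.head.text}")

print("Entities:", [(ent.text, ent.label_) for ent in doc.ents])
```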
Dissecting NLP
Language analysis as compilation From sounds to concepts via hidden structures
With arbitrarily long dependencies
Dissecting NLP
Hidden structure appears at all levels: sound patterns, graphemes, word grouping, meaning, interpretation
Dissecting NLP
Language analysis as compilation Such an unstable grammar: what is A language, after all?
Creativity, language change, and multifactorial sources of variation
Dissecting NLP
Language analysis as compilation The pipeline model does not work
− errors accumulate down the pipe
− early decisions require a deep analysis
+ not a defect but a feature, an infinite source of puns:
/ilalasate/: il a la santé or il alla sans thé (or sans T) or il a l’as en T
/løfisdyvwazEkiabit/: qui habite or qui habitent?
/das@filmZaduZaRdEepol/: Jean Dujardin [est / hait / et] Paul
à force de trop fumer, il a fini par [casser sa pipe]; il a [pris la porte] à l’envers
Part II
The great reduction: all you need is ... classification?
The advent of statistical NLP (90-) From [Church and Mercer, 1993]
The data intensive approach to language, which is becoming known as ’Text Analysis’ , takes a pragmatic view that is well suited to meet the recent emphasis on evaluations and concrete deliverables. Text Analysis focuses on a broad (though superficial) coverage of unrestricted text, rather than deep analysis of (artificially) restricted domain.
Z https://www.aclweb.org/anthology/J93-1000.pdf
Paul sees a man with an umbrella
[Dependency arcs: npsuj, npobj, and two candidate ppmod attachments]
Reformulation: select the syntactic head of with: V or N?
sees with or man with?
Ambiguity Resolution: a decision problem Many important natural language inferences can be viewed as problems of resolving ambiguity, either semantic or syntactic, based on properties of the surrounding context [Roth, 1998]
The great reduction: there is nothing but classification problems [1992-]
1. turn problem P into a binary decision x → y ∈ {0, 1}
2. collect annotations D = {(xᵢ, yᵢ), i = 1…N}
3. turn linguistic data into a feature vector x ∈ ℝᵖ
4. choose a parametric class of functions fθ: ℝᵖ → {0, 1} in F
5. define an error measure ℓ(yᵢ, fθ(xᵢ)) (e.g. I(yᵢ ≠ fθ(xᵢ)))
6. train fθ: θ* = argminθ L(θ) = Σᵢ ℓ(yᵢ, fθ(xᵢ))
7. report the error rate E[yᵢ ≠ fθ(xᵢ)]
Steps 4-7 are standard ML
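To make the recipe concrete, below is a minimal sketch of steps 1-7 on the PP-attachment decision from the previous slide, using scikit-learn; the tiny hand-made dataset and the feature names are illustrative assumptions, not the setup of [Roth, 1998].

```python
# Minimal sketch: PP-attachment as binary classification (toy data, illustrative only).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# 1-2: problem as a binary decision (1 = attach PP to the verb, 0 = to the noun), toy annotations
data = [({"verb": "sees", "noun": "man", "prep": "with"}, 0),
        ({"verb": "eats", "noun": "pizza", "prep": "with"}, 1),
        ({"verb": "opened", "noun": "door", "prep": "with"}, 1),
        ({"verb": "met", "noun": "girl", "prep": "from"}, 0)]
feats, labels = zip(*data)

# 3: turn the linguistic data into feature vectors (one-hot encoding of the symbolic features)
vec = DictVectorizer()
X = vec.fit_transform(feats)

# 4-6: choose a parametric family (here a log-linear model) and train it on the annotated data
clf = LogisticRegression().fit(X, labels)

# 7: report the (training) error rate -- a real experiment would use held-out data
print("error rate:", 1 - clf.score(X, labels))
print(clf.predict(vec.transform([{"verb": "sees", "noun": "man", "prep": "with"}])))
```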
Ambiguity Resolution: a decision problem
3. turn linguistic data into a feature vector x ∈ ℝᵖ
Problem P: PP-attachment
Given (PP, Verb, Noun), head(PP) = V or head(PP) = N?
(with, sees, man) or (with, to see, man), head(PP) ?= V | N
Possible features: PP = ? Verb = ? Noun = ?
Is Verb singular or plural? Does it end in -ing? What is its tense? Its mood? etc.
Is Noun singular? Plural? Abstract? Concrete? Animate?
An endless, costly, knowledge-intensive task; helped by feature selection, regularization, better training objectives
Ambiguity Resolution: a decision problem
The jack of all trades
prepositional attachment: head = V or head = N?
spell checking: word = there or word = their?
semantic disambiguation: bank = BANK/1 or BANK/2?
coreference resolution : (Mary, she) co-referent ?, is it referential ?
sentiment analysis : is text positive / negative / neutral, useful / useless ?
stance analysis : is this text arguing for / against / neutral ?
textual entailment : does e1 imply e2?
extractive summarization: is this information new ?
Also works for multiclass problems: PoS, Named-Entities, etc, etc.
Ambiguity Resolution: a decision problem
Even generalizes to structure prediction: syntax, semantic parsing, etc (harder)
x = son petit masque la gêne
(ambiguous: “his little mask bothers her” vs. “his little one masks the embarrassment”)
y1 = (S (NP (Det son) (N (Adj petit) (N masque))) (VP (Pro la) (V gêne)))
y2 = (S (NP (Det son) (N petit)) (VP (V masque) (NP (Det la) (N gêne))))
A large subfield, beautiful masterpieces, from HMMs to PCFGs to struct-SVM and more [Smith, 2011]
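As a toy illustration of the reduction (and not any specific system from [Smith, 2011]), the sketch below scores the two candidate trees for the ambiguous sentence with a hand-set log-linear model over production-rule features and picks the argmax; real structured predictors search over exponentially many candidates with dynamic programming rather than enumerating them, and learn the weights from treebanks.

```python
# Toy structured prediction: score candidate parses with features = production rules.
y1 = ("S", ("NP", ("Det", "son"), ("N", ("Adj", "petit"), ("N", "masque"))),
           ("VP", ("Pro", "la"), ("V", "gêne")))
y2 = ("S", ("NP", ("Det", "son"), ("N", "petit")),
           ("VP", ("V", "masque"), ("NP", ("Det", "la"), ("N", "gêne"))))

def rules(tree):
    """Extract production-rule features from a tree, e.g. ('S', ('NP', 'VP'))."""
    if isinstance(tree, str) or all(isinstance(c, str) for c in tree[1:]):
        return []                                  # leaf or pre-terminal: no rule feature
    label, children = tree[0], tree[1:]
    out = [(label, tuple(c[0] for c in children))]
    for c in children:
        out.extend(rules(c))
    return out

# Hand-set weights standing in for learned parameters (illustrative only).
weights = {("S", ("NP", "VP")): 1.0, ("NP", ("Det", "N")): 0.8,
           ("N", ("Adj", "N")): 0.5, ("VP", ("Pro", "V")): 0.2,
           ("VP", ("V", "NP")): 0.6}

def score(tree):
    return sum(weights.get(r, 0.0) for r in rules(tree))

best = max([y1, y2], key=score)
print("score(y1) =", score(y1), " score(y2) =", score(y2))
print("chosen parse:", "y1" if best is y1 else "y2")
```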
Ambiguity Resolution: a decision problem
Limitations & caveats: what remained difficult
defining / computing linguistic features
learning from unbalanced data distributions
computing [estimation / inference] with complex structural dependencies
mitigating annotation shortage
From a bird’s eye view: a divided field, with many task-specific techniques
Part III
The unreasonable effectiveness of deep representations
Improving NLP with neural models
# 1 improved features and classifiers for structured inputs
# 2 from closed vocabulary to infinite type models
# 3 pretraining and transfer learning
# 4 multilinguality in NLP
Improvement number #1 - from discrete to continuous contexts
Improved features and classifiers for structured inputs Encoding contexts, from feed-forward NN to RNNs to Transformer
Feed-forward networks for Language Models (2002-2012) [Bengio et al., 2003]
[Figure: context words w(i-1), w(i-2), w(i-3) embedded through a shared projection R and fed to a feed-forward network predicting w(i)]
+ word representations are meaningful / interpretable
+ implicit model smoothing
− limited, fixed-size dependencies
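A minimal PyTorch sketch of such a fixed-window neural LM is given below; the layer sizes and the 3-word context are illustrative assumptions, not the exact configuration of [Bengio et al., 2003].

```python
# Minimal sketch of a fixed-window (3-gram context) feed-forward language model.
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    def __init__(self, vocab_size, dim=64, hidden=128, context=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)   # the shared projection R
        self.ff = nn.Sequential(nn.Linear(context * dim, hidden), nn.Tanh(),
                                nn.Linear(hidden, vocab_size))

    def forward(self, ctx):                        # ctx: (batch, context) word ids
        e = self.emb(ctx).flatten(start_dim=1)     # concatenate the context embeddings
        return self.ff(e)                          # logits over the next word

model = FeedForwardLM(vocab_size=1000)
ctx = torch.randint(0, 1000, (8, 3))               # a fake batch of 3-word contexts
target = torch.randint(0, 1000, (8,))
loss = nn.CrossEntropyLoss()(model(ctx), target)
loss.backward()                                    # one illustrative training step (no optimizer)
print(loss.item())
```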
Improvement number #1 - from discrete to continuous contexts
Improved features and classifiers for structured inputs Encoding contexts, from feed-forward NN to RNNs to Transformer
Fast and cheap word embeddings with skip-gram (2013-2018)
Z From [Mikolov et al., 2013]
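For comparison, a minimal skip-gram run with gensim (assuming gensim ≥ 4 is installed; the toy corpus and hyper-parameters are illustrative, not those of [Mikolov et al., 2013]):

```python
# Minimal skip-gram sketch with gensim (sg=1 selects skip-gram rather than CBOW).
from gensim.models import Word2Vec

corpus = [["the", "cousin", "of", "paul", "knows", "his", "town"],
          ["the", "cousin", "visits", "the", "town", "with", "paul"]]

model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=50)
print(model.wv["cousin"][:5])                    # a dense vector for 'cousin'
print(model.wv.most_similar("cousin", topn=3))   # nearest neighbours in embedding space
```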
Improvement number #1 - from discrete to continuous contexts
Improved features and classifiers for structured inputs Encoding contexts, from feed-forward NN to RNNs to Transformer
Recurrent networks (+GRUs, +LSTMs) (2011-2017)
Z Figure from the Torch RNN LM blog post: http://torch.ch/blog/2016/07/25/nce.html
+ (deep) representation learned end2end
+ (almost) arbitrarily long contexts
− dependency order is hard-coded
Improved features and classifiers for structured inputs Encoding contexts, from feed-forward NN to RNNs to Transformer
Transformer networks (2017-) [Vaswani et al., 2017]
+ (almost) arbitrarily long contexts
+ trained attention = trained dependencies
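The core operation behind “trained attention = trained dependencies” is scaled dot-product self-attention; the sketch below is a bare-bones single-head version in PyTorch (dimensions are illustrative), leaving out the multi-head projections, masking and positional encodings of the full Transformer.

```python
# Bare-bones single-head scaled dot-product self-attention.
import math
import torch

def self_attention(x, wq, wk, wv):
    """x: (seq_len, d_model); wq/wk/wv: (d_model, d_head) projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / math.sqrt(k.shape[-1])   # every position attends to every position
    attn = torch.softmax(scores, dim=-1)        # learned, input-dependent "dependencies"
    return attn @ v, attn

torch.manual_seed(0)
x = torch.randn(6, 16)                          # 6 tokens, d_model = 16
wq, wk, wv = (torch.randn(16, 8) for _ in range(3))
out, attn = self_attention(x, wq, wk, wv)
print(out.shape, attn.shape)                    # (6, 8) outputs and a (6, 6) attention matrix
```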
Improvement number #1 - from discrete to continuous contexts
Improved features and classifiers for structured inputs Encoding contexts, from feed-forward NN to RNNs to Transformer
Transformer networks: the transformation
Improvement number #1 - from discrete to continuous contexts
Improved features and classifiers for structured inputs Encoding contexts, from feed-forward NN to RNNs to Transformer
Three ways to encode local and global contexts
Improvement number #2 - from infinite to finite types
Getting rid of the infamous <unk>nown word
Closed world assumption
The support of word models: fixed-size vocabulary V. Sentences with unknowns have 0 probability.
The extended support: fixed vocabulary V ∪ {<unk>}. Estimation: all words ∉ V are unked [makes <unk> very likely].
Variant: consider classes of <unk> (proper names, numbers, etc).
Subword units: morphemes, char ngrams, etc
morph-based LM: require morphological analysis, <unk> still happens
letters: no more unknown words - unknown symbols instead?
the right thing: a mixture of words and char strings
Shorter units require longer histories [estimation problems], imply longer sentences [computational problems].
Improvement number #2 - from infinite to finite types
Subword units in language models: BPEs, wordpieces, etc
Byte Pair Encoding (BPE): N deterministic merge operations
1. Learn symbol map (greedy): repeat till done, merge the most frequent symbol pair (bigram) into a compound symbol
2. Encode (greedy): split each word into compound symbols
Example from [Sennrich et al., 2016]
L = { lower, lowest, newer, wider, wide }
Merges (pair, frequency, new symbol): e + r# (3) → [er#]; l + o (2) → [lo]; [lo] + w (2) → [low]; w + i (2) → [wi]
Segmentations: [low] + [er#], [low] + e + s + [t#], n + e + w + [er#], [wid] + [er#], [wid] + [e#]
Z https://github.com/rsennrich/subword-nmt
Frequent (short) words remain atomic, rare (long) words are split
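Below is a self-contained sketch of BPE learning and greedy encoding on the toy lexicon above; it is a simplified variant written for illustration (end-of-word marked by a separate "#" symbol, ties broken arbitrarily), not the reference implementation at the subword-nmt link.

```python
# Toy BPE: learn merge operations from a word-frequency lexicon, then segment words.
from collections import Counter

def learn_bpe(words, n_merges):
    # each word is represented as a tuple of symbols, with "#" marking end of word
    vocab = {tuple(w) + ("#",): f for w, f in words.items()}
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)            # most frequent adjacent symbol pair
        merges.append(best)
        merged = {}
        for word, freq in vocab.items():            # replace the pair by a compound symbol
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        vocab = merged
    return merges

def encode(word, merges):
    symbols = list(word) + ["#"]
    for a, b in merges:                             # apply merges in the order they were learned
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b); i += 2
            else:
                out.append(symbols[i]); i += 1
        symbols = out
    return symbols

lexicon = {"lower": 1, "lowest": 1, "newer": 1, "wider": 1, "wide": 1}
merges = learn_bpe(lexicon, n_merges=6)
print(merges)
print(encode("lowest", merges), encode("unseen", merges))  # unseen words fall back to smaller units
```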
Multi-task learning & pre-training The overwhelming majority of these state-of-the-art systems address a benchmark task by applying linear statistical models to ad hoc features. In other words, the researchers themselves discover intermediate representations by engineering task-specific features. (...) Although such performance improvements can be very useful in practice, they teach us little about the means to progress toward the broader goals of natural language understanding and the elusive goals of Artificial Intelligence. In this contribution, we try to excel on multiple benchmarks while avoiding task-specific engineering. Instead we use a single learning system able to discover adequate internal representations. [Collobert et al., 2011]
A recipe for multi-task NLP
1 pretrain context-free or context-dependent word-embeddings on large “general domain” corpus with free supervision.
2 plug-in embeddings into (domain) specific task
3 resume training with a task-dependent loss
Improvement number #3 - pretraining and transfer learning
Multi-task learning & pre-training
Improvement number #3 - pretraining and transfer learning
Multi-task learning & pre-training
Popular pre-training architectures
Skip-gram [Mikolov et al., 2013] - Fasttext [Bojanowski et al., 2017]
ELMo [Peters et al., 2018] uses biRNNs at step 1; BERT [Devlin et al., 2019] and GPT-2/3 [Radford et al., 2019, Brown et al., 2020] use Transformers
ELMo and GPT-2 use half-contexts and an LM objective; BERT uses the full context and two objectives: masked LM and next sentence prediction
Improvement number #3 - pretraining and transfer learning
Multi-task learning & pre-training
Bottom layer is char n-gram conv. layers + 2 highway layers + linear projection; top layers are bidirectional LSTMs, training objective predicts next word. All layers linearly combined to yield final representation.
Z Image from https://www.mihaileric.com/posts/deep-contextualized-word-representations-elmo/
Multi-task learning & pre-training
BERT uses a Transformer architecture - the Base and Large variants have 12 and 24 layers, with 12 and 16 attention heads respectively.
Z Image from https://jalammar.github.io/illustrated-bert/
Multi-task learning & pre-training
Local and global representations
Z Image from https://jalammar.github.io/illustrated-bert/
Multi-task learning & pre-training
State-of-the-art in monolingual NLP circa 2019 [Devlin et al., 2019]
pre-train a generic contextual word representation
plug in representation into trained task-specific learner
get Bertified improvements
Z Tasks are here https://openai.com/blog/language-unsupervised/
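A minimal sketch of this recipe with the Hugging Face transformers library is given below; the checkpoint name, the toy data and the training settings are illustrative placeholders, not the experimental setup of [Devlin et al., 2019].

```python
# Minimal sketch: plug a pre-trained encoder into a task-specific classifier and fine-tune it.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["a delightful movie", "a complete waste of time"]   # toy sentiment data
labels = torch.tensor([1, 0])
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")

optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
out = model(**batch, labels=labels)   # task-specific loss on top of the pre-trained encoder
out.loss.backward()                   # fine-tuning updates all layers, not just the new head
optim.step()
print(float(out.loss))
```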
Multi-task learning & pre-training
GPT, a Transformer with “causal” self-attention, trained with next word prediction
Improvement number #3 - pretraining and transfer learning
Multi-task learning & pre-training
The many benefits of LM pretraining
almost unsupervised learning - leverage huge monolingual corpora
solve the rare “word” issue
improve lexical / phrasal / sentential representations across the board
Improvement number #3 - pretraining and transfer learning
Language model is all you need? Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. [Brown et al., 2020]
A (large) language model can perform many tasks
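The “few-shot” usage amounts to packing a few demonstrations into the context and letting the LM continue; the snippet below only illustrates the prompt format described in [Brown et al., 2020] (the demonstration pairs are made up, and no specific model or API is assumed).

```python
# Few-shot prompting: the task is specified by demonstrations in the context, not by fine-tuning.
demos = [("cheese", "fromage"), ("town", "ville"), ("cousin", "cousin")]
query = "umbrella"

prompt = "Translate English to French:\n"
prompt += "".join(f"{en} => {fr}\n" for en, fr in demos)
prompt += f"{query} =>"
print(prompt)   # fed as-is to a large LM, whose continuation is taken as the answer
```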
Improvement #4: multilinguality in NLP
Towards multilingual representations for multilingual NLP Transferring models across languages
Supervised and unsupervised multilingual embeddings
Z Image from https://tinyurl.com/fb-multilingual-MT
(A) Learn two independent skip-gram models
(B) Find an optimal rotation, align anchor words (GANs, OT, etc)
(C) Refine alignment with Procrustes
(D) Compute 1-1 word translations (with refinement step)
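Step (C) has a closed-form solution: for matrices X and Y of anchor-word embeddings (one column per word pair), the best orthogonal map is W* = UVᵀ, where UΣVᵀ is the SVD of YXᵀ. A small numpy sketch, with random data standing in for real embeddings:

```python
# Orthogonal Procrustes: find the rotation W that best maps source anchors X onto targets Y.
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 200                      # embedding dimension, number of anchor word pairs
X = rng.standard_normal((d, n))     # source-language anchor embeddings (one column per word)
R_true = np.linalg.qr(rng.standard_normal((d, d)))[0]
Y = R_true @ X + 0.01 * rng.standard_normal((d, n))   # target anchors = rotated source + noise

# W* = argmin_{W orthogonal} ||W X - Y||_F  has the closed form  W* = U V^T,  with U S V^T = SVD(Y X^T)
U, _, Vt = np.linalg.svd(Y @ X.T)
W = U @ Vt

print("recovery error:", np.linalg.norm(W - R_true))                       # close to 0
print("mapping residual:", np.linalg.norm(W @ X - Y) / np.linalg.norm(Y))
```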
Improvement #4: multilinguality in NLP
Towards multilingual representations for multilingual NLP Transferring models across languages
Learning multilingual contextual embeddings - XLM [Lample and Conneau, 2019]
Train with continuous stream of sentences
Do not use “next sentence prediction” objective
Train with language agnostic units and multiple languages
Z Images © A. Conneau & G. Lample (2018)
Improvement #4: multilinguality in NLP
Towards multilingual representations for multilingual NLP Transferring models across languages
Learning contextual multilingual embeddings - TLM
Learn a shared subword vocabulary
Train a single Transformer on MLM+TLM using parallel data (supervision)
Z Images © A. Conneau & G. Lample (2018)
Improvement #4: multilinguality in NLP
Towards multilingual representations for multilingual NLP Transferring models across languages
“Zero-shot” X-lingual transfer
Train with annotated texts in English or other high-resource languages
Make predictions for texts in under-resourced languages
Part IV
Hard nuts, high hanging fruits?
Computing with symbols and structures
A token is not a character string, a sentence is not a word string
Structure as latent / unobserved variables
This is a multilingual world
Is (Standard) English sufficient to account for the world’s linguistic diversity and creativity?
Z https://www.ethnologue.com/guides/how-many-languages
Computing with external knowledge and grounding Example by D. Hofstadter, translation by Gtranslate
In their house, everything comes in pairs. There’s his car and her car, his towels and her towels, and his library and hers.
Dans leur maison, tout vient en paires. Il y a sa voiture et sa voiture, ses serviettes et ses serviettes, sa bibliothèque et les siennes.
The automatic translation loses the his/her contrast: French possessives agree with the possessed noun, so “his car and her car” both come out as “sa voiture”, a distinction that cannot be restored without knowing who owns what.
Recap
State of play: useful yet imperfect tools for many tasks
require large annotated data and scalable techniques / models
a generic multi-purpose architecture: pre-training + Transformers + [ classifier | seq2seq] models
mostly superficial language analysis (if any)
Challenges ahead: data shortage will not disappear
interfacing language with perception, knowledge and reasoning
there is much more than English
Part V
References
Bibliography I
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003. ISSN 1532-4435.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017. ISSN 2307-387X.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. 2020. URL http://arxiv.org/pdf/2005.14165.
Kenneth W. Church. A pendulum swung too far. Linguistic Issues in Language Technology, 6 (8), 2011.
Kenneth W. Church and Robert L. Mercer. Introduction to computational linguistics special issue on large corpora. Computational Linguistics, 1(19):1–24, 1993.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, 2019. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/N19-1423.
Daniel Jurafsky and James H. Martin. Speech and Language Processing. Prentice Hall, 2000.
Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. CoRR, abs/1901.07291, 2019. URL http://arxiv.org/abs/1901.07291.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proceedings of International Conference on Representation Learning, 2013.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/N18-1202.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019. URL https://openai.com/blog/better-language-models/.
Dan Roth. Learning to resolve natural language ambiguities: a unified approach. In Proceedings of the annual meeting of the American Association for Artificial Intelligence (AAAI), pages 806–813, Madison, WI, 1998.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, August 2016. doi: 10.18653/v1/P16-1162. URL https://www.aclweb.org/anthology/P16-1162.
Noah A. Smith. Linguistic Structure Prediction. Synthesis Lectures on Human Language Technologies. Morgan and Claypool, May 2011.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.