Recent advances in Deep Learning for NLP
François Yvon
LISN — CNRS and Université Paris Saclay
Data Science “Summer” School
Palaiseau, January 7th, 2021
Part I
What is Natural Language Processing?
Natural - Language - Processing
Processing is what machines do
Natural Language Processing is a scientific domain at the crossroads of Linguistics, Computer Science, and, increasingly, Machine Learning
Natural - Language - Processing
Contemporary NLP ([Jurafsky and Martin, 2000]) The HAL 9000 computer in Stanley Kubrick’s film “2001: A Space Odyssey” is one of the most recognizable characters in twentieth-century cinema. HAL is an artificial agent capable of such advanced language-processing behavior as speaking and understanding English, and at a crucial moment in the plot, even reading lips. (...)
What would it take to create at least the language-related parts of HAL? Minimally, such an agent would have to be capable of interacting with humans via language, which includes understanding humans via speech recognition and natural language understanding (and of course lip-reading), and of communicating with humans via natural language generation and speech synthesis. HAL would also need to be able to do information retrieval (finding out where needed textual resources reside), information extraction (extracting pertinent facts from those textual resources), and inference (drawing conclusions based on known facts).
Although these problems are far from completely solved (...) solving these problems, and others like them, is the main concern of the fields known as natural language processing, computational linguistics and speech recognition and synthesis, which together we call Speech and Language Processing
Dissecting NLP
Dissecting NLP: language analysis (and generation) From the speech input
l@kuzEdøpols@pikEd@bjEbjEkonEtR@savil
cousin de Paul se piquait de bien connaître sa ville
Dissecting NLP
l@kuzEdøpols@pikEd@bjEø:d@bjEkonEtR@savil
cousin de Paul se piquait de bien connaître sa ville
Dissecting NLP
Dissecting NLP: language analysis (and generation) Automatic Speech recognition - lexical decoding
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
cousin de Paul se piquait de bien connaître sa ville
Dissecting NLP
Dissecting NLP: language analysis (and generation) Automatic Speech recognition - orthographic decoding
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
le cousin de Paul se piquait de bien connaître sa ville
Dissecting NLP
Dissecting NLP: language analysis (and generation) Automatic Speech recognition - orthographic decoding
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville . (roughly: “Paul’s cousin prided himself on knowing his town well”)
Dissecting NLP
Dissecting NLP: language analysis (and generation) Automatic Speech recognition - text normalisation
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville .
DET N PRP N PRO VRB PRP ADV VRB POS N
Dissecting NLP
Dissecting NLP: language analysis (and generation) Shallow syntax - part-of-speech tagging
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville .
DET N PRP N PRO VRB PRP ADV VRB POS N
ms ms - m - ii3s - - s— fs fs
Dissecting NLP
Dissecting NLP: language analysis (and generation) Shallow syntax - morphological features induction
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville .
Dissecting NLP
Dissecting NLP: language analysis (and generation) Syntax - dependency parsing
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville .
[Dependency tree over the sentence, with arcs labelled: racine, det, de-obj, prep, nsubj, mod, prep, vobj, nobj, det]
Dissecting NLP
Dissecting NLP: language analysis (and generation) Semantic parsing
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville .
[Dependency arcs: racine, de-obj, nsubj, vobj, nobj]
( sp / se-piquer-de-1 )
Dissecting NLP
Dissecting NLP: language analysis (and generation) Semantic parsing
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville .
[Dependency arcs: de-obj, nsubj, vobj, nobj]
( sp / se-piquer-de-1
   :ARG0 (c / cousin)
   :ARG1 (k / connaitre-1) )
Dissecting NLP
Dissecting NLP: language analysis (and generation) Semantics: coreference resolution
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville .
[Dependency arcs: de-obj, vobj, nobj]
( sp / se-piquer-de-1
   :ARG0 (c / cousin)
   :ARG1 (k / connaitre-1
      :ARG0 (c)
      :ARG1 (v / ville)) )
Dissecting NLP
Dissecting NLP: language analysis (and generation) Semantics: word sense disambiguation
l@ { kuzE { dø { pol { s@ { pikE { d@ { bjE { ø:d@ {bjE { konEtR@ { sa { vil
Le cousin de Paul se piquait de bien connaître sa ville .
( sp / se-piquer-de-1
   :ARG0 (c / cousin)
   :ARG1 (k / connaitre-2
      :ARG0 (c)
      :ARG1 (v / ville)) )
cousin = ?
Dissecting NLP
Dissecting NLP: language analysis (and generation) From concepts to sound: text generation
Even harder: going all the way back
Dissecting NLP
Z Stanford NLP https://nlp.stanford.edu/projects/coref.shtml
Entity detection and linking
Dissecting NLP
Z Image from https://opendatahub.io/assets/img/posts/2019-08-21-sentiment-analysis-blog/sentiment_analysis_example.png
Relations between sentences: paraphrases, implications and contradictions
Dissecting NLP
Z Image from https://github.com/amir-zeldes/rstWeb/blob/master/gh-site/rstweb_structurer.png
Language analysis as compilation The pipeline model
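To make the pipeline view concrete, here is a minimal sketch that runs an off-the-shelf analysis chain on the running example; it assumes spaCy and its small French model fr_core_news_sm are installed, and only illustrates the successive annotation layers (tokens, parts of speech, morphology, dependencies, entities), not the specific systems discussed in these slides.

```python
# Minimal pipeline sketch (assumes: pip install spacy && python -m spacy download fr_core_news_sm)
import spacy

nlp = spacy.load("fr_core_news_sm")   # tokenizer -> tagger -> parser -> NER, applied in sequence
doc = nlp("Le cousin de Paul se piquait de bien connaître sa ville.")

for token in doc:
    # surface form, part of speech, morphological features, dependency label and governor
    print(f"{token.text:12} {token.pos_:6} {str(token.morph):30} {token.dep_:10} -> {token.head.text}")

print("Entities:", [(ent.text, ent.label_) for ent in doc.ents])
```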
Dissecting NLP
Language analysis as compilation From sounds to concepts via hidden structures
With arbitrarily long dependencies
Dissecting NLP
Hidden structure appears at all levels: sound patterns, graphemes, word grouping, meaning, interpretation
Dissecting NLP
Language analysis as compilation Such an unstable grammar: what is A language, after all?
Creativity, language change, and multifactorial sources of variation
Dissecting NLP
Language analysis as compilation The pipeline model does not work
− errors accumulate down the pipe
− early decisions require a deep analysis
+ not a defect but a feature, an infinite source of puns:
/ilalasate/: il a la santé or il alla sans thé (or sans T) or il a l’as en T
/løfisdyvwazEkiabit/: qui habite or qui habitent?
/das@filmZaduZaRdEepol/: Jean Dujardin [est / hait / et] Paul
à force de trop fumer, il a fini par [casser sa pipe]; il a [pris la porte] à l’envers
Part II
The great reduction: all you need is ... classification?
The advent of statistical NLP (90-) From [Church and Mercer, 1993]
The data intensive approach to language, which is becoming known as ’Text Analysis’ , takes a pragmatic view that is well suited to meet the recent emphasis on evaluations and concrete deliverables. Text Analysis focuses on a broad (though superficial) coverage of unrestricted text, rather than deep analysis of (artificially) restricted domain.
Z https://www.aclweb.org/anthology/J93-1000.pdf
Paul sees a man with an umbrella
[Dependency arcs: npsuj, npobj, and two candidate ppmod attachments]
Reformulation: select the syntactic head of with: V or N?
sees with or man with?
Ambiguity Resolution: a decision problem Many important natural language inferences can be viewed as problems of resolving ambiguity, either semantic or syntactic, based on properties of the surrounding context [Roth, 1998]
The great reduction: there is nothing but classification problems [1992-]
1. turn problem P into a binary decision x → y ∈ {0, 1}
2. collect annotations D = {(xᵢ, yᵢ), i = 1…N}
3. turn linguistic data into a feature vector x ∈ ℝᵖ
4. choose a parametric class of functions fθ: ℝᵖ → {0, 1} in F
5. define an error measure ℓ(yᵢ, fθ(xᵢ)) (e.g. I(yᵢ ≠ fθ(xᵢ)))
6. train fθ: θ* = argminθ L(θ) = Σᵢ ℓ(yᵢ, fθ(xᵢ))
7. report the error rate E[yᵢ ≠ fθ(xᵢ)]
Steps 4-7 are standard ML
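To make the recipe concrete, below is a minimal sketch of steps 1-7 on the PP-attachment decision from the previous slide, using scikit-learn; the tiny hand-made dataset and the feature names are illustrative assumptions, not the setup of [Roth, 1998].

```python
# Minimal sketch: PP-attachment as binary classification (toy data, illustrative only).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# 1-2: problem as a binary decision (1 = attach PP to the verb, 0 = to the noun), toy annotations
data = [({"verb": "sees", "noun": "man", "prep": "with"}, 0),
        ({"verb": "eats", "noun": "pizza", "prep": "with"}, 1),
        ({"verb": "opened", "noun": "door", "prep": "with"}, 1),
        ({"verb": "met", "noun": "girl", "prep": "from"}, 0)]
feats, labels = zip(*data)

# 3: turn the linguistic data into feature vectors (one-hot encoding of the symbolic features)
vec = DictVectorizer()
X = vec.fit_transform(feats)

# 4-6: choose a parametric family (here a log-linear model) and train it on the annotated data
clf = LogisticRegression().fit(X, labels)

# 7: report the (training) error rate -- a real experiment would use held-out data
print("error rate:", 1 - clf.score(X, labels))
print(clf.predict(vec.transform([{"verb": "sees", "noun": "man", "prep": "with"}])))
```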
Ambiguity Resolution: a decision problem
3. turn linguistic data into a feature vector x ∈ ℝᵖ
Problem P: PP-attachment
Given (PP, Verb, Noun), head(PP) = V or head(PP) = N?
(with, sees, man) or (with, to see, man), head(PP) ?= V | N
Possible features: PP = ? Verb = ? Noun = ?
Is Verb singular or plural? Does it end in -ing? What is its tense? Its mood? etc.
Is Noun singular? Plural? Abstract? Concrete? Animate?
An endless, costly, knowledge-intensive task; helped by feature selection, regularization, better training objectives
Ambiguity Resolution: a decision problem
The jack of all trades
prepositional attachment: head = V or head = N?
spell checking: word = there or word = their?
semantic disambiguation: bank = BANK/1 or BANK/2?
coreference resolution : (Mary, she) co-referent ?, is it referential ?
sentiment analysis : is text positive / negative / neutral, useful / useless ?
stance analysis : is this text arguing for / against / neutral ?
textual entailment : does e1 imply e2?
extractive summarization: is this information new ?
Also works for multiclass problems: PoS, Named-Entities, etc, etc.
Ambiguity Resolution: a decision problem
Even generalizes to structure prediction: syntax, semantic parsing, etc (harder)
x = son petit masque la gêne
(ambiguous: “his little mask bothers her” vs. “his little one masks the embarrassment”)
y1 = (S (NP (Det son) (N (Adj petit) (N masque))) (VP (Pro la) (V gêne)))
y2 = (S (NP (Det son) (N petit)) (VP (V masque) (NP (Det la) (N gêne))))
A large subfield, beautiful masterpieces, from HMMs to PCFGs to struct-SVM and more [Smith, 2011]
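As a toy illustration of the reduction (and not any specific system from [Smith, 2011]), the sketch below scores the two candidate trees for the ambiguous sentence with a hand-set log-linear model over production-rule features and picks the argmax; real structured predictors search over exponentially many candidates with dynamic programming rather than enumerating them, and learn the weights from treebanks.

```python
# Toy structured prediction: score candidate parses with features = production rules.
y1 = ("S", ("NP", ("Det", "son"), ("N", ("Adj", "petit"), ("N", "masque"))),
           ("VP", ("Pro", "la"), ("V", "gêne")))
y2 = ("S", ("NP", ("Det", "son"), ("N", "petit")),
           ("VP", ("V", "masque"), ("NP", ("Det", "la"), ("N", "gêne"))))

def rules(tree):
    """Extract production-rule features from a tree, e.g. ('S', ('NP', 'VP'))."""
    if isinstance(tree, str) or all(isinstance(c, str) for c in tree[1:]):
        return []                                  # leaf or pre-terminal: no rule feature
    label, children = tree[0], tree[1:]
    out = [(label, tuple(c[0] for c in children))]
    for c in children:
        out.extend(rules(c))
    return out

# Hand-set weights standing in for learned parameters (illustrative only).
weights = {("S", ("NP", "VP")): 1.0, ("NP", ("Det", "N")): 0.8,
           ("N", ("Adj", "N")): 0.5, ("VP", ("Pro", "V")): 0.2,
           ("VP", ("V", "NP")): 0.6}

def score(tree):
    return sum(weights.get(r, 0.0) for r in rules(tree))

best = max([y1, y2], key=score)
print("score(y1) =", score(y1), " score(y2) =", score(y2))
print("chosen parse:", "y1" if best is y1 else "y2")
```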
Ambiguity Resolution: a decision problem
Limitations & caveats: what remained difficult
defining / computing linguistic features
learning from unbalanced data distributions
computing [estimation / inference] with complex structural dependencies
mitigating annotation shortage
From a bird’s eye view: a divided field, with many task-specific techniques
Part III
The unreasonable effectiveness of deep representations
Improving NLP with neural models
# 1 improved features and classifiers for structured inputs
# 2 from closed vocabulary to infinite type models
# 3 pretraining and transfer learning
# 4 multilinguality in NLP
Improvement number #1 - from discrete to continuous contexts
Improved features and classifiers for structured inputs Encoding contexts, from feed-forward NN to RNNs to Transformer
Feed-forward networks for Language Models (2002-2012) [Bengio et al., 2003]
[Figure: context words w(i-1), w(i-2), w(i-3) embedded through a shared projection R and fed to a feed-forward network predicting w(i)]
+ word representations are meaningful / interpretable
+ implicit model smoothing
− limited, fixed-size dependencies
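A minimal PyTorch sketch of such a fixed-window neural LM is given below; the layer sizes and the 3-word context are illustrative assumptions, not the exact configuration of [Bengio et al., 2003].

```python
# Minimal sketch of a fixed-window (3-gram context) feed-forward language model.
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    def __init__(self, vocab_size, dim=64, hidden=128, context=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)   # the shared projection R
        self.ff = nn.Sequential(nn.Linear(context * dim, hidden), nn.Tanh(),
                                nn.Linear(hidden, vocab_size))

    def forward(self, ctx):                        # ctx: (batch, context) word ids
        e = self.emb(ctx).flatten(start_dim=1)     # concatenate the context embeddings
        return self.ff(e)                          # logits over the next word

model = FeedForwardLM(vocab_size=1000)
ctx = torch.randint(0, 1000, (8, 3))               # a fake batch of 3-word contexts
target = torch.randint(0, 1000, (8,))
loss = nn.CrossEntropyLoss()(model(ctx), target)
loss.backward()                                    # one illustrative training step (no optimizer)
print(loss.item())
```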
Improvement number #1 - from discrete to continuous contexts
Improved features and classifiers for structured inputs Encoding contexts, from feed-forward NN to RNNs to Transformer
Fast and cheap word embeddings with skip-gram (2013-2018)
Z From [Mikolov et al., 2013]
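For comparison, a minimal skip-gram run with gensim (assuming gensim ≥ 4 is installed; the toy corpus and hyper-parameters are illustrative, not those of [Mikolov et al., 2013]):

```python
# Minimal skip-gram sketch with gensim (sg=1 selects skip-gram rather than CBOW).
from gensim.models import Word2Vec

corpus = [["the", "cousin", "of", "paul", "knows", "his", "town"],
          ["the", "cousin", "visits", "the", "town", "with", "paul"]]

model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=50)
print(model.wv["cousin"][:5])                    # a dense vector for 'cousin'
print(model.wv.most_similar("cousin", topn=3))   # nearest neighbours in embedding space
```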
Improvement number #1 - from discrete to continuous contexts
Improved features and classifiers for structured inputs Encoding contexts, from feed-forward NN to RNNs to Transformer
Recurrent networks (+GRUs, +LSTMs) (2011-2017)
Z Figure from the Torch RNN LM blog post: http://torch.ch/blog/2016/07/25/nce.html
+ (deep) representation learned end2end
+ (almost) arbitrarily long contexts
− dependency order is hard-coded
Improved features and classifiers for structured inputs Encoding contexts, from feed-forward NN to RNNs to Transformer
Transformer networks (2017-) [Vaswani et al., 2017]
+ (almost) arbitrarily long contexts
+ trained attention = trained dependencies
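The core operation behind “trained attention = trained dependencies” is scaled dot-product self-attention; the sketch below is a bare-bones single-head version in PyTorch (dimensions are illustrative), leaving out the multi-head projections, masking and positional encodings of the full Transformer.

```python
# Bare-bones single-head scaled dot-product self-attention.
import math
import torch

def self_attention(x, wq, wk, wv):
    """x: (seq_len, d_model); wq/wk/wv: (d_model, d_head) projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / math.sqrt(k.shape[-1])   # every position attends to every position
    attn = torch.softmax(scores, dim=-1)        # learned, input-dependent "dependencies"
    return attn @ v, attn

torch.manual_seed(0)
x = torch.randn(6, 16)                          # 6 tokens, d_model = 16
wq, wk, wv = (torch.randn(16, 8) for _ in range(3))
out, attn = self_attention(x, wq, wk, wv)
print(out.shape, attn.shape)                    # (6, 8) outputs and a (6, 6) attention matrix
```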
Improvement number #1 - from discrete to continuous contexts
Improved features and classifiers for structured inputs Encoding contexts, from feed-forward NN to RNNs to Transformer
Transformer networks: the transformation
Improvement number #1 - from discrete to continuous contexts
Improved features and classifiers for structured inputs Encoding contexts, from feed-forward NN to RNNs to Transformer
Three ways to encode local and global contexts
Improvement number #2 - from infinite to finite types
Getting rid of the infamous <unk>nown word
Closed world assumption
The support of word models: fixed-size vocabulary V. Sentences with unknowns have 0 probability.
The extended support: fixed vocabulary V ∪ {<unk>}. Estimation: all words ∉ V are unked [makes <unk> very likely].
Variant: consider classes of <unk> (proper names, numbers, etc).
Subword units: morphemes, char ngrams, etc
morph-based LM: require morphological analysis, <unk> still happens
letters: no more unknown words - unknown symbols instead?
the right thing: a mixture of words and char strings
Shorter units require longer histories [estimation problems], imply longer sentences [computational problems].
Improvement number #2 - from infinite to finite types
Subword units in language models: BPEs, wordpieces, etc
Byte Pair Encoding (BPE): N deterministic merge operations
1. Learn symbol map (greedy): repeat till done, merge the most frequent symbol pair (bigram) into a compound symbol
2. Encode (greedy): split each word into compound symbols
Example from [Sennrich et al., 2016]
L = { lower, lowest, newer, wider, wide }
Merges (pair, frequency, new symbol): e + r# (3) → [er#]; l + o (2) → [lo]; [lo] + w (2) → [low]; w + i (2) → [wi]
Segmentations: [low] + [er#], [low] + e + s + [t#], n + e + w + [er#], [wid] + [er#], [wid] + [e#]
Z https://github.com/rsennrich/subword-nmt
Frequent (short) words remain atomic, rare (long) words are split
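Below is a self-contained sketch of BPE learning and greedy encoding on the toy lexicon above; it is a simplified variant written for illustration (end-of-word marked by a separate "#" symbol, ties broken arbitrarily), not the reference implementation at the subword-nmt link.

```python
# Toy BPE: learn merge operations from a word-frequency lexicon, then segment words.
from collections import Counter

def learn_bpe(words, n_merges):
    # each word is represented as a tuple of symbols, with "#" marking end of word
    vocab = {tuple(w) + ("#",): f for w, f in words.items()}
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)            # most frequent adjacent symbol pair
        merges.append(best)
        merged = {}
        for word, freq in vocab.items():            # replace the pair by a compound symbol
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        vocab = merged
    return merges

def encode(word, merges):
    symbols = list(word) + ["#"]
    for a, b in merges:                             # apply merges in the order they were learned
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b); i += 2
            else:
                out.append(symbols[i]); i += 1
        symbols = out
    return symbols

lexicon = {"lower": 1, "lowest": 1, "newer": 1, "wider": 1, "wide": 1}
merges = learn_bpe(lexicon, n_merges=6)
print(merges)
print(encode("lowest", merges), encode("unseen", merges))  # unseen words fall back to smaller units
```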
Multi-task learning & pre-training The overwhelming majority of these state-of-the-art systems address a benchmark task by applying linear statistical models to ad hoc features. In other words, the researchers themselves discover intermediate representations by engineering task-specific features. (...) Although such performance improvements can be very useful in practice, they teach us little about the means to progress toward the broader goals of natural language understanding and the elusive goals of Artificial Intelligence. In this contribution, we try to excel on multiple benchmarks while avoiding task-specific engineering. Instead we use a single learning system able to discover adequate internal representations. [Collobert et al., 2011]
A recipe for multi-task NLP
1 pretrain context-free or context-dependent word-embeddings on large “general domain” corpus with free supervision.
2 plug-in embeddings into (domain) specific task
3 resume training with a task-dependent loss
Improvement number #3 - pretraining and transfer learning
Multi-task learning & pre-training
Improvement number #3 - pretraining and transfer learning
Multi-task learning & pre-training
Popular pre-training architectures
Skip-gram [Mikolov et al., 2013] - Fasttext [Bojanowski et al., 2017]
ELMo [Peters et al., 2018] uses biRNNs at step 1; BERT [Devlin et al., 2019] and GPT-2/3 [Radford et al., 2019, Brown et al., 2020] use Transformers
ELMo and GPT-2 use half-contexts and an LM objective; BERT uses the full context and two objectives: masked LM and next sentence prediction
Improvement number #3 - pretraining and transfer learning
Multi-task learning & pre-training
Bottom layer is char n-gram conv. layers + 2 highway layers + linear projection; top layers are bidirectional LSTMs, training objective predicts next word. All layers linearly combined to yield final representation.
Z Image from https://www.mihaileric.com/posts/deep-contextualized-word-representations-elmo/
Multi-task learning & pre-training
BERT uses a Transformer architecture - the Base and Large variants have 12 and 24 layers, with 12 and 16 attention heads respectively.
Z Image from https://jalammar.github.io/illustrated-bert/
Multi-task learning & pre-training
Local and global representations
Z Image from https://jalammar.github.io/illustrated-bert/
Multi-task learning & pre-training
State-of-the-art in monolingual NLP circa 2019 [Devlin et al., 2019]
pre-train a generic contextual word representation
plug in representation into trained task-specific learner
get Bertified improvements
Z Tasks are here https://openai.com/blog/language-unsupervised/
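A minimal sketch of this recipe with the Hugging Face transformers library is given below; the checkpoint name, the toy data and the training settings are illustrative placeholders, not the experimental setup of [Devlin et al., 2019].

```python
# Minimal sketch: plug a pre-trained encoder into a task-specific classifier and fine-tune it.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["a delightful movie", "a complete waste of time"]   # toy sentiment data
labels = torch.tensor([1, 0])
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")

optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
out = model(**batch, labels=labels)   # task-specific loss on top of the pre-trained encoder
out.loss.backward()                   # fine-tuning updates all layers, not just the new head
optim.step()
print(float(out.loss))
```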
Multi-task learning & pre-training
GPT, a Transformer with “causal” self-attention, trained with next word prediction
Improvement number #3 - pretraining and transfer learning
Multi-task learning & pre-training
The many benefits of LM pretraining
almost unsupervised learning - leverage huge monolingual corpora
solve the rare “word” issue
improve lexical / phrasal / sentential representations across the board
Improvement number #3 - pretraining and transfer learning
Language model is all you need? Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. [Brown et al., 2020]
A (large) language model can perform many tasks
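The “few-shot” usage amounts to packing a few demonstrations into the context and letting the LM continue; the snippet below only illustrates the prompt format described in [Brown et al., 2020] (the demonstration pairs are made up, and no specific model or API is assumed).

```python
# Few-shot prompting: the task is specified by demonstrations in the context, not by fine-tuning.
demos = [("cheese", "fromage"), ("town", "ville"), ("cousin", "cousin")]
query = "umbrella"

prompt = "Translate English to French:\n"
prompt += "".join(f"{en} => {fr}\n" for en, fr in demos)
prompt += f"{query} =>"
print(prompt)   # fed as-is to a large LM, whose continuation is taken as the answer
```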
Improvement #4: multilinguality in NLP
Towards multilingual representations for multilingual NLP Transferring models across languages
Supervised and unsupervised multilingual embeddings
Z Image from https://tinyurl.com/fb-multilingual-MT
(A) Learn two independent skip-gram models
(B) Find an optimal rotation, align anchor words (GANs, OT, etc)
(C) Refine alignment with Procrustes
(D) Compute 1-1 word translations (with refinement step)
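Step (C) has a closed-form solution: for matrices X and Y of anchor-word embeddings (one column per word pair), the best orthogonal map is W* = UVᵀ, where UΣVᵀ is the SVD of YXᵀ. A small numpy sketch, with random data standing in for real embeddings:

```python
# Orthogonal Procrustes: find the rotation W that best maps source anchors X onto targets Y.
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 200                      # embedding dimension, number of anchor word pairs
X = rng.standard_normal((d, n))     # source-language anchor embeddings (one column per word)
R_true = np.linalg.qr(rng.standard_normal((d, d)))[0]
Y = R_true @ X + 0.01 * rng.standard_normal((d, n))   # target anchors = rotated source + noise

# W* = argmin_{W orthogonal} ||W X - Y||_F  has the closed form  W* = U V^T,  with U S V^T = SVD(Y X^T)
U, _, Vt = np.linalg.svd(Y @ X.T)
W = U @ Vt

print("recovery error:", np.linalg.norm(W - R_true))                       # close to 0
print("mapping residual:", np.linalg.norm(W @ X - Y) / np.linalg.norm(Y))
```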
Improvement #4: multilinguality in NLP
Towards multilingual representations for multilingual NLP Transferring models across languages
Learning multilingual contextual embeddings - XLM [Lample and Conneau, 2019]
Train with continuous stream of sentences
Do not use “next sentence prediction” objective
Train with language agnostic units and multiple languages
Z Images © A. Conneau & G. Lample (2018)
Improvement #4: multilinguality in NLP
Towards multilingual representations for multilingual NLP Transferring models across languages
Learning contextual multilingual embeddings - TLM
Learn a shared subword vocabulary
Train a single Transformer on MLM+TLM using parallel data (supervision)
Z Images © A. Conneau & G. Lample (2018)
Improvement #4: multilinguality in NLP
Towards multilingual representations for multilingual NLP Transferring models across languages
“Zero-shot” X-lingual transfer
Train with annotated texts in English or other high-resource languages
Make predictions for texts in under-resourced languages
Part IV
Hard nuts, high hanging fruits?
Computing with symbols and structures
A token is not a character string, a sentence is not a word string
Structure as latent / unobserved variables
This is a multilingual world
Is (Standard) English sufficient to account for the world’s linguistic diversity and creativity?
Z https://www.ethnologue.com/guides/how-many-languages
Computing with external knowledge and grounding Example by D. Hofstadter, translation by Gtranslate
In their house, everything comes in pairs. There’s his car and her car, his towels and her towels, and his library and hers.
Dans leur maison, tout vient en paires. Il y a sa voiture et sa voiture, ses serviettes et ses serviettes, sa bibliothèque et les siennes.
The automatic translation loses the his/her contrast: French possessives agree with the possessed noun, so “his car and her car” both come out as “sa voiture”, a distinction that cannot be restored without knowing who owns what.
Recap
State of play: useful yet imperfect tools for many tasks
require large annotated data and scalable techniques / models
a generic multi-purpose architecture: pre-training + Transformers + [ classifier | seq2seq] models
mostly superficial language analysis (if any)
Challenges ahead: data shortage will not disappear
interfacing language with perception, knowledge and reasoning
there is much more than English
Part V
References
Bibliography I
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003. ISSN 1532-4435.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017. ISSN 2307-387X.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. 2020. URL http://arxiv.org/pdf/2005.14165.
Kenneth W. Church. A pendulum swung too far. Linguistic Issues in Language Technology, 6 (8), 2011.
Kenneth W. Church and Robert L. Mercer. Introduction to computational linguistics special issue on large corpora. Computational Linguistics, 1(19):1–24, 1993.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, 2019. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/N19-1423.
Daniel Jurafsky and James H. Martin. Speech and Language Processing. Prentice Hall, 2000.
Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. CoRR, abs/1901.07291, 2019. URL http://arxiv.org/abs/1901.07291.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proceedings of International Conference on Representation Learning, 2013.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/N18-1202.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019. URL https://openai.com/blog/better-language-models/.
Dan Roth. Learning to resolve natural language ambiguities: a unified approach. In Proceedings of the annual meeting of the American Association for Artificial Intelligence (AAAI), pages 806–813, Madison, WI, 1998.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, August 2016. doi: 10.18653/v1/P16-1162. URL https://www.aclweb.org/anthology/P16-1162.
Noah A. Smith. Linguistic Structure Prediction. Synthesis Lectures on Human Language Technologies. Morgan and Claypool, May 2011.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.