A Mathematical Exploration of Language Models
Nikunj Saunshi, Princeton University
Center of Mathematical Sciences and Applications, Harvard University, 10th February 2021
Language Models
Language Model
Context s = "I went to the café and ordered a"
Distribution p_{·|s} over the next word, e.g. p_{·|s}("latte") ≈ 0.3, …, p_{·|s}("bagel") ≈ 0.05, …, p_{·|s}("dolphin") ≈ 0.0001
Next word prediction: For a context s, predict what word w would follow it
Cross-entropy objective: Assign high p_{·|s}(w) to observed (s, w) pairs, i.e. minimize E_{(s,w)}[−log p_{·|s}(w)]
Unlabeled data: Generate (s, w) pairs using sentences
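To make the objective concrete, here is a minimal sketch (my own illustration, not code from the talk: the whitespace tokenizer, the tiny EmbeddingBag model, and all names are assumptions) of generating (s, w) pairs from raw sentences and computing the cross-entropy objective E_{(s,w)}[−log p_{·|s}(w)]:

# Minimal sketch (illustrative, not the talk's code): build (s, w) pairs from
# unlabeled sentences and compute the cross-entropy objective E[-log p(w|s)].
import torch
import torch.nn.functional as F

sentences = ["I went to the cafe and ordered a latte",
             "It was a bright sunny day in Madrid"]

vocab = sorted({w for sent in sentences for w in sent.split()})
word_to_id = {w: i for i, w in enumerate(vocab)}

# (s, w) pairs: every prefix of a sentence paired with the word that follows it.
pairs = []
for sent in sentences:
    words = sent.split()
    for t in range(1, len(words)):
        pairs.append((words[:t], words[t]))

# A toy "language model": average context embedding -> logits over the vocabulary.
# Any model that outputs p(. | s) could be plugged in here.
embed = torch.nn.EmbeddingBag(len(vocab), 32)
out = torch.nn.Linear(32, len(vocab))

def cross_entropy_objective(pairs):
    total = 0.0
    for context, next_word in pairs:
        ids = torch.tensor([[word_to_id[w] for w in context]])
        logits = out(embed(ids))                          # unnormalized log p(. | s)
        target = torch.tensor([word_to_id[next_word]])
        total = total + F.cross_entropy(logits, target)   # -log p(w | s)
    return total / len(pairs)                             # average over (s, w) pairs

print(float(cross_entropy_objective(pairs)))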
Success of Language Models
Language Model
Context: s → Distribution: p_{·|s}
Architecture: Transformer, Parameters: 175B
Architecture: Transformer, Parameters: 1542M
Architecture: RNN, Parameters: 24M
Train using Cross-entropy
Downstream tasks
Text Generation: "It was a bright sunny day in ……"
Question Answering: "The capital of Spain is __"
Machine Translation: "I bought coffee" → "J'ai acheté du café"
Sentence Classification: "Science" vs "Politics"
Main Question: Why should solving the next word prediction task help solve seemingly unrelated downstream tasks with very little labeled data?
Rest of the talk
More general framework of "solving Task A helps with Task B"
Our results for Language Models, based on the recent paper "A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks" (Saunshi, Malladi, Arora; to appear at ICLR 2021)
Solving Task A helps with Task B
Solving Task A helps with Task B
• Humans can use the "experience" and "skills" acquired from Task A to solve a new Task B efficiently
Language Modeling
Task A: Next word prediction
Task B: Downstream NLP task
Ride Bicycle → Ride Motorcycle
Get a Math degree → Do well in law school later
Do basic chores → Excel at Karate (The Karate Kid)
• Adapted in Machine Learning
• More data efficient than supervised learning
  • Requires fewer labeled samples than solving Task B from scratch using supervised learning
Stage 1: Pretrain Model on Task A
Stage 2: Use Model on Task B (schematic sketch below)
• Initialize a model and fine-tune using labeled data
• Extract features and learn a classifier using labeled data
• Other innovative ways of using the pretrained model
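A schematic sketch of the two-stage pipeline (my illustration with a toy torch model and random data; encoder, head, and the hyperparameters are placeholders, not anything from the slides). Stage 2 option (a) freezes the pretrained model and trains only a linear classifier on extracted features; option (b), commented out, fine-tunes everything:

# Schematic sketch of the two-stage pipeline (illustrative names, toy data).
import torch
import torch.nn.functional as F

encoder = torch.nn.Sequential(torch.nn.Linear(100, 32), torch.nn.ReLU())  # model to pretrain on Task A
head = torch.nn.Linear(32, 2)                                             # classifier for Task B

# Stage 1: pretrain `encoder` on Task A (e.g. next word prediction) -- omitted here.

# Small labeled set for Task B (random stand-in data).
x, y = torch.randn(64, 100), torch.randint(0, 2, (64,))

# Stage 2, option (a): extract features, learn only the linear classifier.
for p in encoder.parameters():
    p.requires_grad = False
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

# Stage 2, option (b): fine-tune encoder and classifier jointly (uncomment to use).
# opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-5)

for _ in range(100):
    opt.zero_grad()
    loss = F.cross_entropy(head(encoder(x)), y)
    loss.backward()
    opt.step()
print(float(loss))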
Solving Task A helps with Task B
Language Modeling
Task A: Next word prediction
Task B: Downstream NLP task
• Transfer learning
  • Task A: Large supervised learning problem (ImageNet)
  • Task B: Object detection, disease detection using X-ray images
• Meta-learning
  • Task A: Many small tasks related to Task B
  • Task B: Related tasks (classify characters from a new language)
• Self-supervised learning (e.g. language modeling)
  • Task A: Constructed using unlabeled data
  • Task B: Downstream tasks of interest
Transfer learning and meta-learning require some labeled data in Task A; self-supervised learning requires only unlabeled data in Task A.
Solving Task A helps with Task B
"This is the single most important problem to solve in AI today" - Yann LeCun
https://www.wsj.com/articles/facebook-ai-chief-pushes-the-technologys-limits-11597334361
Self-Supervised Learning
Motivated by the following observations:
• Humans learn by observing/interacting with the world, without explicit supervision
• Supervised learning with labels is successful, but human annotations can be expensive
• Unlabeled data is available in abundance and is cheap to obtain
• Many practical algorithms following this principle do well on standard benchmarks, sometimes beating even supervised learning!
Principle: Use unlabeled data to generate labels and construct supervised learning tasks
Self-Supervised Learning
Examples in practice
• Images (just need raw images)
  • Predict color of image from b/w version
  • Reconstruct part of image from the rest of it
  • Predict the rotation applied to an image
• Text (just need a large text corpus)
  • Make representations of consecutive sentences in Wikipedia close
  • Next word prediction
  • Fill in the multiple blanks in a sentence
Task A: Constructed from unlabeled data
Task B: Downstream task of interest
Theory for Self-Supervised Learning
• We have very little mathematical understanding of this important problem.
• Theory can potentially help
  • Formalize notions of "skill learning" from tasks
  • Ground existing intuitions in math
  • Give new insights that can improve or help design practical algorithms
• Existing theoretical frameworks fail to capture this setting
  • Task A and Task B are very different
  • Task A is agnostic to Task B
• We try to gain some understanding of one such method: language modeling
A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks
Saunshi, Malladi, Arora. To appear at ICLR 2021
Theory for Language Models
Task A: Next word prediction
Task B: Downstream NLP task
Language Model
Distribution over words: p_{·|s}
Context: s = "I went to the café and ordered a"
Why should solving the next word prediction task help solve seemingly unrelated downstream tasks with very little labeled data?
Stage 1: Pretrain Language Model on Next Word Prediction
Stage 2: Use Language Model for Downstream Task
Theoretical setting
Language Model: Context s → Distribution p_{·|s}

How to use a pretrained model? → Representation Learning Perspective
✓ Extract features from LM, learn linear classifiers: effective, data-efficient, can do math
✗ Finetuning: hard to quantify its benefit using current deep learning theory

What aspects of pretraining help? → Role of task & objective
✓ Why next word prediction (w/ cross-entropy objective) intrinsically helps
✗ Inductive biases of architecture/algorithm: current tools are insufficient

What are downstream tasks? → Sentence classification
✓ First-cut analysis. Already gives interesting insights
✗ Other NLP tasks (question answering, etc.)
Theoretical setting
Why can language models that do well on the cross-entropy objective learn features that are useful for linear classification tasks?
Language Model
Context: s → Distribution: p_{·|s}
Extract d-dim features f(s)
Example classification task:
"It was an utter waste of time." → "Negative"
"I would recommend this movie." → "Positive"
Result overview
Key idea: Classification tasks can be rephrased as sentence completion problems, thus making next word prediction a meaningful pretraining task
Formalization: Show that an LM that is ε-optimal in cross-entropy learns features that linearly solve such tasks up to 𝒪(√ε)
Verification: Experimentally verify theoretical insights (also design a new objective function)
Why can language models that do well on the cross-entropy objective learn features that are useful for linear classification tasks?
Outline
• Language modeling: Cross-entropy and Softmax-parametrized LMs
• Downstream tasks: Sentence completion reformulation
• Formal guarantees: ε-optimal LM → 𝒪(√ε)-good on task
• Extensions, discussions and future work
Outline
• Language modeling: Cross-entropy and Softmax-parametrized LMs
• Downstream tasks: Sentence completion reformulation
• Formal guarantees: ε-optimal LM → 𝒪(√ε)-good on task
• Extensions, discussions and future work
Language Modeling: Cross-entropy
Language Model
Context s = "I went to the café and ordered a"
Predicted dist. p_{·|s} ∈ Δ_V (e.g. 0.3, …, 0.2, 0.05, …, 0.0001)
True dist. p*_{·|s} ∈ Δ_V (e.g. 0.35, …, 0.18, 0.047, …, 0.00005)
ℓ_xent(p_{·|s}) = E_{(s,w)}[−log p_{·|s}(w)]
Optimal solution: the minimizer of ℓ_xent(p_{·|s}) is p_{·|s} = p*_{·|s}
Proof: Can rewrite ℓ_xent(p_{·|s}) = E_s[KL(p*_{·|s}, p_{·|s})] + c
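Spelling out the rewrite (a standard identity): for each context s,
E_{w∼p*_{·|s}}[−log p_{·|s}(w)] = KL(p*_{·|s}, p_{·|s}) + H(p*_{·|s}),
so ℓ_xent(p_{·|s}) = E_s[KL(p*_{·|s}, p_{·|s})] + E_s[H(p*_{·|s})]. The second term (the constant c) does not depend on the model, and KL ≥ 0 with equality iff p_{·|s} = p*_{·|s}, which gives the claimed minimizer.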
What does the best language model (minimizer of cross-entropy) learn?
Language Modeling: Softmax
Language Model
Context: s
Features f(s) ∈ ℝ^d
Word embeddings Φ ∈ ℝ^{d×V}
Softmax dist. p_{f(s)} = softmax(Φ^⊤ f(s)) ∈ Δ_V
True dist. p*_{·|s} ∈ Δ_V
Objective: min_{f, Φ} ℓ_xent(p_{f(s)})
Can we still learn p_{f(s)} = p*_{·|s} exactly when d < V?
Optimal solution: For fixed Φ, the f* that minimizes ℓ_xent satisfies Φ p_{f*(s)} = Φ p*_{·|s}
Proof: Use first-order condition (gradient = 0): ∇_{f(s)} KL(p*_{·|s}, p_{f(s)}) = Φ p_{f(s)} − Φ p*_{·|s}
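Filling in the gradient computation (standard softmax calculus; φ_w denotes the column of Φ for word w): with p_{f(s)} = softmax(Φ^⊤ f(s)),
∇_{f(s)} E_{w∼p*_{·|s}}[−log p_{f(s)}(w)] = −Σ_w p*_{·|s}(w) φ_w + Σ_w p_{f(s)}(w) φ_w = Φ p_{f(s)} − Φ p*_{·|s},
and setting this to zero yields Φ p_{f(s)} = Φ p*_{·|s}.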
Only guaranteed to learn p*_{·|s} on the d-dimensional subspace spanned by Φ
LMs trained with cross-entropy aim to learn p*_{·|s}
Softmax LMs with word embeddings Φ can only be guaranteed to learn Φ p*_{·|s}
Outline
• Language modeling: Cross-entropy and Softmax-parametrized LMs
• Downstream tasks: Sentence completion reformulation
• Formal guarantees: ε-optimal LM → 𝒪(√ε)-good on task
• Extensions, discussions and future work
Classification task → Sentence completion
• Binary classification task 𝒯, e.g. {("I would recommend this movie.", +1), …, ("It was an utter waste of time.", −1)}
• Language models aim to learn p*_{·|s} (or its projection onto a subspace). Can p*_{·|s} even help solve 𝒯?
I would recommend this movie. ___
Sentence completion view: p*_{·|s}(☺) − p*_{·|s}(☹) > 0
Equivalently, let v = (…, +1, …, −1, …, 0, …)^⊤ (+1 at ☺, −1 at ☹, 0 on all other words such as "the") and view p*_{·|s} = (…, p*_{·|s}(☺), …, p*_{·|s}(☹), …)^⊤ as a vector over the vocabulary:
v^⊤ p*_{·|s} > 0, a linear classifier over p*_{·|s}
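A toy numerical sketch of this reformulation (my own illustration; the five-word vocabulary and all probabilities are made up): the classifier is a fixed vector v over the vocabulary, and the prediction is the sign of v^⊤ p*_{·|s}:

# Toy illustration (made-up numbers): classification via v^T p > 0, where p is
# the next-word distribution and v is a fixed vector over the vocabulary.
import numpy as np

vocab = [":)", ":(", "the", "a", "movie"]
v = np.array([+1.0, -1.0, 0.0, 0.0, 0.0])   # +1 on positive completion, -1 on negative

# Hypothetical p*_{.|s} for two contexts ending just before the completion word.
p_pos = np.array([0.30, 0.05, 0.25, 0.25, 0.15])   # "I would recommend this movie. ___"
p_neg = np.array([0.04, 0.36, 0.25, 0.20, 0.15])   # "It was an utter waste of time. ___"

for name, p in [("positive review", p_pos), ("negative review", p_neg)]:
    score = float(v @ p)
    print(f"{name}: sign(v.p) = {'+1' if score > 0 else '-1'} (v.p = {score:+.2f})")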
Classification task → Sentence completion
Prompt: add "This movie was" to get "I would recommend this movie. This movie was ___"
[Figure: entries of p*_{·|s} for positive completion words (e.g. "good", "excellent") and negative completion words, paired with classifier weights v = (0, 2, 4, …, −3, −2, 0)^⊤]
v^⊤ p*_{·|s} > 0
Allows for a larger set of words that are grammatically correct completions
Extendable to other classification tasks (e.g., topic classification)
Experimental verification
• Verify the sentence completion intuition (p*_{·|s} can solve a task)
• Task: SST, a movie review sentiment classification task
• Learn a linear classifier on a subset of words of p_{f(s)} from a pretrained LM* (pipeline sketched below the table)
• With prompt: "This movie is"
*Used GPT-2 (117M parameters)
Task (classes) | p_{f(s)} (k words) | p_{f(s)} (~20 words) | f(s) (768 dim) | f_rand(s) (768 dim) | Bag-of-words
SST (2)        | 76.4               | 78.2                 | 87.6           | 58.1                | 80.7
SST* (2)       | 79.4               | 83.5                 | 89.5           | 56.7                | -
(* = with prompt. p_{f(s)} (k words): probabilities of a few completion words such as "good", "great", …, "bad". f(s): features from the LM. f_rand(s): features from a randomly initialized LM. Bag-of-words: non-LM baseline.)
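A sketch of this kind of pipeline with the Hugging Face transformers library (my reconstruction, not the authors' code; the four indicator words, the two-example training set, and the prompt handling are illustrative assumptions): append the prompt, read off the next-word distribution p_{f(s)} from GPT-2, keep the probabilities of a few completion words as features, and fit a linear classifier:

# Illustrative reconstruction (not the paper's code): prompt-based features
# from GPT-2's next-word distribution for a sentiment classifier.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from sklearn.linear_model import LogisticRegression

tok = GPT2Tokenizer.from_pretrained("gpt2")            # 117M-parameter GPT-2
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Small, hand-picked completion words (illustrative; the paper uses its own subsets).
words = [" good", " great", " bad", " boring"]
word_ids = [tok.encode(w)[0] for w in words]

def features(review, prompt=" This movie is"):
    ids = tok(review + prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits                   # (1, seq_len, vocab)
    p = logits[0, -1].softmax(-1)                      # p_{f(s)}: next-word distribution
    return p[word_ids].numpy()                         # restrict to the chosen words

# Tiny made-up labeled set standing in for SST.
train = [("I would recommend this movie.", 1),
         ("It was an utter waste of time.", 0)]
X = [features(s) for s, _ in train]
y = [label for _, label in train]
clf = LogisticRegression().fit(X, y)
print(clf.predict([features("A delightful and moving film.")]))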
Classification tasks can be rephrased as sentence completion problems
This is the same as solving the task using a linear classifier on p*_{·|s}, i.e. v^⊤ p*_{·|s} > 0
Outline
• Language modeling: Cross-entropy and Softmax-parametrized LMs
• Downstream tasks: Sentence completion reformulation
• Formal guarantees: ε-optimal LM → 𝒪(√ε)-good on task
• Extensions, discussions and future work
Natural task
min_v ℓ_𝒯(p*_{·|s}, v) ≤ τ  (τ-natural task 𝒯)
τ captures how "natural" (amenable to sentence completion reformulation) the classification task is
Sentence completion reformulation → Can solve using v^⊤ p*_{·|s} > 0
For a feature map g and classifier v: ℓ_𝒯(g, v) = E_{(s,y)}[logistic-loss(v^⊤ g(s), y)]
Example 𝒯: "It was an utter waste of time." → "Negative"; "I would recommend this movie." → "Positive"
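For completeness, the logistic loss above is the standard one for labels y ∈ {±1}: logistic-loss(z, y) = log(1 + exp(−y·z)). So ℓ_𝒯(g, v) = E_{(s,y)∼𝒯}[log(1 + exp(−y · v^⊤ g(s)))], and a task is τ-natural when some classifier v achieves ℓ_𝒯(p*_{·|s}, v) ≤ τ.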
Main Result
Assumptions:
1. f is an LM that is ε-optimal in cross-entropy (does well on next word prediction)
2. 𝒯 is a τ-natural task (fits the sentence completion view)
3. Word embeddings Φ are nice (assign similar embeddings to synonyms)
Then the logistic regression loss of the d-dimensional features Φ p_{f(s)} satisfies
ℓ_𝒯(Φ p_{f(s)}) ≤ τ + 𝒪(√ε)
where τ is the naturalness of the task (sentence completion view).
Why can language models that do well on the cross-entropy objective learn features that are useful for linear classification tasks?
Loss due to suboptimality of the LM: ε = ℓ_xent(p_{f(s)}) − ℓ_xent(p*_{·|s})
Main Result: closer look
Why can language models that do well on the cross-entropy objective learn features that are useful for linear classification tasks?
Guarantees for an LM f that is ε-optimal in cross-entropy:
Use the output probabilities Φ p_{f(s)} as d-dimensional features
Upper bound on logistic regression loss for natural classification tasks:
ℓ_𝒯(Φ p_{f(s)}) ≤ τ + 𝒪(√ε)
Conditional mean features: Φ p_{f(s)} = Σ_w p_{f(s)}(w) φ_w, a weighted average of word embeddings

Task (classes) | p_{f(s)} (k words) | p_{f(s)} (~20 words) | Φ p_{f(s)} (768 dim) | f(s) (768 dim)
SST (2)        | 76.4               | 78.2                 | 82.6                 | 87.6
SST* (2)       | 79.4               | 83.5                 | 87.0                 | 89.5
AG News (4)    | 68.4               | 78.3                 | 84.5                 | 90.7
AG News* (4)   | 71.4               | 83.0                 | 88.0                 | 91.1
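A short sketch of computing the conditional mean features Φ p_{f(s)} (my illustration; it uses GPT-2's tied input/output embedding matrix to play the role of Φ, and the prompt string is just an example):

# Conditional mean features Phi p_{f(s)}: probability-weighted average of word
# embeddings (illustrative sketch using GPT-2's tied embedding matrix as Phi).
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def conditional_mean_features(text):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits
        p = logits[0, -1].softmax(-1)          # p_{f(s)}: next-word distribution
        phi = model.transformer.wte.weight     # word embeddings Phi, shape (V, 768)
        return p @ phi                         # sum_w p_{f(s)}(w) * phi_w, a 768-dim vector

feat = conditional_mean_features("I would recommend this movie. This movie is")
print(feat.shape)                              # torch.Size([768])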
[Plot: task loss ℓ_𝒯(Φ p_{f(s)}) against cross-entropy ℓ_xent(p_{f(s)}): observing the ε dependence in practice]
New way to extract d-dimensional features from an LM
ℓ_𝒯(Φ p_{f(s)}) ≤ τ + 𝒪(√ε)
Main take-aways
• Classification tasks → Sentence completion → Solve using v^⊤ p*_{·|s} > 0
• An ε-optimal language model will do 𝒪(√ε) well on such tasks
• Softmax models can hope to learn Φ p*_{·|s}
  • Good to assign similar embeddings to synonyms
• Conditional mean features Φ p_{f(s)}
  • Mathematically motivated way to extract d-dimensional features from LMs
More in paper
• Connection between f(s) and Φ p_{f(s)}
• Use insights to design a new objective, an alternative to cross-entropy
• Detailed bounds capture other intuitions
Future work
• Understand why f(s) does better than Φ p_{f(s)} in practice
• Bidirectional and masked language models (BERT and variants)
  • Theory applies when there is just one masked token
• Diverse set of NLP tasks
  • Does the sentence completion view extend? Other insights?
• Role of finetuning, inductive biases
  • Needs more empirical exploration
• Self-supervised learning
Thank you!
• Happy to take questions
• Feel free to email: [email protected]
• ArXiv: https://arxiv.org/abs/2010.03648