A Mathematical Exploration of Language Models
Nikunj Saunshi, Princeton University
Center of Mathematical Sciences and Applications, Harvard University, 10th February 2021
Language Models
Language Model
Context s = "I went to the café and ordered a"
Predicted distribution p_{·|s} over the next word
[Figure: bar chart of predicted next-word probabilities (0.3, …, 0.2, 0.05, 0.0001), with p_{·|s}("latte") high, p_{·|s}("bagel") moderate, and p_{·|s}("dolphin") negligible]
Next word prediction: for a context s, predict which word w follows it
Cross-entropy objective: assign high p_{·|s}(w) to observed (s, w) pairs, i.e. minimize E_{(s,w)}[−log p_{·|s}(w)]
Unlabeled data: generate (s, w) pairs from raw sentences
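A minimal sketch (toy setup, not from the talk) of this pipeline: turning unlabeled sentences into (context, next word) pairs and scoring a model by the empirical cross-entropy E_{(s,w)}[−log p_{·|s}(w)]. The helper names and the toy vocabulary are assumptions for illustration.

```python
# Sketch: build (s, w) pairs from raw text and evaluate cross-entropy of a model p(w|s).
import math

def make_pairs(sentences):
    """Turn unlabeled sentences into (context s, next word w) training pairs."""
    pairs = []
    for sent in sentences:
        words = sent.split()
        for i in range(1, len(words)):
            pairs.append((" ".join(words[:i]), words[i]))
    return pairs

def cross_entropy(model, pairs):
    """Empirical E_{(s,w)}[-log p(w|s)]; `model(s)` returns a dict word -> probability."""
    eps = 1e-12  # guard against log(0)
    return sum(-math.log(model(s).get(w, 0.0) + eps) for s, w in pairs) / len(pairs)

# Toy usage: a uniform "model" over a tiny assumed vocabulary does badly.
sentences = ["i went to the cafe and ordered a latte"]
pairs = make_pairs(sentences)
uniform = lambda s: {w: 0.25 for w in ["latte", "bagel", "tea", "dolphin"]}
print(cross_entropy(uniform, pairs))  # large: the uniform model misses most next words
```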
Success of Language Models
Language Model: Context s → Distribution p_{·|s}
Architecture: Transformer, Parameters: 175 B
Architecture: Transformer, Parameters: 1542 M
Architecture: RNN, Parameters: 24 M
Train using the cross-entropy objective
Downstream tasks:
• Text generation: "It was a bright sunny day in …"
• Question answering: "The capital of Spain is __"
• Machine translation: "I bought coffee" → "J'ai acheté du café"
• Sentence classification: "Science" vs "Politics"
Main Question
Why should solving the next word prediction task help solve seemingly unrelated downstream tasks with very little labeled data?
Rest of the talk
• A more general framework: "solving Task A helps with Task B"
• Our results for language models, based on the recent paper "A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks" (Saunshi, Malladi, Arora), to appear in ICLR 2021
Solving Task A helps with Task B
• Humans can use the "experience" and "skills" acquired from Task A to solve a new Task B efficiently
• Language modeling: Task A is next word prediction; Task B is a downstream NLP task
Ride Bicycle → Ride Motorcycle
Get a Math degree → Do well in law school later
Do basic chores → Excel at Karate (The Karate Kid)
• Adapted in machine learning
• More data efficient than supervised learning: requires fewer labeled samples than solving Task B from scratch with supervised learning
Stage 1: Pretrain a model on Task A
Stage 2: Use the model on Task B, either by initializing from the pretrained model and fine-tuning with labeled data, or by extracting features and learning a classifier with labeled data
Other innovative ways of using pretrained models exist
Solving Task A helps with Task B
Language modeling: Task A is next word prediction; Task B is a downstream NLP task
• Transfer learning: Task A is a large supervised learning problem (e.g. ImageNet); Task B is object detection or disease detection from X-ray images. Requires some labeled data for Task A.
• Meta-learning: Task A is many small tasks related to Task B; Task B is a related task (e.g. classify characters from a new language). Requires some labeled data for Task A.
• Self-supervised learning (e.g. language modeling): Task A is constructed from unlabeled data; Task B is a downstream task of interest. Requires only unlabeled data for Task A.
Solving Task A helps with Task B
"This is the single most important problem to solve in AI today" - Yann LeCun
https://www.wsj.com/articles/facebook-ai-chief-pushes-the-technologys-limits-11597334361
Self-Supervised Learning
Motivated by the following observations:
• Humans learn by observing and interacting with the world, without explicit supervision
• Supervised learning with labels is successful, but human annotations can be expensive
• Unlabeled data is available in abundance and is cheap to obtain
• Many practical algorithms following this principle do well on standard benchmarks, sometimes beating even supervised learning!
Principle: use unlabeled data to generate labels and construct supervised learning tasks
Self-Supervised Learning
Examples in practice
• Images (just need raw images):
  • Predict the color of an image from its b/w version
  • Reconstruct part of an image from the rest of it
  • Predict the rotation applied to an image
• Text (just need a large text corpus):
  • Make representations of consecutive sentences in Wikipedia close
  • Next word prediction
  • Fill in multiple blanks in a sentence
Task A: constructed from unlabeled data; Task B: downstream task of interest
Theory for Self-Supervised Learning
• We have very little mathematical understanding of this important problem
• Theory can potentially help:
  • Formalize notions of "skill learning" from tasks
  • Ground existing intuitions in math
  • Give new insights that can improve or help design practical algorithms
• Existing theoretical frameworks fail to capture this setting:
  • Task A and Task B are very different
  • Task A is agnostic to Task B
• We try to gain some understanding for one such method: language modeling
A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks
Saunshi, Malladi, Arora, to appear in ICLR 2021
Theory for Language Models
Task A: next word prediction; Task B: downstream NLP task
Language Model: Context s = "I went to the café and ordered a" → Distribution over words p_{·|s}
Why should solving the next word prediction task help solve seemingly unrelated downstream tasks with very little labeled data?
Stage 1: Pretrain the language model on next word prediction
Stage 2: Use the language model for the downstream task
Theoretical setting
How to use a pretrained model? Representation learning perspective:
  ✓ Extract features from the LM and learn linear classifiers: effective, data-efficient, can do math
  ✗ Finetuning: hard to quantify its benefit using current deep learning theory
What aspects of pretraining help? Role of task and objective:
  ✓ Why next word prediction (with the cross-entropy objective) intrinsically helps
  ✗ Inductive biases of architecture/algorithm: current tools are insufficient
What are downstream tasks? Sentence classification:
  ✓ First-cut analysis; already gives interesting insights
  ✗ Other NLP tasks (question answering, etc.)
Language Model: Context s → Distribution p_{·|s}
Theoretical setting
Why can language models that do well on the cross-entropy objective learn features that are useful for linear classification tasks?
Language Model: Context s → Distribution p_{·|s} → extract d-dimensional features f(s) → learn a linear classifier
[Figure: e.g. "I would recommend this movie." → "Positive"; "It was an utter waste of time." → "Negative"]
Result overview
Key idea: classification tasks can be rephrased as sentence completion problems, thus making next word prediction a meaningful pretraining task
Formalization: show that an LM that is ε-optimal in cross-entropy learns features that linearly solve such tasks up to O(√ε)
Verification: experimentally verify theoretical insights (also design a new objective function)
Why can language models that do well on the cross-entropy objective learn features that are useful for linear classification tasks?
Outline
• Language modeling: cross-entropy and softmax-parametrized LMs
• Downstream tasks: sentence completion reformulation
• Formal guarantees: ε-optimal LM ⇒ O(√ε)-good on task
• Extensions, discussions and future work
Language Modeling: Cross-entropy
Language Model: Context s = "I went to the café and ordered a" → Predicted distribution p_{·|s} ∈ ℝ^V
True distribution p*_{·|s} ∈ ℝ^V
[Figure: predicted probabilities (0.3, …, 0.2, 0.05, 0.0001) alongside true probabilities (0.35, …, 0.18, 0.047, 0.00005)]
ℓ_xent(p_{·|s}) = E_{(s,w)}[−log p_{·|s}(w)]
Optimal solution: the minimizer of ℓ_xent(p_{·|s}) is p_{·|s} = p*_{·|s}
Proof: the cross-entropy can be rewritten as ℓ_xent(p_{·|s}) = E_s[KL(p*_{·|s}, p_{·|s})] + C
Training pairs (s, w) are samples from the true distribution p*_{·|s}. What does the best language model (the minimizer of cross-entropy) learn?
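A minimal numerical check (toy setup, not the paper's code) of the claim above: cross-entropy decomposes as E_s[KL(p*_{·|s}, p_{·|s})] plus a constant, so it is minimized exactly at p = p*. The vocabulary size and distributions below are assumptions for illustration.

```python
# Check: xent(p) = KL(p*, p) + H(p*), hence p = p* is the unique minimizer.
import numpy as np

rng = np.random.default_rng(0)
V = 5                               # toy vocabulary size
p_star = rng.dirichlet(np.ones(V))  # "true" next-word distribution p*_{.|s} for one context s

def xent(p):
    """Cross-entropy E_{w ~ p*}[-log p(w)] for a single context s."""
    return -np.sum(p_star * np.log(p))

def kl(p, q):
    return np.sum(p * np.log(p / q))

p_other = rng.dirichlet(np.ones(V))           # some other candidate model
entropy = -np.sum(p_star * np.log(p_star))    # the constant C = H(p*)

assert np.isclose(xent(p_other), kl(p_star, p_other) + entropy)  # the decomposition
assert xent(p_star) <= xent(p_other)                             # p* is the minimizer
print(xent(p_star), xent(p_other))
```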
Language Modeling: Softmax
[Figure: true distribution p*_{·|s} ∈ ℝ^V (0.35, …, 0.18, 0.047, 0.00005)]
min_{f,Φ} ℓ_xent(p_{f(s)})
Optimal solution: for fixed Φ, the f* that minimizes ℓ_xent satisfies Φ p_{f*(s)} = Φ p*_{·|s}
Can we still learn p_{f(s)} = p*_{·|s} exactly when d < V?
Language Model: Context s → features f(s) ∈ ℝ^d → softmax over Φ^⊤ f(s) → softmax distribution p_{f(s)} ∈ ℝ^V
Word embeddings Φ ∈ ℝ^{d×V}; f(s) are the features
Proof: use the first-order condition (gradient = 0): ∇_θ KL(p*_{·|s}, p_θ) = Φ p_θ − Φ p*_{·|s}, so Φ p_θ = Φ p*_{·|s} at the optimum
Only guaranteed to learn p*_{·|s} on the d-dimensional subspace spanned by Φ
LMs trained with cross-entropy aim to learn p*_{·|s}
Softmax LMs with word embeddings Φ can only be guaranteed to learn Φ p*_{·|s}
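A small sketch (toy setup with assumed dimensions, not the paper's code) illustrating the softmax claim: when d < V the best logits Φ^⊤θ generally cannot reproduce p*_{·|s}, but at the optimum the model does match p* on the subspace of Φ, i.e. Φ p_θ = Φ p*_{·|s}.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
V, d = 10, 3                          # vocabulary size V, feature dimension d < V
Phi = rng.normal(size=(d, V))         # word embeddings (columns are phi_w)
p_star = rng.dirichlet(np.ones(V))    # true next-word distribution for one context

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def xent(theta):                      # cross-entropy of p_theta = softmax(Phi^T theta)
    return -p_star @ np.log(softmax(Phi.T @ theta))

theta_opt = minimize(xent, np.zeros(d)).x
p_theta = softmax(Phi.T @ theta_opt)

print(np.allclose(Phi @ p_theta, Phi @ p_star, atol=1e-4))  # True: matches on span(Phi)
print(np.allclose(p_theta, p_star, atol=1e-4))              # False: cannot match p* exactly
```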
Outline
• Language modeling: cross-entropy and softmax-parametrized LMs
• Downstream tasks: sentence completion reformulation
• Formal guarantees: ε-optimal LM ⇒ O(√ε)-good on task
• Extensions, discussions and future work
Classification task ⇒ Sentence completion
• Binary classification task 𝒯, e.g. {("I would recommend this movie.", +1), …, ("It was an utter waste of time.", −1)}
• Language models aim to learn p*_{·|s} (or its projection onto a subspace). Can p*_{·|s} even help solve 𝒯?
s = "I would recommend this movie. ___"
The label is determined by which completion is more likely: p*_{·|s}(":)") − p*_{·|s}(":(") > 0
Equivalently, v^⊤ p*_{·|s} > 0 for v = [+1, …, −1, …, 0]^⊤ (a +1 at ":)", a −1 at ":(", 0 elsewhere): a linear classifier over p*_{·|s} ∈ ℝ^V
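A minimal sketch (hypothetical toy vocabulary and probabilities) of this sentence-completion classifier: sentiment is read directly off the next-word distribution with a sparse linear rule v^⊤ p > 0.

```python
import numpy as np

vocab = [":)", ":(", "the", "movie", "was"]      # assumed toy vocabulary
v = np.array([+1.0, -1.0, 0.0, 0.0, 0.0])        # +1 at ":)", -1 at ":(", 0 elsewhere

def classify(p_next_word):
    """p_next_word: next-word probabilities over `vocab` for the context s."""
    return +1 if v @ p_next_word > 0 else -1

# "I would recommend this movie. ___": ":)" is a likelier completion than ":("
p = np.array([0.30, 0.05, 0.20, 0.25, 0.20])
print(classify(p))   # +1 (positive review)
```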
Classification task ⇒ Sentence completion
[Figure: the full vector p*_{·|s} ∈ ℝ^V for s = "I would recommend this movie. ___", with the coordinates p*_{·|s}(":)"), …, p*_{·|s}(":("), … highlighted]
Classification task ⇒ Sentence completion
Add a prompt: s = "I would recommend this movie. This movie was ___"
Many more completions now signal the label: p*_{·|s}("good"), p*_{·|s}("great"), …, vs. p*_{·|s}("boring"), p*_{·|s}("bad"), …
The classifier is again linear, v^⊤ p*_{·|s} > 0, with positive weights on positive completions and negative weights on negative ones
Allows for a larger set of words that are grammatically correct completions
Extendable to other classification tasks (e.g., topic classification)
Experimental verification
• Verify the sentence completion intuition (p*_{·|s} can solve a task)
• Task: SST, a movie review sentiment classification task (2 classes)
• Learn a linear classifier on coordinates of p_{f(s)} from a pretrained LM* (*GPT-2, 117M parameters); SST* uses the prompt "This movie is"

Task | v^⊤ p_{f(s)} (subset of words) | p_{f(s)} (~20 words) | f(s) (768 dim) | f(s), random-init LM (768 dim) | Bag-of-words (non-LM baseline)
SST | 76.4 | 78.2 | 87.6 | 58.1 | 80.7
SST* (with prompt) | 79.4 | 83.5 | 89.5 | 56.7 | -

(The word subset for v^⊤ p_{f(s)} consists of hand-picked completions such as ":)", ":(", "good", "great", …, "bad". The first three columns use features from the LM; the fourth uses a randomly initialized LM.)
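A hedged sketch of this experiment (assumes the HuggingFace transformers library; the completion words and dataset loading are illustrative, not the paper's code): extract the next-word probabilities p_{f(s)} from GPT-2 over a few hand-picked words and fit a linear classifier on a labeled sentiment set.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from sklearn.linear_model import LogisticRegression

tok = GPT2Tokenizer.from_pretrained("gpt2")           # 117M-parameter GPT-2
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

class_words = [" good", " great", " bad", " boring"]  # illustrative completion words
word_ids = [tok.encode(w)[0] for w in class_words]

@torch.no_grad()
def completion_probs(sentence, prompt=" This movie is"):
    """Return p_{f(s)} restricted to `class_words` for context s = sentence + prompt."""
    ids = tok.encode(sentence + prompt, return_tensors="pt")
    logits = lm(ids).logits[0, -1]                    # next-word logits at the last position
    return logits.softmax(-1)[word_ids].numpy()

# train_sents, train_labels: a small labeled set (e.g. SST), assumed loaded elsewhere
# X = [completion_probs(s) for s in train_sents]
# clf = LogisticRegression().fit(X, train_labels)     # linear classifier on p_{f(s)} features
```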
Classification tasks can be rephrased as sentence completion problems
This is the same as solving the task with a linear classifier on p*_{·|s}, i.e. v^⊤ p*_{·|s} > 0
Outline
• Language modeling: cross-entropy and softmax-parametrized LMs
• Downstream tasks: sentence completion reformulation
• Formal guarantees: ε-optimal LM ⇒ O(√ε)-good on task
• Extensions, discussions and future work
Natural task
τ-natural task 𝒯: min_v ℓ_𝒯({p*_{·|s}}, v) ≤ τ
τ captures how "natural" (amenable to the sentence completion reformulation) the classification task is
Sentence completion reformulation ⇒ can solve using v^⊤ p*_{·|s} > 0
For a D-dimensional feature map g and classifier v: ℓ_𝒯(g, v) = E_{(s,y)}[logistic-loss(v^⊤ g(s), y)]
[Figure: a feature map g applied to labeled sentences, e.g. "I would recommend this movie." (Positive) and "It was an utter waste of time." (Negative)]
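A minimal sketch (toy data, not the paper's code) of this definition: ℓ_𝒯(g, v) is the expected logistic loss of the linear classifier v over the features g(s), and the task is τ-natural if some v over the true distributions {p*_{·|s}} achieves loss at most τ. The synthetic task below is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic_loss(G, y, v):
    """Empirical l_T(g, v): G is an (n, D) matrix of features g(s), y in {-1, +1}."""
    return np.mean(np.log1p(np.exp(-y * (G @ v))))

# Toy task: rows of G play the role of p*_{.|s}; the label depends on two "completion" coordinates.
rng = np.random.default_rng(2)
G = rng.dirichlet(np.ones(5), size=200)
y = np.where(G[:, 0] > G[:, 1], 1, -1)

clf = LogisticRegression(fit_intercept=False, C=1e4).fit(G, y)
tau = logistic_loss(G, y, clf.coef_.ravel())   # small tau -> the task is "natural"
print(tau)
```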
Main Result
Why can language models that do well on the cross-entropy objective learn features that are useful for linear classification tasks?
If
1. f is an LM that is ε-optimal in cross-entropy (does well on next word prediction), where ε = ℓ_xent(p_{f(s)}) − ℓ_xent({p*_{·|s}}) is the loss due to the suboptimality of the LM,
2. 𝒯 is a τ-natural task (fits the sentence completion view), and
3. the word embeddings Φ are nice (assign similar embeddings to synonyms),
then the logistic regression loss of the d-dimensional features Φ p_{f(s)} satisfies ℓ_𝒯(Φ p_{f(s)}) ≤ τ + O(√ε), where τ reflects the naturalness of the task (sentence completion view).
Main Result: closer look
Why can language models that do well on the cross-entropy objective learn features that are useful for linear classification tasks?
• Guarantees for an LM f that is ε-optimal in cross-entropy
• Use the output probabilities Φ p_{f(s)} as d-dimensional features
• Upper bound on the logistic regression loss for natural classification tasks: ℓ_𝒯(Φ p_{f(s)}) ≤ τ + O(√ε)
Conditional mean features
Φ p_{f(s)} = Σ_w p_{f(s)}(w) φ_w: a weighted average of the word embeddings

Task | Classes | v^⊤ p_{f(s)} (subset of words) | p_{f(s)} (~20 words) | Φ p_{f(s)} (768 dim) | f(s) (768 dim)
SST | 2 | 76.4 | 78.2 | 82.6 | 87.6
SST* | 2 | 79.4 | 83.5 | 87.0 | 89.5
AG News | 4 | 68.4 | 78.3 | 84.5 | 90.7
AG News* | 4 | 71.4 | 83.0 | 88.0 | 91.1
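A hedged sketch (assumes the HuggingFace transformers library; not the paper's code) of the conditional mean features Φ p_{f(s)} = Σ_w p_{f(s)}(w) φ_w: the predicted next-word distribution is used to take a probability-weighted average of the output word embeddings.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
Phi = lm.get_output_embeddings().weight          # (V, d) output word embeddings

@torch.no_grad()
def conditional_mean_features(sentence):
    """d-dimensional feature Phi p_{f(s)} for context s = sentence."""
    ids = tok.encode(sentence, return_tensors="pt")
    p = lm(ids).logits[0, -1].softmax(-1)        # p_{f(s)} over the vocabulary, shape (V,)
    return p @ Phi                               # (d,): probability-weighted average of embeddings

feat = conditional_mean_features("I would recommend this movie. This movie is")
print(feat.shape)                                # torch.Size([768]) for GPT-2
```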
[Figure: ℓ_𝒯(Φ p_{f(s)}) plotted against ℓ_xent(p_{f(s)}): observing the √ε dependence in practice]
New way to extract d-dimensional features from an LM: ℓ_𝒯(Φ p_{f(s)}) ≤ τ + O(√ε)
Main take-aways
• Classification tasks ≈ sentence completion ⇒ solve using v^⊤ p*_{·|s} > 0
• An ε-optimal language model will do O(√ε) well on such tasks
• Softmax models can hope to learn Φ p*_{·|s}; good to assign similar embeddings to synonyms
• Conditional mean features Φ p_{f(s)}: a mathematically motivated way to extract d-dimensional features from LMs
More in paper
• Connection between f(s) and Φ p_{f(s)}
• Use the insights to design a new objective, an alternative to cross-entropy
• Detailed bounds capture other intuitions
Future work
• Understand why f(s) does better than Φ p_{f(s)} in practice
• Bidirectional and masked language models (BERT and variants): the theory applies when just one token is masked
• Diverse set of NLP tasks: does the sentence completion view extend? Other insights?
• Role of finetuning and inductive biases: needs more empirical exploration
โข Self-supervised learning
Thank you!
• Happy to take questions
• Feel free to email: nsaunshi@cs.princeton.edu
• arXiv: https://arxiv.org/abs/2010.03648