A Mathematical Exploration of Language Models


Nikunj Saunshi, Princeton University

Center of Mathematical Sciences and Applications, Harvard University
10th February 2021

Language Models

[Figure: a Language Model maps a context s = "I went to the café and ordered a" to a distribution p_{·|s} over the vocabulary; e.g. p_{·|s}("latte") and p_{·|s}("bagel") are large (roughly 0.3 and 0.2), while p_{·|s}("dolphin") is tiny (roughly 0.0001)]

Next word prediction: for a context s, predict which word w follows it

Cross-entropy objective: assign high p_{·|s}(w) to observed (s, w) pairs

ℓ_xent = E_{(s,w)}[−log p_{·|s}(w)]

Unlabeled data: generate (s, w) pairs from raw sentences
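To make this pipeline concrete, here is a minimal sketch (not from the talk; the two-sentence corpus and the bigram model are illustrative assumptions) of generating (s, w) pairs from unlabeled sentences and evaluating the cross-entropy objective E_{(s,w)}[−log p_{·|s}(w)]:

```python
# Minimal sketch (illustrative, not from the talk): build (s, w) pairs from unlabeled
# sentences and evaluate the cross-entropy objective of a toy bigram language model.
import math
from collections import Counter, defaultdict

corpus = ["i went to the cafe and ordered a latte",
          "i went to the store and ordered a bagel"]

# Unlabeled data -> (s, w) pairs: every prefix s predicts the next word w.
pairs = []
for sentence in corpus:
    words = sentence.split()
    for i in range(1, len(words)):
        pairs.append((tuple(words[:i]), words[i]))

# Toy bigram LM: p_{·|s}(w) depends only on the last word of s (add-one smoothing).
vocab = {w for sent in corpus for w in sent.split()}
counts = defaultdict(Counter)
for s, w in pairs:
    counts[s[-1]][w] += 1

def p_next(s, w):
    c = counts[s[-1]]
    return (c[w] + 1) / (sum(c.values()) + len(vocab))

# Cross-entropy objective: E_{(s,w)}[-log p_{·|s}(w)], estimated on the (s, w) pairs.
xent = -sum(math.log(p_next(s, w)) for s, w in pairs) / len(pairs)
print(f"cross-entropy: {xent:.3f}")
```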

Success of Language Models

[Figure: language models of increasing scale, all mapping a context s to a distribution p_{·|s} and all trained using cross-entropy:
• Architecture: RNN, Parameters: 24M
• Architecture: Transformer, Parameters: 1542M
• Architecture: Transformer, Parameters: 175B]

Downstream tasks:
• Text Generation ("It was a bright sunny day in …")
• Question Answering ("The capital of Spain is __")
• Machine Translation ("I bought coffee" → "J'ai acheté du café")
• Sentence Classification ("Science" vs "Politics")

Main Question: Why should solving the next word prediction task help solve seemingly unrelated downstream tasks with very little labeled data?

Rest of the talk

More general framework of "solving Task A helps with Task B"

Our results for language models, based on the recent paper "A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks" (Saunshi, Malladi, Arora; to appear at ICLR 2021)

Solving Task A helps with Task B


• Humans can use the "experience" and "skills" acquired from Task A to learn a new Task B efficiently

Language Modeling. Task A: next word prediction → Task B: downstream NLP task

Analogies:
• Ride a bicycle → Ride a motorcycle
• Get a math degree → Do well in law school later
• Do basic chores → Excel at karate (The Karate Kid)

• Adapted in machine learning

• More data efficient than supervised learning
• Requires fewer labeled samples than solving Task B from scratch using supervised learning

Stage 1: Pretrain a model on Task A
Stage 2: Use the model on Task B
• Initialize with the pretrained model and fine-tune using labeled data
• Extract features and learn a classifier using labeled data
• Other innovative ways of using the pretrained model

Solving Task A helps with Task B: Language Modeling
Task A: next word prediction; Task B: downstream NLP task

• Transfer learning (requires some labeled data for Task A)
  • Task A: large supervised learning problem (ImageNet)
  • Task B: object detection, disease detection using X-ray images
• Meta-learning (requires some labeled data for Task A)
  • Task A: many small tasks related to Task B
  • Task B: related tasks (classify characters from a new language)
• Self-supervised learning, e.g. language modeling (requires only unlabeled data for Task A)
  • Task A: constructed using unlabeled data
  • Task B: downstream tasks of interest

Solving Task A helps with Task B

"This is the single most important problem to solve in AI today" - Yann LeCun

https://www.wsj.com/articles/facebook-ai-chief-pushes-the-technologys-limits-11597334361

Self-Supervised Learning

Motivated by the following observations:
• Humans learn by observing/interacting with the world, without explicit supervision
• Supervised learning with labels is successful, but human annotations can be expensive
• Unlabeled data is available in abundance and is cheap to obtain

• Many practical algorithms following this principle do well on standard benchmarks, sometimes beating even supervised learning!

Principle: Use unlabeled data to generate labels and construct supervised learning tasks

Self-Supervised Learning

Examples in practice:
• Images (just need raw images)
  • Predict the color of an image from its b/w version
  • Reconstruct part of an image from the rest of it
  • Predict the rotation applied to an image
• Text (just need a large text corpus)
  • Make representations of consecutive Wikipedia sentences close
  • Next word prediction
  • Fill in multiple blanks in a sentence

Task A: constructed from unlabeled data; Task B: downstream task of interest

Theory for Self-Supervised Learning

• We have very little mathematical understanding of this important problem.

• Theory can potentially help
  • Formalize notions of "skill learning" from tasks
  • Ground existing intuitions in math
  • Give new insights that can improve/design practical algorithms

• Existing theoretical frameworks fail to capture this setting
  • Task A and Task B are very different
  • Task A is agnostic to Task B

โ€ข We try to gain some understanding for one such method: language modeling

A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks (Saunshi, Malladi, Arora; to appear at ICLR 2021)

Theory for Language Models
Task A: next word prediction; Task B: downstream NLP task

[Figure: the Language Model maps a context s = "I went to the café and ordered a" to a distribution over words p_{·|s}]

Why should solving the next word prediction task help solve seemingly unrelated downstream tasks with very little labeled data?

Stage 1: Pretrain the Language Model on next word prediction
Stage 2: Use the Language Model for the downstream task

Theoretical setting

How to use a pretrained model? Representation learning perspective
✓ Extract features from the LM and learn linear classifiers: effective, data-efficient, and we can do the math
✘ Finetuning: hard to quantify its benefit using current deep learning theory

What aspects of pretraining help? Role of the task & objective
✓ Why next word prediction (with the cross-entropy objective) intrinsically helps
✘ Inductive biases of architecture/algorithm: current tools are insufficient

What are downstream tasks? Sentence classification
✓ First-cut analysis; already gives interesting insights
✘ Other NLP tasks (question answering, etc.)

Theoretical setting

Why can language models that do well on the cross-entropy objective learn features that are useful for linear classification tasks?

[Figure: extract d-dimensional features f(s) from the Language Model and learn a classifier on them, e.g. "I would recommend this movie." → "Positive", "It was an utter waste of time." → "Negative"]

Result overview

Key idea: Classification tasks can be rephrased as sentence completion problems, thus making next word prediction a meaningful pretraining task

Formalization: Show that an LM that is ε-optimal in cross-entropy learns features that linearly solve such tasks up to O(√ε)

Verification: Experimentally verify the theoretical insights (and also design a new objective function)

Why can language models that do well on the cross-entropy objective learn features that are useful for linear classification tasks?

Outline

• Language modeling
  • Cross-entropy and softmax-parametrized LMs
• Downstream tasks
  • Sentence completion reformulation
• Formal guarantees
  • ε-optimal LM ⇒ O(√ε)-good on the task

โ€ข Extensions, discussions and future work


Language Modeling: Cross-entropy

[Figure: for the context s = "I went to the café and ordered a", the Language Model outputs a predicted distribution p_{·|s} ∈ ℝ^V, compared against the true distribution p*_{·|s} ∈ ℝ^V estimated from samples]

ℓ_xent({p_{·|s}}) = E_{(s,w)}[−log p_{·|s}(w)]

Optimal solution: the minimizer of ℓ_xent({p_{·|s}}) is p_{·|s} = p*_{·|s}

Proof: can rewrite the cross-entropy as ℓ_xent({p_{·|s}}) = E_s[KL(p*_{·|s}, p_{·|s})] + C

What does the best language model (the minimizer of cross-entropy) learn?
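A quick numerical check of this decomposition, as a minimal sketch with an assumed toy vocabulary of three words and two contexts (illustrative numbers, not from the talk):

```python
# Minimal sketch: verify ℓ_xent({p_{·|s}}) = E_s[KL(p*_{·|s}, p_{·|s})] + C on toy data,
# so that p_{·|s} = p*_{·|s} minimizes the cross-entropy.
import numpy as np

p_star = {"s1": np.array([0.7, 0.2, 0.1]),   # true next-word distributions p*_{·|s}
          "s2": np.array([0.1, 0.3, 0.6])}
ctx_prob = {"s1": 0.5, "s2": 0.5}            # distribution over contexts s

def xent(p):   # ℓ_xent = E_s E_{w ~ p*_{·|s}}[-log p_{·|s}(w)]
    return sum(ctx_prob[s] * -(p_star[s] * np.log(p[s])).sum() for s in p_star)

def kl(a, b):
    return (a * np.log(a / b)).sum()

q = {"s1": np.array([0.5, 0.3, 0.2]),        # some other candidate model
     "s2": np.array([0.2, 0.2, 0.6])}

C = xent(p_star)                              # entropy term, independent of the model
assert np.isclose(xent(q), sum(ctx_prob[s] * kl(p_star[s], q[s]) for s in p_star) + C)
print(xent(q) >= xent(p_star))                # True: p* is the minimizer
```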

Language Modeling: Softmax

[Figure: the Language Model computes features f(s) ∈ ℝ^d for the context s and applies a softmax on Φᵀ f(s), where Φ ∈ ℝ^{d×V} holds the word embeddings, giving the softmax distribution p_{f(s)} ∈ ℝ^V, compared against the true distribution p*_{·|s} ∈ ℝ^V]

Objective: min_{f,Φ} ℓ_xent({p_{f(s)}})

Optimal solution: for a fixed Φ, any f* that minimizes ℓ_xent satisfies Φ p_{f*(s)} = Φ p*_{·|s}

Can we still learn p_{f(s)} = p*_{·|s} exactly when d < V?

Proof: use the first-order condition (gradient = 0): ∇_{f(s)} KL(p*_{·|s}, p_{f(s)}) = Φ p_{f(s)} − Φ p*_{·|s}, which vanishes exactly when Φ p_{f(s)} = Φ p*_{·|s}

Only guaranteed to learn p*_{·|s} on the d-dimensional subspace spanned by Φ

LMs trained with cross-entropy aim to learn p*_{·|s}
Softmax LMs with word embeddings Φ can only be guaranteed to learn Φ p*_{·|s}
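The following minimal sketch (toy dimensions, random embeddings, single context; purely illustrative) fits a softmax LM by gradient descent and checks that at the optimum Φ p_{f(s)} matches Φ p*_{·|s}, even though p_{f(s)} itself need not equal p*_{·|s} when d < V:

```python
# Minimal sketch: a softmax LM p_{f(s)} = softmax(Φᵀ f(s)) with d < V. Gradient descent
# on the cross-entropy matches Φ p_{f(s)} to Φ p*_{·|s}, not p_{f(s)} to p*_{·|s}.
import numpy as np

rng = np.random.default_rng(0)
d, V = 2, 4
Phi = rng.normal(size=(d, V))                 # word embeddings Φ ∈ ℝ^{d×V}
p_star = np.array([0.5, 0.3, 0.15, 0.05])     # true next-word distribution p*_{·|s}

theta = np.zeros(d)                           # θ plays the role of f(s) for this context
for _ in range(10000):
    logits = Phi.T @ theta
    p = np.exp(logits - logits.max()); p /= p.sum()    # p_{f(s)}
    grad = Phi @ (p - p_star)                 # ∇_θ KL(p*_{·|s}, p_θ) = Φ(p_θ - p*)
    theta -= 0.1 * grad

p = np.exp(Phi.T @ theta); p /= p.sum()
print(np.allclose(Phi @ p, Phi @ p_star, atol=1e-3))   # True: matches on the span of Φ
print(np.allclose(p, p_star, atol=1e-3))               # typically False when d < V
```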

Outline

• Language modeling
  • Cross-entropy and softmax-parametrized LMs
• Downstream tasks
  • Sentence completion reformulation
• Formal guarantees
  • ε-optimal LM ⇒ O(√ε)-good on the task

โ€ข Extensions, discussions and future work

Classification task → Sentence completion

• Binary classification task T. E.g. {("I would recommend this movie.", +1), …, ("It was an utter waste of time.", −1)}

• Language models aim to learn p*_{·|s} (or its projection onto a subspace). Can p*_{·|s} even help solve T?

I would recommend this movie. ___

The completion reveals the label: p*_{·|s}(☺) − p*_{·|s}(☹) > 0

Equivalently, with v = (+1, …, −1, …, 0)ᵀ (+1 at ☺, −1 at ☹, and 0 at every other word such as "The"):

vᵀ p*_{·|s} > 0, i.e. a linear classifier over p*_{·|s}
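A minimal sketch of this sentence completion view, with an assumed five-word vocabulary and made-up values for p*_{·|s} (nothing here comes from the paper's experiments):

```python
# Minimal sketch (toy 5-word vocabulary, illustrative numbers): classify by the sign of
# vᵀ p*_{·|s}, where v puts +1 on a positive completion, -1 on a negative one, 0 elsewhere.
import numpy as np

vocab = ["good", "bad", "the", "movie", "a"]
v = np.array([+1.0, -1.0, 0.0, 0.0, 0.0])          # linear classifier over p*_{·|s}

# Hypothetical true next-word distributions p*_{·|s} for two contexts s.
p_pos = np.array([0.30, 0.05, 0.40, 0.15, 0.10])   # s = "I would recommend this movie. ___"
p_neg = np.array([0.04, 0.35, 0.40, 0.11, 0.10])   # s = "It was an utter waste of time. ___"

for name, p in [("positive review", p_pos), ("negative review", p_neg)]:
    print(name, "->", "+1" if v @ p > 0 else "-1")
```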


Classification task → Sentence completion

I would recommend this movie. This movie was ___   (prompt "This movie was" added to the input)

Relevant completions now include p*_{·|s}("good"), p*_{·|s}("brilliant"), p*_{·|s}("rock"), … on the positive side and p*_{·|s}("garbage"), p*_{·|s}("boring"), … on the negative side, while irrelevant words like "hello" carry no signal.

Classify with a weighted linear classifier over p*_{·|s}, e.g. v = (0, 2, 4, …, −3, −2, 0)ᵀ:

vᵀ p*_{·|s} > 0

Allows for a larger set of words that are grammatically correct completions

Extendable to other classification tasks (e.g., topic classification)

Experimental verification

• Verify the sentence completion intuition (p*_{·|s} can solve a task)
  • Task: SST, a movie review sentiment classification task
  • Learn a linear classifier on a subset of coordinates (words) of p_{f(s)} from a pretrained LM*

With prompt: "This movie is"     (*the pretrained LM is GPT-2, 117M parameters)

๐‘˜ ๐‘&(#)(๐‘˜ words)

๐‘&(#)(~ 20 words)

๐‘“ ๐‘ (768 dim)

๐‘“)*)+ ๐‘ (768 dim)

Bag-of-words

SST 2 76.4 78.2 87.6 58.1 80.7

SST* 2 79.4 83.5 89.5 56.7 -

๐‘*(#) J

๐‘*(#) L

๐‘!(#) โ€๐‘”๐‘œ๐‘œ๐‘‘โ€๐‘!(#) โ€๐‘”๐‘Ÿ๐‘’๐‘Ž๐‘กโ€

โ€ฆ๐‘!(#) โ€๐‘๐‘œ๐‘Ÿ๐‘–๐‘›๐‘”โ€๐‘!(#) โ€๐‘๐‘Ž๐‘‘โ€

Features from LM

Features from random init LM

non-LM baseline

Classification tasks can be rephrased as sentence completion problems

This is the same as solving the task using a linear classifier on p*_{·|s}, i.e. vᵀ p*_{·|s} > 0
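As a rough illustration of this kind of experiment (not the authors' code; it assumes the Hugging Face transformers and scikit-learn packages, and the word list, prompt handling, and two-example "training set" below are my own illustrative choices):

```python
# Minimal sketch: word-subset features p_{f(s)} from a pretrained GPT-2, with a prompt,
# fed to a linear classifier. Dataset is stubbed; the real experiment uses SST.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")    # base pretrained GPT-2
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

PROMPT = " This movie is"                            # prompt appended to each review
WORDS = [" good", " great", " amazing", " boring", " bad", " terrible"]  # illustrative subset
word_ids = [tokenizer.encode(w)[0] for w in WORDS]   # first BPE token of each word

def word_subset_features(sentence):
    """p_{f(s)} restricted to a small word subset, for context s = sentence + prompt."""
    ids = tokenizer.encode(sentence + PROMPT, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids).logits[0, -1]            # next-word logits at the last position
    p = torch.softmax(logits, dim=-1)                # p_{f(s)} over the full vocabulary
    return p[word_ids].numpy()                       # restrict to the chosen words

# Toy "dataset"; in the real experiment these come from SST.
train = [("I would recommend this movie.", 1), ("It was an utter waste of time.", 0)]
X = [word_subset_features(s) for s, _ in train]
y = [label for _, label in train]
clf = LogisticRegression().fit(X, y)                 # linear classifier on p_{f(s)} features
print(clf.predict([word_subset_features("A truly brilliant film.")]))
```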

Outline

• Language modeling
  • Cross-entropy and softmax-parametrized LMs
• Downstream tasks
  • Sentence completion reformulation
• Formal guarantees
  • ε-optimal LM ⇒ O(√ε)-good on the task

โ€ข Extensions, discussions and future work

Natural task

τ-natural task T:  min_v ℓ_T({p*_{·|s}}, v) ≤ τ

๐œ captures how โ€naturalโ€ (amenable to sentence completion reformulation) the classification task is

Sentence completion reformulation ⇒ can solve using vᵀ p*_{·|s} > 0

For any ๐ท-dim feature map ๐‘” and classifier ๐‘ฃโ„“๐’ฏ ๐‘” ๐‘  , ๐‘ฃ = ๐”ผ(#,.)[logistic-loss ๐‘ฃ%๐‘”(๐‘ ), ๐‘ฆ ]

๐‘”

โ€œIt was an utter waste of time.โ€

โ€œI would recommend this movie.โ€

โ€œNegativeโ€

โ€œPositiveโ€
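A minimal sketch of this task loss on a toy setup (assumed three-word vocabulary and made-up distributions); the τ-natural definition takes the minimum of this quantity over v:

```python
# Minimal sketch: the task loss ℓ_T({g(s)}, v) = E_{(s,y)}[logistic-loss(vᵀ g(s), y)]
# evaluated with features g(s) = p*_{·|s} and the sentence-completion classifier v.
import numpy as np

def task_loss(G, y, v):
    """Average logistic loss of the linear classifier v on features G (one row per sentence)."""
    margins = y * (G @ v)                 # labels y ∈ {+1, -1}
    return np.mean(np.log1p(np.exp(-margins)))

# Toy example: g(s) = p*_{·|s} over a 3-word vocabulary ("good", "bad", "the").
G = np.array([[0.6, 0.1, 0.3],            # a positive review's next-word distribution
              [0.1, 0.7, 0.2]])           # a negative review's next-word distribution
y = np.array([+1, -1])
v = np.array([+1.0, -1.0, 0.0])           # the sentence-completion classifier v

print(task_loss(G, y, v))                 # an upper bound on τ (the min over v is smaller)
```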

Main Result

Assumptions:
1. f is an LM that is ε-optimal in cross-entropy (does well on next word prediction), where ε = ℓ_xent({p_{f(s)}}) − ℓ_xent({p*_{·|s}}) is the loss due to the suboptimality of the LM
2. T is a τ-natural task (fits the sentence completion view); τ is the naturalness of the task
3. The word embeddings Φ are nice (assign similar embeddings to synonyms)

Result:  ℓ_T(Φ p_{f(s)}) ≤ τ + O(√ε)

That is, the logistic regression loss of the d-dimensional features Φ p_{f(s)} is bounded by the naturalness of the task plus a term reflecting the suboptimality of the LM. This answers: why can language models that do well on the cross-entropy objective learn features that are useful for linear classification tasks?

Main Result: closer look

Why can language models that do well on the cross-entropy objective learn features that are useful for linear classification tasks?

• Guarantee for an LM f that is ε-optimal in cross-entropy
• Use the output probabilities Φ p_{f(s)} as d-dimensional features
• Upper bound on the logistic regression loss for natural classification tasks:

ℓ_T(Φ p_{f(s)}) ≤ τ + O(√ε)

Conditional mean features:  Φ p_{f(s)} = Σ_w p_{f(s)}(w) φ_w, a weighted average of the word embeddings and a new way to extract d-dimensional features from an LM

Task        k    p_{f(s)} (k words)    p_{f(s)} (~20 words)    Φ p_{f(s)} (768 dim)    f(s) (768 dim)
SST         2    76.4                  78.2                    82.6                    87.6
SST*        2    79.4                  83.5                    87.0                    89.5
AG News     4    68.4                  78.3                    84.5                    90.7
AG News*    4    71.4                  83.0                    88.0                    91.1

[Plot: downstream loss ℓ_T(Φ p_{f(s)}) against cross-entropy ℓ_xent({p_{f(s)}}): observing the √ε dependence of the bound ℓ_T(Φ p_{f(s)}) ≤ τ + O(√ε) in practice]

Main take-aways

• Classification tasks → sentence completion → solve using vᵀ p*_{·|s} > 0
• An ε-optimal language model will do O(√ε)-well on such tasks
• Softmax models can hope to learn Φ p*_{·|s}
  • Good to assign similar embeddings to synonyms
• Conditional mean features Φ p_{f(s)}
  • A mathematically motivated way to extract d-dimensional features from LMs

More in paper

• Connection between f(s) and Φ p_{f(s)}

• Use insights to design a new objective, an alternative to cross-entropy

• Detailed bounds capture other intuitions

Future work

• Understand why f(s) does better than Φ p_{f(s)} in practice

• Bidirectional and masked language models (BERT and variants)
  • The theory applies when just one token is masked

• Diverse set of NLP tasks
  • Does the sentence completion view extend? Other insights?

• Role of finetuning, inductive biases
  • Needs more empirical exploration

• Self-supervised learning

Thank you!

• Happy to take questions

• Feel free to email: nsaunshi@cs.princeton.edu

• ArXiv: https://arxiv.org/abs/2010.03648
