CS11-747 Neural Networks for NLP: Introduction, Bag-of-words, and Multi-layer Perceptron
Graham Neubig
Site: https://phontron.com/class/nn4nlp2021/


Page 1:

CS11-747 Neural Networks for NLP

Introduction, Bag-of-words, and Multi-layer Perceptron

Graham Neubig

Site: https://phontron.com/class/nn4nlp2021/

Page 2:

Language is Hard!

Page 3:

Are These Sentences OK?

• Jane went to the store.
• store to Jane went the.
• Jane went store.
• Jane goed to the store.
• The store went to Jane.
• The food truck went to Jane.

Page 4:

Engineering Solutions

• Jane went to the store.
• store to Jane went the.
• Jane went store.
• Jane goed to the store.
• The store went to Jane.
• The food truck went to Jane.

} Create a grammar of the language
} Consider morphology and exceptions
} Semantic categories, preferences
} And their exceptions

Page 5:

Are These Sentences OK?

• ジェインは店へ行った。 (Jane went to the store.)
• は店行ったジェインは。 (scrambled word order)
• ジェインは店へ行た。 (inflection error: 行た instead of 行った)
• 店はジェインへ行った。 (The store went to Jane.)
• 屋台はジェインのところへ行った。 (The food truck went to Jane.)

Page 6:

Phenomena to Handle

• Morphology
• Syntax
• Semantics/World Knowledge
• Discourse
• Pragmatics
• Multilinguality

Page 7:

Neural Nets for NLP

• Neural nets are a tool to do hard things!

• This class will give you the tools to handle the problems you want to solve in NLP.

Page 8:

Class Format/Structure

Page 9:

(Special Remote) Class Format

• Before class: Watch the lecture video, often do a reading
• During class:
  • Discussion: Gather in Zoom to discuss questions presented in the video
  • Code/Data Walk: The TAs (or instructor) will sometimes walk through some demonstration code, data, or model predictions
• After class: Do a quiz about the material

Page 10:

Scope of Teaching

• Basics of general neural network knowledge -> Covered briefly (see the reading and ask TAs if you are not familiar). Will have a recitation.
• Advanced training techniques for neural networks -> Some coverage (e.g. VAEs and adversarial training), mostly from the perspective of NLP; not as much as other DL classes
• Advanced NLP-related neural network architectures -> Covered in detail
• Structured prediction and structured models in neural nets -> Covered in detail
• Implementation details salient to NLP -> Covered in detail

Page 11: CS11-747 Neural Networks for NLP Introduction, Bag-of

Assignments• Assignment 1 - Build-your-own Neural Network Toolkit:

Individually implement some parts of a neural network • Assignment 2 - Text Classifier / Questionnaire:

Individually implement a text classifier and fill in questionnaire on topics of interest

• Assignment 3 - SOTA Survey / Re-implementation: Re-implement and reproduce results from a recently published paper

• Assignment 4 - Final Project: Perform a unique project that either (1) improves on state-of-the-art, or (2) applies neural net models to a unique task

Page 12:

Instructors

• Instructor: Graham Neubig (natural language analysis, multilingual NLP, ML for NLP)
• Co-Instructor: Pengfei Liu (text summarization, information extraction, and interpretable evaluation)
• TAs:
  • Shuyan Zhou (natural language command and control)
  • Zhisong Zhang (syntax and shallow semantic analysis)
  • Divyansh Kaushik (robustness, causality, human-in-the-loop)
  • Zhengbao Jiang (knowledge and large language models)
  • Ritam Dutt (AI for social good, discourse and pragmatics)
• Piazza: http://piazza.com/cmu/spring2021/cs11747/home

Page 13:

Neural Networks: A Tool for Doing Hard Things

Page 14:

An Example Prediction Problem: Sentence Classification

“I hate this movie” → one of {very good, good, neutral, bad, very bad}
“I love this movie” → one of {very good, good, neutral, bad, very bad}

Page 15:

A First Try: Bag of Words (BOW)

I hate this movie
→ lookup a score vector for each word
→ sum the vectors and add a bias to get scores
→ softmax over the scores to get probs
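
The BOW pipeline on this slide can be sketched in a few lines. This is an illustrative toy (the vocabulary, random score vectors, and function names are made up for the example), not the course's actual assignment code:

```python
import numpy as np

CLASSES = ["very good", "good", "neutral", "bad", "very bad"]
rng = np.random.default_rng(0)

# Hypothetical vocabulary: one 5-element score vector per word.
vocab = {w: rng.normal(size=len(CLASSES)) for w in "i hate love this movie".split()}
bias = np.zeros(len(CLASSES))

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def bow_probs(sentence):
    # Sum the per-word score vectors, add the bias, normalize with softmax.
    scores = bias + sum(vocab[w] for w in sentence.lower().split())
    return softmax(scores)

probs = bow_probs("I hate this movie")
assert np.isclose(probs.sum(), 1.0)
```

Because the score vectors are simply summed, word order is ignored: “I hate this movie” and “movie this hate I” get identical probabilities.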

Page 16:

What do Our Vectors Represent?

• Each word has its own 5-element vector, with elements corresponding to [very good, good, neutral, bad, very bad]
• “hate” will have a high value for “very bad”, etc.

Page 17:

Build It, Break It

“There’s nothing I don’t love about this movie” → one of {very good, good, neutral, bad, very bad}
“I don’t love this movie” → one of {very good, good, neutral, bad, very bad}

Page 18:

Combination Features

• Does it contain “don’t” and “love”?
• Does it contain “don’t”, “i”, “love”, and “nothing”?

Page 19:

Basic Idea of Neural Networks (for NLP Prediction Tasks)

I hate this movie
→ lookup a vector for each word
→ some complicated function to extract combination features (neural net)
→ scores
→ softmax to get probs

Page 20:

Continuous Bag of Words (CBOW)

I hate this movie
→ lookup a dense vector for each word
→ sum the vectors
→ multiply by a weight matrix W and add a bias to get scores
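
A minimal sketch of the CBOW computation, with a made-up vocabulary and random parameters; the point is the structural change from BOW: words map to dense feature vectors, and a single weight matrix W turns the summed features into class scores:

```python
import numpy as np

EMB, NCLASS = 8, 5   # illustrative sizes
rng = np.random.default_rng(1)

# Hypothetical vocabulary: one dense feature vector per word.
emb = {w: rng.normal(size=EMB) for w in "i hate love this movie".split()}
W = rng.normal(size=(NCLASS, EMB))
b = np.zeros(NCLASS)

def cbow_scores(sentence):
    h = sum(emb[w] for w in sentence.lower().split())  # sum word vectors
    return W @ h + b                                   # project to class scores

scores = cbow_scores("I hate this movie")
assert scores.shape == (NCLASS,)
```

This is still a linear model over the word vectors, but the parameters now live in a reduced-dimension space instead of one score vector per word per class.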

Page 21:

What do Our Vectors Represent?

• Each vector has “features” (e.g. is this an animate object? is this a positive word? etc.)
• We sum these features, then use the result to make predictions
• Still no combination features: only the expressive power of a linear model, but with reduced dimension

Page 22:

Deep CBOW

I hate this movie
→ lookup a dense vector for each word and sum: h
→ h = tanh(W1*h + b1)
→ h = tanh(W2*h + b2)
→ multiply by W and add a bias to get scores
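
The deep CBOW adds the two tanh layers from the slide between the summed word vectors and the scoring layer. Shapes, names, and the vocabulary below are illustrative assumptions:

```python
import numpy as np

EMB, HID, NCLASS = 8, 16, 5   # illustrative sizes
rng = np.random.default_rng(2)

emb = {w: rng.normal(size=EMB) for w in "i hate love this movie not".split()}
W1, b1 = rng.normal(size=(HID, EMB)), np.zeros(HID)
W2, b2 = rng.normal(size=(HID, HID)), np.zeros(HID)
W, b = rng.normal(size=(NCLASS, HID)), np.zeros(NCLASS)

def deep_cbow_scores(sentence):
    h = sum(emb[w] for w in sentence.lower().split())  # CBOW sum
    h = np.tanh(W1 @ h + b1)   # first nonlinear layer
    h = np.tanh(W2 @ h + b2)   # second nonlinear layer
    return W @ h + b

scores = deep_cbow_scores("I not hate this movie")
assert scores.shape == (NCLASS,)
```

The nonlinearities are what allow hidden units to act like feature conjunctions ("feature 1 AND feature 5 are active"); the input is still an order-free bag of words.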

Page 23:

What do Our Vectors Represent?

• Now things are more interesting!
• We can learn feature combinations (a node in the second layer might be “feature 1 AND feature 5 are active”)
• e.g. capture things such as “not” AND “hate”

Page 24:

What is a Neural Net?: Computation Graphs

Page 25:

“Neural” Nets

Original Motivation: Neurons in the Brain (image credit: Wikipedia)

Current Conception: Computation Graphs

(example graph: value nodes x, A, b, c, and function nodes
f(u) = uᵀ, f(U,V) = UV, f(M,v) = Mv, f(u,v) = u·v, f(x1,x2,x3) = Σ_i xi)

Page 26:

expression: y = xᵀAx + b·x + c

graph: a single node x

A node is a {tensor, matrix, vector, scalar} value.

Page 27:

expression: y = xᵀAx + b·x + c

graph: x with an edge into f(u) = uᵀ

An edge represents a function argument (and also a data dependency). Edges are just pointers to nodes. A node with an incoming edge is a function of that edge’s tail node.

A node knows how to compute its value, and the value of its derivative w.r.t. each argument (edge) times the derivative of an arbitrary input F. For f(u) = uᵀ:

  (∂f(u)/∂u) (∂F/∂f(u)) = (∂F/∂f(u))ᵀ

Page 28:

expression: y = xᵀAx + b·x + c

graph: adds f(U,V) = UV, applied to (xᵀ, A)

Functions can be nullary, unary, binary, … n-ary. Often they are unary or binary.

Page 29:

expression: y = xᵀAx + b·x + c

graph: adds f(M,v) = Mv

Computation graphs are directed and acyclic (in DyNet).

Page 30:

expression: y = xᵀAx + b·x + c

graph: the subgraph computing f(x,A) = xᵀAx, with derivatives

  ∂f(x,A)/∂A = xxᵀ
  ∂f(x,A)/∂x = (Aᵀ + A)x
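
The derivative rule on this slide can be checked numerically. The sketch below compares the analytic gradient ∂f(x,A)/∂x = (Aᵀ + A)x against central finite differences for a random x and A:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
x = rng.normal(size=n)
A = rng.normal(size=(n, n))

f = lambda v: v @ A @ v        # f(x, A) = x^T A x as a function of x
analytic = (A.T + A) @ x       # the slide's analytic gradient

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (f(x + eps * np.eye(n)[i]) - f(x - eps * np.eye(n)[i])) / (2 * eps)
    for i in range(n)
])
assert np.allclose(analytic, numeric, atol=1e-4)
```

The same finite-difference trick works for ∂f/∂A = xxᵀ, perturbing one matrix entry at a time.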

Page 31:

expression: y = xᵀAx + b·x + c

graph: the full graph, with value nodes x, A, b, c and function nodes
f(u) = uᵀ, f(U,V) = UV, f(M,v) = Mv, f(u,v) = u·v, f(x1,x2,x3) = Σ_i xi

Page 32:

expression: y = xᵀAx + b·x + c

graph: the same graph, with the output node labeled y

Variable names are just labelings of nodes.

Page 33:

Algorithms (1)

• Graph construction
• Forward propagation
  • In topological order, compute the value of the node given its inputs
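
Forward propagation can be sketched with a toy Node class (not any particular framework's API): evaluate nodes in topological order, so every node's arguments are computed before the node itself:

```python
# A toy computation-graph node: input nodes carry a value; interior nodes
# carry a function of their argument nodes.
class Node:
    def __init__(self, fn=None, args=(), value=None):
        self.fn, self.args, self.value = fn, args, value

def forward(nodes):
    # `nodes` must be listed in topological order (arguments before users).
    for n in nodes:
        if n.fn is not None:
            n.value = n.fn(*[a.value for a in n.args])
    return nodes[-1].value

# Evaluate y = x*x + b*x + c for scalars (a 1-D analogue of the slides' graph).
x = Node(value=2.0)
b = Node(value=3.0)
c = Node(value=1.0)
xx = Node(lambda u, v: u * v, (x, x))
bx = Node(lambda u, v: u * v, (b, x))
y = Node(lambda *vs: sum(vs), (xx, bx, c))
assert forward([x, b, c, xx, bx, y]) == 11.0   # 2*2 + 3*2 + 1
```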

Pages 34-41: Forward Propagation

(the same graph, stepped through in topological order; the node values for y = xᵀAx + b·x + c are computed in sequence:

  xᵀ → xᵀA → b·x → xᵀAx → xᵀAx + b·x + c)

Page 42:

Algorithms (2)

• Back-propagation:
  • Process examples in reverse topological order
  • Calculate the derivatives of the parameters with respect to the final value (this is usually a “loss function”, a value we want to minimize)
• Parameter update:
  • Move the parameters a small step against this derivative: W -= α * dl/dW
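
As a concrete toy illustration of the update rule, treat y = xᵀAx + b·x + c itself as the loss, with A, b, c as parameters, and take one gradient step using the derivative rules from the earlier slides:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
x = rng.normal(size=n)                     # fixed "input"
A, b, c = rng.normal(size=(n, n)), rng.normal(size=n), 0.5   # parameters

def loss(A, b, c):
    return x @ A @ x + b @ x + c

# Gradients of the loss w.r.t. the parameters (computed in reverse
# topological order in a real framework).
dA = np.outer(x, x)   # d(x^T A x)/dA = x x^T
db = x                # d(b·x)/db = x
dc = 1.0              # d(c)/dc = 1

alpha = 0.1
before = loss(A, b, c)
A, b, c = A - alpha * dA, b - alpha * db, c - alpha * dc   # W -= α * dl/dW
assert loss(A, b, c) < before   # the step decreased this toy loss
```

Because this toy loss is linear in A, b, and c, a single step against the gradient is guaranteed to decrease it; real losses are nonlinear, but the update rule is the same.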

Page 43:

Back Propagation

(the same graph; derivatives are propagated through the nodes in reverse topological order)

Page 44:

Concrete Implementation Examples

Page 45:

Neural Network Frameworks

Page 46:

Basic Process in (Dynamic) Neural Network Frameworks

• Create a model
• For each example:
  • create a graph that represents the computation you want
  • calculate the result of that computation
  • if training, perform back-propagation and update
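
That loop can be sketched end to end with a tiny CBOW classifier in plain numpy (no specific framework; the data, sizes, and learning rate are made up for the example, and the gradients are written out by hand):

```python
import numpy as np

rng = np.random.default_rng(5)
EMB, NCLASS = 4, 2
data = [("i love this movie", 0), ("i hate this movie", 1)]   # toy training set
vocab = {w: i for i, w in enumerate("i love hate this movie".split())}
E = rng.normal(size=(len(vocab), EMB))               # embedding table
W = rng.normal(scale=0.1, size=(NCLASS, EMB))        # scoring layer
alpha = 0.1

def forward(words):
    h = E[[vocab[w] for w in words]].sum(axis=0)     # build the CBOW graph
    s = W @ h
    p = np.exp(s - s.max()); p /= p.sum()            # softmax
    return h, p

for epoch in range(100):
    for sent, gold in data:
        words = sent.split()
        h, p = forward(words)                        # run the computation
        dscore = p.copy(); dscore[gold] -= 1.0       # d(neg log-lik)/dscores
        dW = np.outer(dscore, h)                     # back-propagate ...
        dh = W.T @ dscore
        W -= alpha * dW                              # ... and update
        for w in words:
            E[vocab[w]] -= alpha * dh                # same gradient per summed word

# After training, both toy examples should be classified correctly.
for sent, gold in data:
    _, p = forward(sent.split())
    assert p.argmax() == gold
```

A dynamic framework (DyNet, PyTorch, ...) automates the gradient lines: you only build the graph per example and call backward.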

Page 47:

Bag of Words (BOW)

I hate this movie
→ lookup a score vector for each word
→ sum the vectors and add a bias to get scores
→ softmax over the scores to get probs

Page 48:

Continuous Bag of Words (CBOW)

I hate this movie
→ lookup a dense vector for each word
→ sum the vectors
→ multiply by a weight matrix W and add a bias to get scores

Page 49:

Deep CBOW

I hate this movie
→ lookup a dense vector for each word and sum: h
→ h = tanh(W1*h + b1)
→ h = tanh(W2*h + b2)
→ multiply by W and add a bias to get scores

Page 50:

Things to Remember Going Forward

Page 51:

Things to Remember

• Neural nets are powerful!
• They are universal function approximators: they can approximate any continuous function
• But language is hard, and data is limited
• We need to design our networks with inductive bias, to make it easy to learn the things we’d like to learn

Page 52:

Class Plan

Page 53:

Topic 1: Models of Sentences/Sequences

• Bag of words, bag of n-grams
• Convolutional nets
• Recurrent neural networks and variations
• Modeling documents and longer texts

(figure: an NN processing “this movie’s reputation is undeserved”)

Page 54:

Topic 2: Implementing, Debugging, and Interpreting

• Implementation: How to efficiently and effectively implement your models
• Debugging: How to find problems in your implemented models
• Interpretation: How to find out why your model made a prediction (example: [Ribeiro+ 16])

Page 55:

Topic 3: Conditioned Generation

• Encoder-decoder models
• Attentional models, self-attention (Transformers)

(figure: an LSTM encoder reads “I hate this movie”, and an LSTM decoder generates “この 映画 が 嫌い” - “I hate this movie” in Japanese - token by token with argmax, ending in </s>)

Page 56:

Topic 4: Pre-trained Embeddings

• Pre-training word embeddings, contextualized word embeddings, sentence embeddings

• Design decisions in pre-training: model, data, objective

Page 57:

Topic 5: Structured Prediction Models

• CRFs, and other marginalization-based training
• REINFORCE, minimum risk training
• Margin-based and search-based training methods
• Advanced search algorithms

(figure: a bidirectional LSTM tags “I hate this movie” as PRP VB DT NN)

Page 58:

Topic 6: Models of Tree/Graph Structures

• Shift-reduce, minimum spanning tree parsing
• Tree-structured compositions
• Models of graph structures

(figure: RNNs composing “I hate this movie” over a tree structure)

Page 59:

Topic 7: Advanced Learning Techniques

• Models with Latent Random Variables

• Adversarial Networks

• Semi-supervised and Unsupervised Learning

Page 60:

Topic 8: Knowledge-based and Text-based QA

• Learning and QA over knowledge graphs
• Machine reading and text-based QA

(figure: a small knowledge graph: dog is-a animal, cat is-a animal)

Page 61:

Topic 9: Multi-task and Multilingual Learning

• Multi-task and transfer learning
• Multilingual learning of representations

(figure: “I hate this movie” and its Japanese translation “この 映画 が 嫌い”, with POS tags PRP VB DT NN)

Page 62:

Any Questions?