
CS11-747 Neural Networks for NLP

Introduction, Bag-of-words, and Multi-layer Perceptron

Graham Neubig

Site: https://phontron.com/class/nn4nlp2021/

Language is Hard!

Are These Sentences OK?

• Jane went to the store.
• store to Jane went the.
• Jane went store.
• Jane goed to the store.
• The store went to Jane.
• The food truck went to Jane.

Engineering Solutions

• Jane went to the store.
• store to Jane went the.
• Jane went store.
• Jane goed to the store.
• The store went to Jane.
• The food truck went to Jane.

To catch each kind of error by hand, we would need to:
• Create a grammar of the language
• Consider morphology and exceptions
• Define semantic categories and preferences
• … and handle their exceptions

Are These Sentences OK?

• ジェインは店へ行った。 ("Jane went to the store.")
• は店行ったジェインは。 (scrambled word order)
• ジェインは店へ行た。 (morphological error: 行た instead of 行った)
• 店はジェインへ行った。 ("The store went to Jane.")
• 屋台はジェインのところへ行った。 ("The food truck went to Jane.")

Phenomena to Handle

• Morphology
• Syntax
• Semantics/World Knowledge
• Discourse
• Pragmatics
• Multilinguality

Neural Nets for NLP

• Neural nets are a tool to do hard things!

• This class will give you the tools to handle the problems you want to solve in NLP.

Class Format/Structure

(Special Remote) Class Format

• Before class: Watch lecture video, often do reading

• During class:
  • Discussion: Gather in Zoom to discuss some questions presented in the video
  • Code/Data Walk: The TAs (or instructor) will sometimes walk through some demonstration code, data, or model predictions

• After class: Do quiz about material

Scope of Teaching

• Basics of general neural network knowledge
  -> Covered briefly (see reading and ask TAs if you are not familiar). Will have recitation.

• Advanced training techniques for neural networks -> Some coverage (e.g. VAEs and adversarial training), mostly from an NLP perspective; not as much as other DL classes

• Advanced NLP-related neural network architectures -> Covered in detail

• Structured prediction and structured models in neural nets -> Covered in detail

• Implementation details salient to NLP -> Covered in detail

Assignments

• Assignment 1 - Build-your-own Neural Network Toolkit: Individually implement some parts of a neural network

• Assignment 2 - Text Classifier / Questionnaire: Individually implement a text classifier and fill in a questionnaire on topics of interest

• Assignment 3 - SOTA Survey / Re-implementation: Re-implement and reproduce results from a recently published paper

• Assignment 4 - Final Project: Perform a unique project that either (1) improves on the state of the art, or (2) applies neural net models to a unique task

Instructors

• Instructor: Graham Neubig (natural language analysis, multilingual NLP, ML for NLP)
• Co-Instructor: Pengfei Liu (text summarization, information extraction, and interpretable evaluation)
• TAs:
  • Shuyan Zhou (natural language command and control)
  • Zhisong Zhang (syntax and shallow semantic analysis)
  • Divyansh Kaushik (robustness, causality, human-in-the-loop)
  • Zhengbao Jiang (knowledge and large language models)
  • Ritam Dutt (AI for social good, discourse and pragmatics)

• Piazza: http://piazza.com/cmu/spring2021/cs11747/home

Neural Networks: A Tool for Doing Hard Things

An Example Prediction Problem: Sentence Classification

I hate this movie  →  one of { very good, good, neutral, bad, very bad }

I love this movie  →  one of { very good, good, neutral, bad, very bad }

A First Try: Bag of Words (BOW)

I hate this movie

[Diagram: look up a weight vector for each word, sum the vectors together with a bias to get scores, then apply a softmax to get probs]

What do Our Vectors Represent?

• Each word has its own 5-element weight vector, with elements corresponding to [very good, good, neutral, bad, very bad]

• “hate” will have a high value for “very bad”, etc.
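A minimal sketch of this BOW model may help make the picture concrete. This is not the course's own code (the course uses DyNet in its demos); the class name, vocabulary size, and use of PyTorch here are illustrative assumptions.

```python
# A minimal BOW sketch in PyTorch (illustrative; not the course's code).
import torch
import torch.nn as nn

LABELS = ["very good", "good", "neutral", "bad", "very bad"]

class BoW(nn.Module):
    def __init__(self, vocab_size, num_labels=len(LABELS)):
        super().__init__()
        # one weight vector per word: its contribution to each label's score
        self.word_scores = nn.Embedding(vocab_size, num_labels)
        self.bias = nn.Parameter(torch.zeros(num_labels))

    def forward(self, word_ids):                     # word_ids: 1-D LongTensor of token ids
        scores = self.word_scores(word_ids).sum(dim=0) + self.bias   # lookup, sum, add bias
        return torch.softmax(scores, dim=-1)         # probabilities over the 5 labels
```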

Build It, Break It

There’s nothing I don’t love about this movie  →  one of { very good, good, neutral, bad, very bad }

I don’t love this movie  →  one of { very good, good, neutral, bad, very bad }

Combination Features

• Does it contain “don’t” and “love”?

• Does it contain “don’t”, “i”, “love”, and “nothing”?

Basic Idea of Neural Networks (for NLP Prediction Tasks)

I hate this movie

[Diagram: look up a vector for each word, apply some complicated function to extract combination features (neural net), producing scores, then softmax → probs]

Continuous Bag of Words (CBOW)

I hate this movie

[Diagram: look up an embedding for each word, sum the embeddings, multiply by W and add a bias to get scores]

What do Our Vectors Represent?

• Each vector has “features” (e.g. is this an animate object? is this a positive word? etc.)

• We sum these features, then use these to make predictions

• Still no combination features: only the expressive power of a linear model, but dimension reduced
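As a rough illustration, CBOW can be written in a few lines. The embedding size, class name, and use of PyTorch are illustrative assumptions, not the course's code.

```python
# A minimal CBOW sketch in PyTorch (illustrative sizes and names).
import torch
import torch.nn as nn

class CBoW(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, num_labels=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # a dense feature vector per word
        self.out = nn.Linear(emb_dim, num_labels)      # W and bias: still one linear layer

    def forward(self, word_ids):
        h = self.emb(word_ids).sum(dim=0)              # sum the word feature vectors
        return self.out(h)                             # scores over labels (softmax applied later)
```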

Deep CBOW

I hate this movie

[Diagram: look up an embedding for each word, sum them into h, apply tanh( W1*h + b1) and tanh( W2*h + b2), then multiply by W and add a bias to get scores]

What do Our Vectors Represent?

• Now things are more interesting!

• We can learn feature combinations (a node in the second layer might be “feature 1 AND feature 5 are active”)

• e.g. capture things such as “not” AND “hate”
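A sketch of Deep CBOW, under the same assumptions as the previous snippets, just inserts two tanh layers between the summed embeddings and the output scores.

```python
# A minimal Deep CBOW sketch in PyTorch (layer sizes are illustrative assumptions).
import torch
import torch.nn as nn

class DeepCBoW(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=64, num_labels=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.layer1 = nn.Linear(emb_dim, hid_dim)      # W1, b1
        self.layer2 = nn.Linear(hid_dim, hid_dim)      # W2, b2
        self.out = nn.Linear(hid_dim, num_labels)      # final W and bias

    def forward(self, word_ids):
        h = self.emb(word_ids).sum(dim=0)
        h = torch.tanh(self.layer1(h))                 # nonlinearity: layers can now learn
        h = torch.tanh(self.layer2(h))                 #   combinations of features
        return self.out(h)                             # scores
```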

What is a Neural Net?: Computation Graphs

“Neural” Nets

Original Motivation: Neurons in the Brain

Image credit: Wikipedia

Current Conception: Computation Graphs

expression:  y = x^T A x + b·x + c

graph: [the expression is built up node by node from the inputs x, A, b, c, using the functions f(u) = u^T, f(U, V) = UV, f(M, v) = Mv, f(u, v) = u·v, and f(x1, x2, x3) = Σ_i xi]

• A node is a {tensor, matrix, vector, scalar} value.

• An edge represents a function argument (and also a data dependency). Edges are just pointers to nodes. A node with an incoming edge is a function of that edge’s tail node.

• A node knows how to compute its value, and the value of its derivative w.r.t. each argument (edge) times a derivative of an arbitrary input F, i.e. the chain-rule product (∂f(u)/∂u)^T (∂F/∂f(u)).

• Functions can be nullary, unary, binary, … n-ary. Often they are unary or binary.

• Computation graphs are directed and acyclic (in DyNet).

• Example node derivatives: for f(x, A) = x^T A x,
  ∂f(x, A)/∂A = x x^T    and    ∂f(x, A)/∂x = (A^T + A) x

• Variable names are just labelings of nodes: “y” is simply the label of the final node.
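To make this concrete, here is a rough sketch of building the same expression as a computation graph with PyTorch's autograd. The tensor sizes are arbitrary, and this is only an analogy: the course itself uses DyNet.

```python
# Building y = x^T A x + b·x + c as a computation graph with autograd (a sketch).
import torch

x = torch.randn(3, requires_grad=True)
A = torch.randn(3, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
c = torch.randn((), requires_grad=True)

# each operation adds a node to the graph; edges point back to its argument nodes
y = x @ A @ x + b @ x + c
print(y)        # "y" is just our label for the final node
```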

Algorithms (1)

• Graph construction

• Forward propagation
  • In topological order, compute the value of each node given its inputs

Forward Propagation

graph: [the same graph, with values computed node by node in topological order: x^T, then x^T A, then x^T A x, then b·x, and finally x^T A x + b·x + c]
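The same forward pass can be spelled out by hand, computing each node's value in topological order. This is a plain NumPy sketch with made-up values; the dictionary keys are just labels for the nodes.

```python
# Forward propagation by hand: evaluate each node in topological order (NumPy sketch).
import numpy as np

x = np.array([1.0, 2.0])
A = np.array([[1.0, 0.5],
              [0.0, 2.0]])
b = np.array([0.5, -1.0])
c = 3.0

values = {}
values["xT"]   = x.T                                   # f(u) = u^T
values["xTA"]  = values["xT"] @ A                      # f(U, V) = UV
values["xTAx"] = values["xTA"] @ x                     # f(M, v) = Mv
values["bx"]   = b @ x                                 # f(u, v) = u·v
values["y"]    = values["xTAx"] + values["bx"] + c     # f(x1, x2, x3) = sum_i xi
print(values["y"])
```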

Algorithms (2)

• Back-propagation:
  • Process examples in reverse topological order
  • Calculate the derivatives of the parameters with respect to the final value (this is usually a “loss function”, a value we want to minimize)

• Parameter update:
  • Move the parameters a small step against this derivative to reduce the loss: W -= α * dL/dW (see the sketch below)

Back Propagation

graph: [the same graph, with derivatives passed backwards along each edge, from the final value toward x, A, b, and c, in reverse topological order]
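Here is a rough autograd sketch of the backward pass and the parameter update on the same expression, treating y as the loss. The learning rate and sizes are illustrative assumptions.

```python
# Back-propagation and a parameter update on y = x^T A x + b·x + c (a sketch).
import torch

x = torch.randn(3)                        # input (no gradient needed)
A = torch.randn(3, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
c = torch.randn((), requires_grad=True)

loss = x @ A @ x + b @ x + c
loss.backward()                           # derivatives computed in reverse topological order

# check against the analytic derivatives (the slides give d(x^T A x)/dA = x x^T)
assert torch.allclose(A.grad, torch.outer(x, x))
assert torch.allclose(b.grad, x)          # d(b·x)/db = x

alpha = 0.1
with torch.no_grad():                     # parameter update: W -= α * dL/dW
    for param in (A, b, c):
        param -= alpha * param.grad
        param.grad.zero_()
```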

Concrete Implementation Examples

Neural Network Frameworks

Basic Process in (Dynamic) Neural Network Frameworks

• Create a model

• For each example

• create a graph that represents the computation you want

• calculate the result of that computation

• if training, perform back propagation and update
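As a sketch of that loop: the model, data, and learning rate below are all toy, illustrative choices, not the course's code.

```python
# A toy sketch of the create-model / per-example graph-build / backprop / update loop.
import torch
import torch.nn as nn

# create a model: a BOW-style scorer that sums one 5-dim weight vector per word id
model = nn.EmbeddingBag(num_embeddings=100, embedding_dim=5, mode="sum")
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

data = [(torch.tensor([3, 7, 7, 42]), torch.tensor([4])),   # (word ids, gold label id)
        (torch.tensor([1, 3, 9]),     torch.tensor([0]))]

for word_ids, label in data:                 # for each example:
    scores = model(word_ids.unsqueeze(0))    #   build the graph / compute the result
    loss = loss_fn(scores, label)            #   loss against the gold label
    optimizer.zero_grad()
    loss.backward()                          #   back-propagation
    optimizer.step()                         #   parameter update
```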

Bag of Words (BOW)

[model diagram as above: word lookups summed with a bias, softmax over label scores]

Continuous Bag of Words (CBOW)

[model diagram as above: summed word embeddings, multiplied by W and added to a bias, giving scores]

Deep CBOW

[model diagram as above: summed embeddings passed through tanh( W1*h + b1) and tanh( W2*h + b2), then a final linear layer giving scores]

Things to Remember Going Forward

Things to Remember

• Neural nets are powerful! They are universal function approximators, able to approximate any continuous function.

• But language is hard, and data is limited.

• We need to design our networks to have inductive bias, to make it easy to learn the things we’d like to learn.

Class Plan

Topic 1: Models of Sentences/Sequences

[Example: a neural net (NN) reads “this movie’s reputation is” and predicts “undeserved”]

• Bag of words, bag of n-grams
• Convolutional nets
• Recurrent neural networks and variations
• Modeling documents and longer texts

Topic 2: Implementing, Debugging, and Interpreting

• Implementation: How to efficiently and effectively implement your models

• Debugging: How to find problems in your implemented models

• Interpretation: How to find out why your model made a prediction

Example: [Ribeiro+ 16]

Topic 3: Conditioned Generation

[Example: an LSTM encoder reads “I hate this movie </s>”, and an LSTM decoder generates the Japanese translation “この 映画 が 嫌い </s>” (“I hate this movie”) one token at a time via argmax]

• Encoder-decoder models

• Attentional models, self-attention (Transformers)

Topic 4: Pre-trained Embeddings

• Pre-training word embeddings, contextualized word embeddings, sentence embeddings

• Design decisions in pre-training: model, data, objective

Topic 5: Structured Prediction Models

[Example: LSTM layers tag “I hate this movie” with the parts of speech PRP VB DT NN]

• CRFs and other marginalization-based training
• REINFORCE, minimum risk training
• Margin-based and search-based training methods
• Advanced search algorithms

Topic 6: Models of Tree/Graph Structures

• Shift reduce, minimum spanning tree parsing

• Tree structured compositions

• Models of graph structures

[Example: RNN units composed along a tree structure over “I hate this movie”]

Topic 7: Advanced Learning Techniques

• Models with Latent Random Variables

• Adversarial Networks

• Semi-supervised and Unsupervised Learning

Topic 8: Knowledge-based and Text-based QA

• Learning and QA over knowledge graphs

• Machine reading and text-based QA

[Example knowledge graph: “dog” is-a “animal”, “cat” is-a “animal”]

Topic 9: Multi-task and Multilingual Learning

• Multi-task and transfer learning

• Multilingual learning of representations

[Figure: “I hate this movie”, its Japanese counterpart “この 映画 が 嫌い”, and the POS tags PRP VB DT NN]

Any Questions?
