27: Intro to Deep Learning + Wrap-up Lisa Yan June 8, 2020 1


Page 1: 27: Intro to Deep Learning + Wrap-up

27: Intro to Deep Learning + Wrap-up
Lisa Yan
June 8, 2020

1

Page 2: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Quick slide reference

2

3 0.1% of Deep Learning LIVE

39 Beyond the Basics LIVE

63 CS109 Wrap-Up LIVE

Page 3: 27: Intro to Deep Learning + Wrap-up

Deep Learning

3

LIVE

Page 4: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Logistic Regression Model

4

P(Y = 1 | X = x) = σ(θᵀx),  where θᵀx = θ₀ + Σⱼ θⱼxⱼ

sigmoid function: σ(z) = 1 / (1 + e^(−z))

[Plot: the sigmoid curve σ(z) for z from −10 to 10, rising from 0 to 1]

[Diagram: inputs x feed a weighted sum (+) and then σ, producing ŷ = P(Y = 1 | X = x)]

ŷ = argmax over y of P(Y = y | X = x)

Not actually part of modeling, only part of prediction

(so ignore for now)
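As a quick illustration (not from the slides), here is a minimal numpy sketch of this model; the function and variable names (sigmoid, predict_proba, theta) are assumptions made for the example:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    # P(Y = 1 | X = x) = sigma(theta^T x)
    return sigmoid(np.dot(theta, x))

def predict(theta, x):
    # Prediction step (not part of the modeling): output 1 if P(Y = 1 | X = x) > 0.5
    return int(predict_proba(theta, x) > 0.5)
```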

Page 5: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Big ideas from Logistic Regression

5

ŷ = P(Y = 1 | X = x) = σ(θᵀx),  where θᵀx = θ₀ + Σⱼ θⱼxⱼ

sigmoid function: σ(z) = 1 / (1 + e^(−z))

[Plot: the same sigmoid curve σ(z) for z from −10 to 10]

[Diagram: inputs x feed a weighted sum (+) and then σ, producing ŷ = P(Y = 1 | X = x)]

> 0.5?

Page 6: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

One neuron

6

[Diagram: a weighted sum (+) followed by σ, mapping inputs x to ŷ]

= One logistic regression

Page 7: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Neural network = many logistic regressions

Biological basis for neural networks

7

A neuron

Your brain

One neuron = one logistic regression

(actually, probably someone else’s brain)

[Diagram: a single neuron with inputs x1, x2, x3, x4, weights θ1, θ2, θ3, θ4, and output y; alongside, a network built from many such neurons]

Page 8: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Innovations in deep learning

Deep learning (neural networks) is the core idea behind the current revolution in AI.

8

AlphaGo (2016)

Page 9: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Computers making art

9

Page 10: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Detecting skin cancer

10

Esteva, Andre, et al. "Dermatologist-level classification of skin cancer with deep neural networks." Nature 542.7639 (2017): 115-118.

Page 11: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Deep learning

def Deep learning is maximum likelihood estimation with neural networks.

11

x = [1, 0, …, 1], input

ŷ, output: P(Y | X = x)

> 0.5? Yes. Predict 1

Lots of Logistic (regressions)

LOL

def A neural network is (at its core) many logistic regression pieces stacked on top of each other.

Page 12: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Digit recognition example

12

Input image → Input feature vector → Output label

x^(1) = [0, 0, 0, 0, …, 1, 0, 0, 1, …, 0, 0, 1, 0],  y^(1) = 0

x^(2) = [0, 0, 1, 1, …, 0, 1, 1, 0, …, 0, 1, 0, 0],  y^(2) = 1

We make feature vectors from (digitized) pictures of numbers.
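For concreteness, here is a tiny, hypothetical sketch of turning a digitized picture into the on/off pixel feature vector described above (the 4×4 image and names are made up for illustration):

```python
import numpy as np

# A toy 4x4 "image" of on/off pixels (1 = ink, 0 = background).
image = np.array([[0, 0, 1, 0],
                  [0, 1, 1, 0],
                  [0, 0, 1, 0],
                  [0, 1, 1, 1]])

x = image.flatten()   # input feature vector of 0/1 pixels, shape (16,)
y = 1                 # output label for this training example
```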

Page 13: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Logistic Regression

13

x, input features (pixels, on/off)

ŷ, output

ŷ = σ(θᵀx) = P(Y = 1 | X = x)

Page 14: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Logistic Regression

14

x, input features (pixels, on/off)

ŷ, output

ŷ = σ(θᵀx) = P(Y = 1 | X = x)

(indicates a logistic regression connection)

> 0.5? No. Predict 0

Page 15: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Logistic Regression

15

x, input features

ŷ, output

ŷ = σ(θᵀx)

(indicates a logistic regression connection)

> 0.5? Yes. Predict 1

Page 16: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Logistic Regression

16

x, input features

ŷ, output

ŷ = σ(θᵀx)

(indicates a logistic regression connection)

> 0.5? Yes. Predict 1

What can we do to increase the complexity of our model?

Page 17: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Big ideas from logistic regression

17

x, input features

ŷ, output

ŷ = σ(θᵀx)

(indicates a logistic regression connection)

Big idea #1: P(Y | X = x). Model the conditional probability ŷ of the class label given the input.

Big idea #2: σ(θᵀx). Non-linear transform of multiple values into one value, using parameter θ.

Page 18: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Neural network

18

x, input features

h, hidden layer

ŷ, output

> 0.5? No. Predict 0

Page 19: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Neural network

19

x, input features

h, hidden layer

ŷ, output

Big idea #1: P(Y | X = x). Model the conditional probability ŷ of the class label given the input.

> 0.5? No. Predict 0

Page 20: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Feed neurons into other neurons

20

h, hidden layer

(one hidden neuron: a weighted sum + followed by σ)

• Neuron = logistic regression

x, input features

ŷ, output

> 0.5? No. Predict 0

Big idea #2: σ(θᵀx). Non-linear transform of multiple values into one value, using parameter θ.

Page 21: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Feed neurons into other neurons

21

h, hidden layer

(another hidden neuron: a weighted sum + followed by σ)

• Neuron = logistic regression
• Different parameters for every connection

x, input features

ŷ, output

> 0.5? No. Predict 0

Page 22: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Feed neurons into other neurons

22

x, input features

ŷ, output

h, hidden layer

|h| logistic regression connections

• Neuron = logistic regression
• Different parameters for every connection

> 0.5? No. Predict 0

m ⋅ |h| parameters (m = number of input features)

Page 23: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Feed neurons into other neurons

23

h, hidden layer

> 0.5? No. Predict 0

(the output neuron: a weighted sum + followed by σ)

• Neuron = logistic regression
• Different parameters for every connection

x, input features

ŷ, output

|h| logistic regression connections

m ⋅ |h| parameters

Page 24: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Feed neurons into other neurons

24

x, input features

h, hidden layer

ŷ, output

> 0.5? No. Predict 0

Input → hidden layer: |h| logistic regression connections, m ⋅ |h| parameters
Hidden layer → output: 1 logistic regression connection, |h| parameters

• Neural networks are simply composed logistic regression units (see the sketch below).
• 1 neuron = 1 logistic regression.
• Each neuron has its own set of parameters.
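Here is a minimal forward-pass sketch of that composition with a single hidden layer; the sizes, names, and random parameters are illustrative assumptions rather than the slides' notation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta_hidden, theta_out):
    # Hidden layer: |h| logistic regressions, one row of parameters per hidden neuron.
    h = sigmoid(Theta_hidden @ x)      # shape (|h|,)
    # Output neuron: one more logistic regression on top of the hidden layer.
    return sigmoid(theta_out @ h)      # scalar, P(Y = 1 | X = x)

# Example: m = 16 input features, |h| = 3 hidden neurons.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=16)            # on/off pixels
Theta_hidden = rng.normal(size=(3, 16))    # m * |h| parameters
theta_out = rng.normal(size=3)             # |h| parameters
print(forward(x, Theta_hidden, theta_out) > 0.5)   # classify as 1 if > 0.5
```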

Page 25: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Demonstration

http://scs.ryerson.ca/~aharley/vis/conv/

25

Page 26: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Neural networks

26

A neural network (like logistic regression) gets intelligence from its parameters θ.

Training:
• Learn parameters θ
• Find θ_MLE that maximizes the likelihood of the training data (MLE)

Testing/Prediction:
• For an input feature vector X = x, use the parameters to compute ŷ = P(Y = 1 | X = x)
• Classify the instance as 1 if ŷ > 0.5, and 0 otherwise

Page 27: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Neural networks

27

Training:

A neural network (like logistic regression) gets intelligence from its parameters θ.

• Learn parameters θ
• Find θ_MLE that maximizes the likelihood of the training data (MLE)

How do we learn the m ⋅ |h| + |h| parameters? Gradient ascent + chain rule!

Page 28: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Training Logistic Regression (Review)

28

1. Optimization problem:

θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ) = argmax_θ LL(θ)

LL(θ) = Σ_{i=1}^n [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ],  where ŷ^(i) = σ(θᵀx^(i))

2. Compute gradient:

∂LL(θ)/∂θ_j = Σ_{i=1}^n ( y^(i) − ŷ^(i) ) x_j^(i)

3. Optimize:

initialize params
repeat many times:
    compute gradient
    params += η * gradient
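A hedged Python sketch of this recipe for logistic regression (batch gradient ascent on LL(θ); the learning rate and iteration count are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.01, n_iters=1000):
    """X: (n, m) matrix of input feature vectors; y: (n,) array of 0/1 labels."""
    n, m = X.shape
    theta = np.zeros(m)                   # initialize params
    for _ in range(n_iters):              # repeat many times
        y_hat = sigmoid(X @ theta)        # predictions for all n training examples
        gradient = X.T @ (y - y_hat)      # dLL/dtheta_j = sum_i (y_i - y_hat_i) * x_ij
        theta += eta * gradient           # params += eta * gradient (ascent step)
    return theta
```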

Page 29: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

1. Optimization problem:

2. Compute gradient

3. Optimize

Training a neural net

29

θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ) = argmax_θ LL(θ)

Page 30: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

1. Same output ŷ, same conditional log-likelihood

30

x, input features

h, hidden layer

ŷ, output

P(Y = y | X = x) = ŷ if y = 1;  1 − ŷ if y = 0

L(θ) = ∏_{i=1}^n P(Y = y^(i) | X = x^(i), θ) = ∏_{i=1}^n (ŷ^(i))^(y^(i)) (1 − ŷ^(i))^(1 − y^(i))

LL(θ) = Σ_{i=1}^n [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]

Page 31: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

1. (model is a little more complicated, though)

31

x, input features

h, hidden layer

ŷ, output

P(Y = 1 | X = x):
h_j = σ(θ_j^(h) ⋅ x)  for j = 1, …, |h|
ŷ = σ(θ^(y) ⋅ h)

(|h| parameter vectors θ_j^(h), each with dimension m, plus one |h|-dimensional parameter vector θ^(y))

P(Y = y | X = x) = ŷ if y = 1;  1 − ŷ if y = 0

LL(θ) = Σ_{i=1}^n [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]

Page 32: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

1. Optimization problem:

2. Compute gradient

3. Optimize

2. Compute gradient

32

θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ) = argmax_θ LL(θ)

LL(θ) = Σ_{i=1}^n [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]

h_j = σ(θ_j^(h) ⋅ x)  for j = 1, …, |h|;   ŷ = σ(θ^(y) ⋅ h)

Calculus refresher #1: Derivative(sum) = sum(derivative)

Calculus refresher #2: Chain rule!!!

Take the gradient with respect to all θ parameters.
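For concreteness, applying those two refreshers to the one-hidden-layer model gives the following gradients for a single training example (a standard derivation sketched here; it is not spelled out on the slides):

```latex
% Output-layer parameters: chain rule through \hat{y} = \sigma(\theta^{(y)} \cdot h)
\frac{\partial LL}{\partial \theta^{(y)}_j} = (y - \hat{y})\, h_j
% Hidden-layer parameters: chain rule through h_j = \sigma(\theta^{(h)}_j \cdot x)
\frac{\partial LL}{\partial \theta^{(h)}_{jk}} = (y - \hat{y})\, \theta^{(y)}_j\, h_j (1 - h_j)\, x_k
```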

Page 33: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

h_j = σ(θ_j^(h) ⋅ x)  for j = 1, …, |h|;   ŷ = σ(θ^(y) ⋅ h)

1. Optimization problem:

2. Compute gradient

3. Optimize

3. Optimize

33

θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ) = argmax_θ LL(θ)

LL(θ) = Σ_{i=1}^n [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]

initialize params
repeat many times:
    compute gradient
    params += η * gradient

Take the gradient with respect to all θ parameters.

Page 34: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

h_j = σ(θ_j^(h) ⋅ x)  for j = 1, …, |h|;   ŷ = σ(θ^(y) ⋅ h)

Take the gradient with respect to all θ parameters.

1. Optimization problem:

2. Compute gradient

3. Optimize

Training a neural net

34

θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ) = argmax_θ LL(θ)

LL(θ) = Σ_{i=1}^n [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]

initialize params
repeat many times:
    compute gradient
    params += η * gradient

Wait, did we just skip something difficult?

Page 35: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

1. Optimization problem:

2. Compute gradient

3. Optimize

2. Compute gradient via backpropagation

35

θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ) = argmax_θ LL(θ)

LL(θ) = Σ_{i=1}^n [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]

initialize params
repeat many times:
    compute gradient
    params += η * gradient

Learn the tricks behind backpropagation in CS229, CS231N, CS224N, etc.

h_j = σ(θ_j^(h) ⋅ x)  for j = 1, …, |h|;   ŷ = σ(θ^(y) ⋅ h)

Take the gradient with respect to all θ parameters.

Page 36: 27: Intro to Deep Learning + Wrap-up

Interlude for jokes

36

Page 37: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Probability as college students

37

(Note: not a valid PDF, but a useful construct nonetheless)

Page 38: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

All remaining jokes

38

Knock knock. Who's there? Dishes. Dishes who? Dishes a great class.

Tell a joke about pizza! Or wait that might be too cheesy ;)

Why do you avoid the top of the bell curve? Because everyone there is mean.

Why is your nose in the middle of your face? Because it's the scenter :)

Two random variables were talking in a bar. They thought they were being discreet but I heard their chatter continuously.

What happens to a frog’s car when it breaks down? It gets toad away.

“A statistics professor plans to travel to a conference by plane. When he passes the security check, they discover a bomb in his carry-on! Of course, he is hauled off immediately for interrogation. "I don't understand!" the interrogating officer exclaims. "You're an accomplished

Lisa giving us a house tour.

“Three logicians walk into a bar. The bartender asks, "Will you all have a drink?" The first logician says "I don't know." The second logician says "I don't know." The third logician says "Yes."”

What did the fish say when he swam into a wall? Dam

What do you call it when you've been pulling too many all-nighters studying probability, and you see the ghost of Carl Friedrich Gauss? ... ... ... ... para-NORMAL activity(I'll take my extra credit now please)

What is similar about ice cream and bees? If you eat too many of either you’ll get a stomachache

Why does the Norwegian Navy have bar codes on the side of their ships? So when they come back to port they can...

what has two butts and kills people? an

(from mid-quarter feedback)

Page 39: 27: Intro to Deep Learning + Wrap-up

Beyond the basics

39

LIVE

Page 40: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Shared weights?

40

It turns out that if you want to force some of your weights to be shared across different neurons, the math isn't much harder. Convolution is an example of such weight sharing and is used a lot for vision (Convolutional Neural Networks, CNNs).
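As a rough picture of what weight sharing means, here is a made-up 1D convolution where every output position reuses the same three weights (purely illustrative, not the slides' example):

```python
import numpy as np

x = np.arange(8.0)              # 8 input features
w = np.array([0.5, 1.0, 0.5])   # one shared set of 3 weights

# Each output position reuses the same w instead of having its own parameters.
out = np.array([np.dot(w, x[i:i + 3]) for i in range(len(x) - 2)])
print(out)   # 6 outputs, but only 3 parameters in total
```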

Page 41: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Neural networks with multiple layers

41

[Diagram: a network with several hidden layers between the input x and the output ŷ]

Page 42: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Neurons learn features of the dataset

42

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

Neurons in later layers will respond strongly to high-level features of your training data. If your training data is faces, you will get lots of face neurons.

If your training data is all of YouTube…

Top stimuli in test set / Optimal stimulus found by numerical optimization

…you get a cat neuron.

Page 43: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019 43

Page 44: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Multiple outputs?

Softmax is a generalization of the sigmoid function.

44

z ∈ ℝ:  P(Y = 1 | X = x) = σ(z)

sigmoid(z): a value in the range [0, 1]
(equivalent: Bernoulli p)

z ∈ ℝ^k:  P(Y = i | X = x) = softmax(z)_i

softmax(z): k values in the range [0, 1] that add up to 1
(equivalent: Multinomial p_1, …, p_k)
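A small numpy sketch of the two side by side (the max subtraction in softmax is a common numerical-stability trick, added here rather than taken from the slides):

```python
import numpy as np

def sigmoid(z):
    # One value in [0, 1] (equivalent to a Bernoulli parameter p).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # k values in [0, 1] that sum to 1 (equivalent to multinomial p_1, ..., p_k).
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(sigmoid(2.0))                          # ~0.88
print(softmax(np.array([2.0, 1.0, 0.1])))    # ~[0.66, 0.24, 0.10]
```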

Page 45: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Softmax test metric: Top-5 error

45

(class label: 5)

Predicted probabilities: 0.148, 0.137, 0.122, 0.109, 0.104, 0.091, 0.090, 0.096, 0.083, 0.05, …

Top-5 classification error: what % of datapoints did not have the correct class label in the top 5 predictions?
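A minimal sketch of how that metric could be computed from a matrix of predicted probabilities (the function and variable names here are assumptions):

```python
import numpy as np

def top5_error(probs, labels):
    # probs: (n, k) predicted class probabilities; labels: (n,) true class indices.
    top5 = np.argsort(probs, axis=1)[:, -5:]          # indices of the 5 largest probabilities
    hit = np.any(top5 == labels[:, None], axis=1)     # was the true label among them?
    return 1.0 - hit.mean()                           # fraction of datapoints missed
```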

Page 46: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

ImageNet classification

46

22,000 categories

14,000,000 images

Hand-engineered features (SIFT, HOG, LBP), Spatial pyramid, Sparse Coding/Compression

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

…smoothhound, smoothhound shark, Mustelus mustelus
American smooth dogfish, Mustelus canis
Florida smoothhound, Mustelus norrisi
whitetip shark, reef whitetip shark, Triaenodon obseus
Atlantic spiny dogfish, Squalus acanthias
Pacific spiny dogfish, Squalus suckleyi
hammerhead, hammerhead shark
smooth hammerhead, Sphyrna zygaena
smalleye hammerhead, Sphyrna tudes
shovelhead, bonnethead, bonnet shark, Sphyrna tiburo
angel shark, angelfish, Squatina squatina, monkfish
electric ray, crampfish, numbfish, torpedo
smalltooth sawfish, Pristis pectinatus
guitarfish
roughtail stingray, Dasyatis centroura
butterfly ray
eagle ray
spotted eagle ray, spotted ray, Aetobatus narinari
cownose ray, cow-nosed ray, Rhinoptera bonasus
manta, manta ray, devilfish
Atlantic manta, Manta birostris
devil ray, Mobula hypostoma
grey skate, gray skate, Raja batis
little skate, Raja erinacea
…

Stingray

Manta ray

Page 47: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

ImageNet classification challenge

47

22,000 categories

14,000,000 images

Hand-engineered features (SIFT, HOG, LBP), Spatial pyramid, Sparse Coding/Compression

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012


1000 categories

200,000 images in test set1,200,000 images in train set

Page 48: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

ImageNet challenge: Top-5 classification error

48

99.5% Random guess

P(true class label not in 5 guesses) = C(999, 5) / C(1000, 5) = 995/1000 = 99.5%

(lower is better)

Page 49: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

ImageNet challenge: Top-5 classification error

49

99.5% Random guess

25.8%

?

Pre-Neural Networks

GoogLeNet (2015)

16.4%

5.1% Humans (2014)

(lower is better)

Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge. IJCV 2015
Szegedy et al., Going Deeper With Convolutions. CVPR 2015
Hu et al., Squeeze-and-Excitation Networks. arXiv preprint, 2017

Page 50: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

ImageNet challenge: Top-5 classification error

50

99.5% Random guess

25.8%

?

Pre-Neural Networks

GoogLeNet (2015)

16.4%

5.1% Humans (2014)

SENet (2017)

2.25%

Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge. IJCV 2015
Szegedy et al., Going Deeper With Convolutions. CVPR 2015
Hu et al., Squeeze-and-Excitation Networks. arXiv preprint, 2017

(lower is better)

Page 51: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

GoogLeNet (2015)

51

1 Trillion Artificial Neurons (btw, human brains have roughly 100 billion neurons)

22 layers deep!

Multiple, multi-class outputs

Szegedy et al., Going Deeper With Convolutions. CVPR 2015

Page 52: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Good ML = Generalization

Goal of machine learning: build models that generalize well to predicting new data.
Overfitting: fitting the training data too well, so we lose generality of the model.

• The polynomial on the right fits the training data perfectly!
• Which would you rather use to predict a new data point?

52

Page 53: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Prevent overfitting?

53

Dropout (training technique)

When your model is training, randomly turn off your neurons with probability 0.5. It will make your network more robust.
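A rough sketch of that idea applied to a vector of hidden-layer activations (the rescaling by the keep probability is the common "inverted dropout" convention, assumed here rather than stated on the slide):

```python
import numpy as np

def dropout(h, keep_prob=0.5, training=True):
    if not training:
        return h                                   # at test time, use all neurons
    mask = np.random.rand(*h.shape) < keep_prob    # randomly keep each neuron w.p. 0.5
    return h * mask / keep_prob                    # zero out dropped neurons, rescale the rest
```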

Page 54: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Making decisions?

54

Not everything is classification.

Deep Reinforcement Learning

Instead of having the output of a model be a probability, you make the output an expectation.

http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html

Page 55: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Deep Reinforcement Learning

55

http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html

DeepMind Atari games: score compared to the best human

Page 56: 27: Intro to Deep Learning + Wrap-up

What’s missing?

56

Page 57: 27: Intro to Deep Learning + Wrap-up

Where is your data coming from?

57

Page 58: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Ethics and datasets

Sometimes machine learning feels universally unbiased.
We can even prove our estimators are "unbiased" (mathematically).
Google/Nikon/HP had biased datasets.

58

Page 59: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Should your data be unbiased?

59

[Screenshot: first page of "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings," Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai (arXiv:1607.06520, 2016). The paper shows that word embeddings trained on Google News articles exhibit gender stereotypes, e.g. man − woman ≈ computer programmer − homemaker, and develops algorithms to debias the embedding while preserving its useful properties.]

Dataset: Google News

Bolukbasi et al., Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. NIPS 2016

Should our unbiased data collection reflect society’s systemic bias?

Page 60: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

How can we explain decisions?

60

If your task is image classification, reasoning about high-level features is relatively easy. Everything can be visualized.

What if you are trying to classify social outcomes?

• Criminal recidivism
• Job performance
• Policing
• Terrorist risk
• At-risk kids

Page 61: 27: Intro to Deep Learning + Wrap-up

Ethics in Machine Learning is a whole new field.

61

Page 62: 27: Intro to Deep Learning + Wrap-up

CS109 Wrap-Up

62

LIVE

Page 63: 27: Intro to Deep Learning + Wrap-up

63

What have we learned in CS109?

Page 64: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

A wild journey

64

Computer science Probability

Page 65: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

From combinatorics to probability…

65

P(E) + P(Eᶜ) = 1

Page 66: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

…to random variables and the Central Limit Theorem…

66

Bernoulli

Poisson

Gaussian

Page 67: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

…to statistics, parameter estimation, and machine learning

67

A happy Bhutanese person

[Figures: examples from statistics, parameter estimation, and machine learning, including a Bayesian network with nodes such as Flu, Undergrad, Tired, and Fever]

Page 68: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Lots and lots of analysis

68

Climate sensitivity

Bloom filters

Peer grading

Biometric keystroke recognition

WebMD inference

p-hacking?

Coursera A/B testing

Page 69: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Lots and lots of analysis

69

Heart

Ancestry Netflix

Page 70: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

After CS109

70

Statistics

Stats 200 – Statistical Inference
Stats 208 – Intro to the Bootstrap
Stats 209 – Group Methods/Causal Inference

Theory

CS161 – Algorithmic Analysis
Stats 217 – Stochastic Processes
CS238 – Decision Making Under Uncertainty
CS228 – Probabilistic Graphical Models

Page 71: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

After CS109

71

AI

CS 221 – Intro to AI
CS 229 – Machine Learning
CS 230 – Deep Learning
CS 224N – Natural Language Processing
CS 231N – Conv Neural Nets for Visual Recognition
CS 234 – Reinforcement Learning

Applications

CS 279 – Bio Computation
Literally any class with numbers in it

Page 72: 27: Intro to Deep Learning + Wrap-up

72

What have we done together this quarter?

Page 73: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

The CS109 teaching team

73

Page 74: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020 74

Section

Python session Overleaf tutorial

Discussion, synchronous and asynchronous

Page 75: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Tea hours

75

Turns out jigsaw puzzles aren’t probabilistic!

Page 76: 27: Intro to Deep Learning + Wrap-up

76

What have you learned from CS109?

Page 77: 27: Intro to Deep Learning + Wrap-up

77

What do you want to remember in 5 years?

Page 78: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

What do you want to remember in the next 5 years?

78

Page 79: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

What do you want to remember in the next 5 years?

79

Page 80: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

What do you want to remember in the next 5 years?

80

Page 81: 27: Intro to Deep Learning + Wrap-up

81

Why study probability + CS?

Page 82: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Why study probability + CS?

82

Source: US Bureau of Labor Statistics

Page 83: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Why study probability + CS?

83

Interdisciplinary Closest thing to magic

Page 84: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Why study probability + CS?

84

Everyone is welcome!

Page 85: 27: Intro to Deep Learning + Wrap-up

85

Technology magnifies.
What do we want magnified?

Page 86: 27: Intro to Deep Learning + Wrap-up

86

You are all one step closer to improving the world.

(all of you!)

Page 87: 27: Intro to Deep Learning + Wrap-up

87

The end

See you soon… :)