27: Intro to Deep Learning + Wrap-up Lisa Yan June 8, 2020 1


Page 1: 27: Intro to Deep Learning + Wrap-up

27: Intro to Deep Learning + Wrap-up
Lisa Yan
June 8, 2020

1

Page 2: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Quick slide reference

2

3 0.1% of Deep Learning LIVE

39 Beyond the Basics LIVE

63 CS109 Wrap-Up LIVE

Page 3: 27: Intro to Deep Learning + Wrap-up

Deep Learning

3

LIVE

Page 4: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Logistic Regression Model

4

P(Y = 1 | X = x) = σ(θᵀx),  where θᵀx = θ₀ + Σⱼ θⱼxⱼ

sigmoid function: σ(z) = 1 / (1 + e^(−z))

[Plot: the sigmoid curve σ(z) for z from −10 to 10, rising from 0 to 1]

[Diagram: inputs x feed a weighted sum (+) and then σ, producing ŷ = P(Y = 1 | X = x)]

ŷ = argmax over y of P(Y = y | X = x)

Not actually part of modeling, only part of prediction

(so ignore for now)
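As a quick illustration (not from the slides), here is a minimal numpy sketch of this model; the function and variable names (sigmoid, predict_proba, theta) are assumptions made for the example:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    # P(Y = 1 | X = x) = sigma(theta^T x)
    return sigmoid(np.dot(theta, x))

def predict(theta, x):
    # Prediction step (not part of the modeling): output 1 if P(Y = 1 | X = x) > 0.5
    return int(predict_proba(theta, x) > 0.5)
```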

Page 5: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Big ideas from Logistic Regression

5

ŷ = P(Y = 1 | X = x) = σ(θᵀx),  where θᵀx = θ₀ + Σⱼ θⱼxⱼ

sigmoid function: σ(z) = 1 / (1 + e^(−z))

[Plot: the same sigmoid curve σ(z) for z from −10 to 10]

[Diagram: inputs x feed a weighted sum (+) and then σ, producing ŷ = P(Y = 1 | X = x)]

> 0.5?

Page 6: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

One neuron

6

[Diagram: a weighted sum (+) followed by σ, mapping inputs x to ŷ]

= One logistic regression

Page 7: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Neural network = many logistic regressions

Biological basis for neural networks

7

A neuron

Your brain

One neuron = one logistic regression

(actually, probably someone else’s brain)

[Diagram: a single neuron with inputs x1, x2, x3, x4, weights θ1, θ2, θ3, θ4, and output y; alongside, a network built from many such neurons]

Page 8: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Innovations in deep learning

Deep learning (neural networks) is the core idea behind the current revolution in AI.

8

AlphaGo (2016)

Page 9: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Computers making art

9

Page 10: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Detecting skin cancer

10

Esteva, Andre, et al. "Dermatologist-level classification of skin cancer with deep neural networks." Nature 542.7639 (2017): 115-118.

Page 11: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Deep learning

def Deep learning is maximum likelihood estimation with neural networks.

11

x = [1, 0, …, 1], input

ŷ, output: P(Y | X = x)

> 0.5? Yes. Predict 1

Lots of Logistic (regressions)

LOL

def A neural network is (at its core) many logistic regression pieces stacked on top of each other.

Page 12: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Digit recognition example

12

Input image → Input feature vector → Output label

x^(1) = [0, 0, 0, 0, …, 1, 0, 0, 1, …, 0, 0, 1, 0],  y^(1) = 0

x^(2) = [0, 0, 1, 1, …, 0, 1, 1, 0, …, 0, 1, 0, 0],  y^(2) = 1

We make feature vectors from (digitized) pictures of numbers.
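For concreteness, here is a tiny, hypothetical sketch of turning a digitized picture into the on/off pixel feature vector described above (the 4×4 image and names are made up for illustration):

```python
import numpy as np

# A toy 4x4 "image" of on/off pixels (1 = ink, 0 = background).
image = np.array([[0, 0, 1, 0],
                  [0, 1, 1, 0],
                  [0, 0, 1, 0],
                  [0, 1, 1, 1]])

x = image.flatten()   # input feature vector of 0/1 pixels, shape (16,)
y = 1                 # output label for this training example
```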

Page 13: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Logistic Regression

13

x, input features (pixels, on/off)

ŷ, output

ŷ = σ(θᵀx) = P(Y = 1 | X = x)

Page 14: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Logistic Regression

14

x, input features (pixels, on/off)

ŷ, output

ŷ = σ(θᵀx) = P(Y = 1 | X = x)

(indicates a logistic regression connection)

> 0.5? No. Predict 0

Page 15: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Logistic Regression

15

x, input features

ŷ, output

ŷ = σ(θᵀx)

(indicates a logistic regression connection)

> 0.5? Yes. Predict 1

Page 16: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Logistic Regression

16

x, input features

ŷ, output

ŷ = σ(θᵀx)

(indicates a logistic regression connection)

> 0.5? Yes. Predict 1

What can we do to increase the complexity of our model?

Page 17: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Big ideas from logistic regression

17

x, input features

ŷ, output

ŷ = σ(θᵀx)

(indicates a logistic regression connection)

Big idea #1: P(Y | X = x). Model the conditional probability ŷ of the class label given the input.

Big idea #2: σ(θᵀx). Non-linear transform of multiple values into one value, using parameter θ.

Page 18: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Neural network

18

x, input features

h, hidden layer

ŷ, output

> 0.5? No. Predict 0

Page 19: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Neural network

19

x, input features

h, hidden layer

ŷ, output

Big idea #1: P(Y | X = x). Model the conditional probability ŷ of the class label given the input.

> 0.5? No. Predict 0

Page 20: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Feed neurons into other neurons

20

h, hidden layer

(one hidden neuron: a weighted sum + followed by σ)

• Neuron = logistic regression

x, input features

ŷ, output

> 0.5? No. Predict 0

Big idea #2: σ(θᵀx). Non-linear transform of multiple values into one value, using parameter θ.

Page 21: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Feed neurons into other neurons

21

h, hidden layer

(another hidden neuron: a weighted sum + followed by σ)

• Neuron = logistic regression
• Different parameters for every connection

x, input features

ŷ, output

> 0.5? No. Predict 0

Page 22: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Feed neurons into other neurons

22

x, input features

ŷ, output

h, hidden layer

|h| logistic regression connections

• Neuron = logistic regression
• Different parameters for every connection

> 0.5? No. Predict 0

m ⋅ |h| parameters (m = number of input features)

Page 23: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Feed neurons into other neurons

23

h, hidden layer

> 0.5? No. Predict 0

(the output neuron: a weighted sum + followed by σ)

• Neuron = logistic regression
• Different parameters for every connection

x, input features

ŷ, output

|h| logistic regression connections

m ⋅ |h| parameters

Page 24: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Feed neurons into other neurons

24

x, input features

h, hidden layer

ŷ, output

> 0.5? No. Predict 0

Input → hidden layer: |h| logistic regression connections, m ⋅ |h| parameters
Hidden layer → output: 1 logistic regression connection, |h| parameters

• Neural networks are simply composed logistic regression units (see the sketch below).
• 1 neuron = 1 logistic regression.
• Each neuron has its own set of parameters.
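Here is a minimal forward-pass sketch of that composition with a single hidden layer; the sizes, names, and random parameters are illustrative assumptions rather than the slides' notation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta_hidden, theta_out):
    # Hidden layer: |h| logistic regressions, one row of parameters per hidden neuron.
    h = sigmoid(Theta_hidden @ x)      # shape (|h|,)
    # Output neuron: one more logistic regression on top of the hidden layer.
    return sigmoid(theta_out @ h)      # scalar, P(Y = 1 | X = x)

# Example: m = 16 input features, |h| = 3 hidden neurons.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=16)            # on/off pixels
Theta_hidden = rng.normal(size=(3, 16))    # m * |h| parameters
theta_out = rng.normal(size=3)             # |h| parameters
print(forward(x, Theta_hidden, theta_out) > 0.5)   # classify as 1 if > 0.5
```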

Page 25: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Demonstration

http://scs.ryerson.ca/~aharley/vis/conv/

25

Page 26: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Neural networks

26

A neural network (like logistic regression) gets intelligence from its parameters θ.

Training:
• Learn parameters θ
• Find θ_MLE that maximizes the likelihood of the training data (MLE)

Testing/Prediction:
• For an input feature vector X = x, use the parameters to compute ŷ = P(Y = 1 | X = x)
• Classify the instance as 1 if ŷ > 0.5, and 0 otherwise

Page 27: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Neural networks

27

Training:

A neural network (like logistic regression) gets intelligence from its parameters θ.

• Learn parameters θ
• Find θ_MLE that maximizes the likelihood of the training data (MLE)

How do we learn the m ⋅ |h| + |h| parameters? Gradient ascent + chain rule!

Page 28: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Training Logistic Regression (Review)

28

1. Optimization problem:

θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ) = argmax_θ LL(θ)

LL(θ) = Σ_{i=1}^n [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ],  where ŷ^(i) = σ(θᵀx^(i))

2. Compute gradient:

∂LL(θ)/∂θ_j = Σ_{i=1}^n ( y^(i) − ŷ^(i) ) x_j^(i)

3. Optimize:

initialize params
repeat many times:
    compute gradient
    params += η * gradient
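A hedged Python sketch of this recipe for logistic regression (batch gradient ascent on LL(θ); the learning rate and iteration count are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.01, n_iters=1000):
    """X: (n, m) matrix of input feature vectors; y: (n,) array of 0/1 labels."""
    n, m = X.shape
    theta = np.zeros(m)                   # initialize params
    for _ in range(n_iters):              # repeat many times
        y_hat = sigmoid(X @ theta)        # predictions for all n training examples
        gradient = X.T @ (y - y_hat)      # dLL/dtheta_j = sum_i (y_i - y_hat_i) * x_ij
        theta += eta * gradient           # params += eta * gradient (ascent step)
    return theta
```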

Page 29: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

1. Optimization problem:

2. Compute gradient

3. Optimize

Training a neural net

29

θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ) = argmax_θ LL(θ)

Page 30: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

1. Same output ŷ, same conditional log-likelihood

30

x, input features

h, hidden layer

ŷ, output

P(Y = y | X = x) = ŷ if y = 1;  1 − ŷ if y = 0

L(θ) = ∏_{i=1}^n P(Y = y^(i) | X = x^(i), θ) = ∏_{i=1}^n (ŷ^(i))^(y^(i)) (1 − ŷ^(i))^(1 − y^(i))

LL(θ) = Σ_{i=1}^n [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]

Page 31: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

1. (model is a little more complicated, though)

31

x, input features

h, hidden layer

ŷ, output

P(Y = 1 | X = x):
h_j = σ(θ_j^(h) ⋅ x)  for j = 1, …, |h|
ŷ = σ(θ^(y) ⋅ h)

(|h| parameter vectors θ_j^(h), each with dimension m, plus one |h|-dimensional parameter vector θ^(y))

P(Y = y | X = x) = ŷ if y = 1;  1 − ŷ if y = 0

LL(θ) = Σ_{i=1}^n [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]

Page 32: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

1. Optimization problem:

2. Compute gradient

3. Optimize

2. Compute gradient

32

θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ) = argmax_θ LL(θ)

LL(θ) = Σ_{i=1}^n [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]

h_j = σ(θ_j^(h) ⋅ x)  for j = 1, …, |h|;   ŷ = σ(θ^(y) ⋅ h)

Calculus refresher #1: Derivative(sum) = sum(derivative)

Calculus refresher #2: Chain rule!!!

Take the gradient with respect to all θ parameters.
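For concreteness, applying those two refreshers to the one-hidden-layer model gives the following gradients for a single training example (a standard derivation sketched here; it is not spelled out on the slides):

```latex
% Output-layer parameters: chain rule through \hat{y} = \sigma(\theta^{(y)} \cdot h)
\frac{\partial LL}{\partial \theta^{(y)}_j} = (y - \hat{y})\, h_j
% Hidden-layer parameters: chain rule through h_j = \sigma(\theta^{(h)}_j \cdot x)
\frac{\partial LL}{\partial \theta^{(h)}_{jk}} = (y - \hat{y})\, \theta^{(y)}_j\, h_j (1 - h_j)\, x_k
```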

Page 33: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

h_j = σ(θ_j^(h) ⋅ x)  for j = 1, …, |h|;   ŷ = σ(θ^(y) ⋅ h)

1. Optimization problem:

2. Compute gradient

3. Optimize

3. Optimize

33

θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ) = argmax_θ LL(θ)

LL(θ) = Σ_{i=1}^n [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]

initialize params
repeat many times:
    compute gradient
    params += η * gradient

Take the gradient with respect to all θ parameters.

Page 34: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

h_j = σ(θ_j^(h) ⋅ x)  for j = 1, …, |h|;   ŷ = σ(θ^(y) ⋅ h)

Take the gradient with respect to all θ parameters.

1. Optimization problem:

2. Compute gradient

3. Optimize

Training a neural net

34

θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ) = argmax_θ LL(θ)

LL(θ) = Σ_{i=1}^n [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]

initialize params
repeat many times:
    compute gradient
    params += η * gradient

Wait, did we just skip something difficult?

Page 35: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

1. Optimization problem:

2. Compute gradient

3. Optimize

2. Compute gradient via backpropagation

35

θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ) = argmax_θ LL(θ)

LL(θ) = Σ_{i=1}^n [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]

initialize params
repeat many times:
    compute gradient
    params += η * gradient

Learn the tricks behind backpropagation in CS229, CS231N, CS224N, etc.

h_j = σ(θ_j^(h) ⋅ x)  for j = 1, …, |h|;   ŷ = σ(θ^(y) ⋅ h)

Take the gradient with respect to all θ parameters.

Page 36: 27: Intro to Deep Learning + Wrap-up

Interlude for jokes

36

Page 37: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Probability as college students

37

(Note: not a valid PDF, but a useful construct nonetheless)

Page 38: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

All remaining jokes

38

Knock knock. Who's there? Dishes. Dishes who? Dishes a great class.

Tell a joke about pizza! Or wait that might be too cheesy ;)

Why do you avoid the top of the bell curve? Because everyone there is mean.

Why is your nose in the middle of your face? Because it's the scenter :)

Two random variables were talking in a bar. They thought they were being discreet but I heard their chatter continuously.

What happens to a frog’s car when it breaks down? It gets toad away.

“A statistics professor plans to travel to a conference by plane. When he passes the security check, they discover a bomb in his carry-on! Of course, he is hauled off immediately for interrogation. "I don't understand!" the interrogating officer exclaims. "You're an accomplished

Lisa giving us a house tour.

“Three logicians walk into a bar. The bartender asks, "Will you all have a drink?" The first logician says "I don't know." The second logician says "I don't know." The third logician says "Yes."”

What did the fish say when he swam into a wall? Dam

What do you call it when you've been pulling too many all-nighters studying probability, and you see the ghost of Carl Friedrich Gauss? ... ... ... ... para-NORMAL activity(I'll take my extra credit now please)

What is similar about ice cream and bees? If you eat too many of either you’ll get a stomachache

Why does the Norwegian Navy have bar codes on the side of their ships? So when they come back to port they can...

what has two butts and kills people? an

(from mid-quarter feedback)

Page 39: 27: Intro to Deep Learning + Wrap-up

Beyond the basics

39

LIVE

Page 40: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Shared weights?

40

It turns out that if you want to force some of your weights to be shared across different neurons, the math isn't much harder. Convolution is an example of such weight sharing and is used a lot for vision (Convolutional Neural Networks, CNNs).
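As a rough picture of what weight sharing means, here is a made-up 1D convolution where every output position reuses the same three weights (purely illustrative, not the slides' example):

```python
import numpy as np

x = np.arange(8.0)              # 8 input features
w = np.array([0.5, 1.0, 0.5])   # one shared set of 3 weights

# Each output position reuses the same w instead of having its own parameters.
out = np.array([np.dot(w, x[i:i + 3]) for i in range(len(x) - 2)])
print(out)   # 6 outputs, but only 3 parameters in total
```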

Page 41: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Neural networks with multiple layers

41

[Diagram: a network with several hidden layers between the input x and the output ŷ]

Page 42: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Neurons learn features of the dataset

42

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

Neurons in later layers will respond strongly to high-level features of your training data. If your training data is faces, you will get lots of face neurons.

If your training data is all of YouTube…

Top stimuli in test set / Optimal stimulus found by numerical optimization

…you get a cat neuron.

Page 43: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019 43

Page 44: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Multiple outputs?

Softmax is a generalization of the sigmoid function.

44

z ∈ ℝ:  P(Y = 1 | X = x) = σ(z)

sigmoid(z): a value in the range [0, 1]
(equivalent: Bernoulli p)

z ∈ ℝ^k:  P(Y = i | X = x) = softmax(z)_i

softmax(z): k values in the range [0, 1] that add up to 1
(equivalent: Multinomial p_1, …, p_k)
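A small numpy sketch of the two side by side (the max subtraction in softmax is a common numerical-stability trick, added here rather than taken from the slides):

```python
import numpy as np

def sigmoid(z):
    # One value in [0, 1] (equivalent to a Bernoulli parameter p).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # k values in [0, 1] that sum to 1 (equivalent to multinomial p_1, ..., p_k).
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(sigmoid(2.0))                          # ~0.88
print(softmax(np.array([2.0, 1.0, 0.1])))    # ~[0.66, 0.24, 0.10]
```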

Page 45: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Softmax test metric: Top-5 error

45

(class label: 5)

Predicted probabilities: 0.148, 0.137, 0.122, 0.109, 0.104, 0.091, 0.090, 0.096, 0.083, 0.05, …

Top-5 classification error: what % of datapoints did not have the correct class label in the top 5 predictions?
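A minimal sketch of how that metric could be computed from a matrix of predicted probabilities (the function and variable names here are assumptions):

```python
import numpy as np

def top5_error(probs, labels):
    # probs: (n, k) predicted class probabilities; labels: (n,) true class indices.
    top5 = np.argsort(probs, axis=1)[:, -5:]          # indices of the 5 largest probabilities
    hit = np.any(top5 == labels[:, None], axis=1)     # was the true label among them?
    return 1.0 - hit.mean()                           # fraction of datapoints missed
```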

Page 46: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

ImageNet classification

46

22,000 categories

14,000,000 images

Hand-engineered features (SIFT, HOG, LBP), Spatial pyramid, Sparse Coding/Compression

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

…smoothhound, smoothhound shark, Mustelus mustelus
American smooth dogfish, Mustelus canis
Florida smoothhound, Mustelus norrisi
whitetip shark, reef whitetip shark, Triaenodon obseus
Atlantic spiny dogfish, Squalus acanthias
Pacific spiny dogfish, Squalus suckleyi
hammerhead, hammerhead shark
smooth hammerhead, Sphyrna zygaena
smalleye hammerhead, Sphyrna tudes
shovelhead, bonnethead, bonnet shark, Sphyrna tiburo
angel shark, angelfish, Squatina squatina, monkfish
electric ray, crampfish, numbfish, torpedo
smalltooth sawfish, Pristis pectinatus
guitarfish
roughtail stingray, Dasyatis centroura
butterfly ray
eagle ray
spotted eagle ray, spotted ray, Aetobatus narinari
cownose ray, cow-nosed ray, Rhinoptera bonasus
manta, manta ray, devilfish
Atlantic manta, Manta birostris
devil ray, Mobula hypostoma
grey skate, gray skate, Raja batis
little skate, Raja erinacea
…

Stingray

Manta ray

Page 47: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

ImageNet classification challenge

47

22,000 categories

14,000,000 images

Hand-engineered features (SIFT, HOG, LBP), Spatial pyramid, Sparse Coding/Compression

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012


1000 categories

200,000 images in test set1,200,000 images in train set

Page 48: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

ImageNet challenge: Top-5 classification error

48

99.5% Random guess

P(true class label not in 5 guesses) = C(999, 5) / C(1000, 5) = 995/1000 = 99.5%

(lower is better)

Page 49: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

ImageNet challenge: Top-5 classification error

49

99.5% Random guess

25.8%

?

Pre-Neural Networks

GoogLeNet (2015)

16.4%

5.1% Humans (2014)

(lower is better)

Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge. IJCV 2015
Szegedy et al., Going Deeper With Convolutions. CVPR 2015
Hu et al., Squeeze-and-Excitation Networks. arXiv preprint, 2017

Page 50: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

ImageNet challenge: Top-5 classification error

50

99.5% Random guess

25.8%

?

Pre-Neural Networks

GoogLeNet (2015)

16.4%

5.1% Humans (2014)

SENet (2017)

2.25%

Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge. IJCV 2015
Szegedy et al., Going Deeper With Convolutions. CVPR 2015
Hu et al., Squeeze-and-Excitation Networks. arXiv preprint, 2017

(lower is better)

Page 51: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

GoogLeNet (2015)

51

1 Trillion Artificial Neurons (btw, human brains have roughly 100 billion neurons)

22 layers deep!

Multiple, multi-class outputs

Szegedy et al., Going Deeper With Convolutions. CVPR 2015

Page 52: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Good ML = Generalization

Goal of machine learning: build models that generalize well to predicting new data.
Overfitting: fitting the training data too well, so we lose generality of the model.

• The polynomial on the right fits the training data perfectly!
• Which would you rather use to predict a new data point?

52

Page 53: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Prevent overfitting?

53

Dropout (training technique)

When your model is training, randomly turn off your neurons with probability 0.5. It will make your network more robust.
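A rough sketch of that idea applied to a vector of hidden-layer activations (the rescaling by the keep probability is the common "inverted dropout" convention, assumed here rather than stated on the slide):

```python
import numpy as np

def dropout(h, keep_prob=0.5, training=True):
    if not training:
        return h                                   # at test time, use all neurons
    mask = np.random.rand(*h.shape) < keep_prob    # randomly keep each neuron w.p. 0.5
    return h * mask / keep_prob                    # zero out dropped neurons, rescale the rest
```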

Page 54: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Making decisions?

54

Not everything is classification.

Deep Reinforcement Learning

Instead of having the output of a model be a probability, you make the output an expectation.

http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html

Page 55: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Deep Reinforcement Learning

55

http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html

DeepMind Atari games: score compared to the best human

Page 56: 27: Intro to Deep Learning + Wrap-up

What’s missing?

56

Page 57: 27: Intro to Deep Learning + Wrap-up

Where is your data coming from?

57

Page 58: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Ethics and datasets

Sometimes machine learning feels universally unbiased.
We can even prove our estimators are "unbiased" (mathematically).
Google/Nikon/HP had biased datasets.

58

Page 59: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

Should your data be unbiased?

59

[Screenshot: first page of "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings," Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai (arXiv:1607.06520, 2016). The paper shows that word embeddings trained on Google News articles exhibit gender stereotypes, e.g. man − woman ≈ computer programmer − homemaker, and develops algorithms to debias the embedding while preserving its useful properties.]

Dataset: Google News

Bolukbasi et al., Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. NIPS 2016

Should our unbiased data collection reflect society’s systemic bias?

Page 60: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2019

How can we explain decisions?

60

If your task is image classification, reasoning about high-level features is relatively easy. Everything can be visualized.

What if you are trying to classify social outcomes?

• Criminal recidivism
• Job performance
• Policing
• Terrorist risk
• At-risk kids

Page 61: 27: Intro to Deep Learning + Wrap-up

Ethics in Machine Learning is a whole new field.

61

Page 62: 27: Intro to Deep Learning + Wrap-up

CS109 Wrap-Up

62

LIVE

Page 63: 27: Intro to Deep Learning + Wrap-up

63

What have we learned in CS109?

Page 64: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

A wild journey

64

Computer science Probability

Page 65: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

From combinatorics to probability…

65

P(E) + P(Eᶜ) = 1

Page 66: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

…to random variables and the Central Limit Theorem…

66

Bernoulli

Poisson

Gaussian

Page 67: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

…to statistics, parameter estimation, and machine learning

67

A happy Bhutanese person

[Figures: examples from statistics, parameter estimation, and machine learning, including a Bayesian network with nodes such as Flu, Undergrad, Tired, and Fever]

Page 68: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Lots and lots of analysis

68

Climate sensitivity

Bloom filters

Peer grading

Biometric keystroke recognition

WebMD inference

p-hacking?

Coursera A/B testing

Page 69: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Lots and lots of analysis

69

Heart

Ancestry Netflix

Page 70: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

After CS109

70

Statistics

Stats 200 – Statistical Inference
Stats 208 – Intro to the Bootstrap
Stats 209 – Group Methods/Causal Inference

Theory

CS161 – Algorithmic Analysis
Stats 217 – Stochastic Processes
CS238 – Decision Making Under Uncertainty
CS228 – Probabilistic Graphical Models

Page 71: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

After CS109

71

AI

CS 221 – Intro to AI
CS 229 – Machine Learning
CS 230 – Deep Learning
CS 224N – Natural Language Processing
CS 231N – Conv Neural Nets for Visual Recognition
CS 234 – Reinforcement Learning

Applications

CS 279 – Bio Computation
Literally any class with numbers in it

Page 72: 27: Intro to Deep Learning + Wrap-up

72

What have we done together this quarter?

Page 73: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

The CS109 teaching team

73

Page 74: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020 74

Section

Python session Overleaf tutorial

Discussion, synchronous and asynchronous

Page 75: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Tea hours

75

Turns out jigsaw puzzles aren’t probabilistic!

Page 76: 27: Intro to Deep Learning + Wrap-up

76

What have you learned from CS109?

Page 77: 27: Intro to Deep Learning + Wrap-up

77

What do you want to remember in 5 years?

Page 78: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

What do you want to remember in the next 5 years?

78

Page 79: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

What do you want to remember in the next 5 years?

79

Page 80: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

What do you want to remember in the next 5 years?

80

Page 81: 27: Intro to Deep Learning + Wrap-up

81

Why study probability + CS?

Page 82: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Why study probability + CS?

82

Source: US Bureau of Labor Statistics

Page 83: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Why study probability + CS?

83

Interdisciplinary Closest thing to magic

Page 84: 27: Intro to Deep Learning + Wrap-up

Lisa Yan, CS109, 2020

Why study probability + CS?

84

Everyone is welcome!

Page 85: 27: Intro to Deep Learning + Wrap-up

85

Technology magnifies.
What do we want magnified?

Page 86: 27: Intro to Deep Learning + Wrap-up

86

You are all one step closer to improving the world.

(all of you!)

Page 87: 27: Intro to Deep Learning + Wrap-up

87

The end

See you soon… :)