
CS7015 (Deep Learning) : Lecture 10
Learning Vectorial Representations Of Words

Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras

Acknowledgments

‘word2vec Parameter Learning Explained’ by Xin Rong

‘word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method’ by Yoav Goldberg and Omer Levy

Sebastian Ruder’s blogs on word embeddings (Blog1, Blog2, Blog3)


Module 10.1: One-hot representations of words


[Figure: a movie review ("This is by far AAMIR KHAN's best one. Finest casting and terrific acting by all.") is fed into a Model, which outputs a vector such as [5.7, 1.2, 2.3, -10.2, 4.5, ..., 11.9, 20.1, -0.5, 40.7]]

Let us start with a very simple motivation for why we are interested in vectorial representations of words

Suppose we are given an input stream of words (sentence, document, etc.) and we are interested in learning some function of it (say, y = sentiments(words))

Say, we employ a machine learning algorithm (some mathematical model) for learning such a function (y = f(x))

We first need a way of converting the input stream (or each word in the stream) to a vector x (a mathematical quantity)


Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

V = [human, machine, interface, for, computer, applications, user, opinion, of, system, response, time, management, engineering, improved]

machine: 0 1 0 ... 0 0 0

Given a corpus, consider the set V of all unique words across all input streams (i.e., all sentences or documents)

V is called the vocabulary of the corpus (i.e., all sentences or documents)

We need a representation for every word in V

One very simple way of doing this is to use one-hot vectors of size |V|

The representation of the i-th word will have a 1 in the i-th position and a 0 in the remaining |V| − 1 positions
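To make this concrete, here is a minimal numpy sketch of one-hot encoding for the toy corpus above; the helper names (word_to_idx, one_hot) are our own and not part of the lecture.

import numpy as np

corpus = [
    "human machine interface for computer applications",
    "user opinion of computer system response time",
    "user interface management system",
    "system engineering for improved response time",
]

# V: all unique words in the corpus, with a fixed index per word
vocab = sorted({w for sentence in corpus for w in sentence.split()})
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """|V|-dimensional vector with a 1 at the word's index and 0 elsewhere."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

print(len(vocab))          # |V| = 15 for this toy corpus
print(one_hot("machine"))  # 1 only in the position assigned to 'machine'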


cat:   0 0 0 0 0 1 0
dog:   0 1 0 0 0 0 0
truck: 0 0 0 1 0 0 0

euclid_dist(cat, dog) = √2
euclid_dist(dog, truck) = √2
cosine_sim(cat, dog) = 0
cosine_sim(dog, truck) = 0

Problems:

V tends to be very large (for example, 50K for PTB, 13M for Google 1T corpus)

These representations do not capture any notion of similarity

Ideally, we would want the representations of cat and dog (both domestic animals) to be closer to each other than the representations of cat and truck

However, with 1-hot representations, the Euclidean distance between any two words in the vocabulary is √2

And the cosine similarity between any two words in the vocabulary is 0
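These two claims are easy to verify numerically; a self-contained sketch using the one-hot vectors from the slide (the helper functions are our own):

import numpy as np

# one-hot vectors from the slide (|V| = 7)
cat   = np.array([0, 0, 0, 0, 0, 1, 0], dtype=float)
dog   = np.array([0, 1, 0, 0, 0, 0, 0], dtype=float)
truck = np.array([0, 0, 0, 1, 0, 0, 0], dtype=float)

def euclid_dist(a, b):
    return np.linalg.norm(a - b)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# every pair of distinct one-hot vectors is sqrt(2) apart and has cosine similarity 0
print(euclid_dist(cat, dog), euclid_dist(dog, truck))  # 1.414... 1.414...
print(cosine_sim(cat, dog), cosine_sim(dog, truck))    # 0.0 0.0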



Module 10.2: Distributed Representations of words


A bank is a financial institution that accepts deposits from the public and creates credit.

The idea is to use the accompanying words (financial, deposits, credit) to represent bank

"You shall know a word by the company it keeps" - Firth, J. R. 1957:11

Distributional similarity based representations

This leads us to the idea of a co-occurrence matrix


Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

          human  machine  system  for   ...  user
human     0      1        0       1     ...  0
machine   1      0        0       1     ...  0
system    0      0        0       1     ...  2
for       1      1        1       0     ...  0
...
user      0      0        2       0     ...  0

Co-occurrence Matrix

A co-occurrence matrix is a terms × terms matrix which captures the number of times a term appears in the context of another term

The context is defined as a window of k words around the terms

Let us build a co-occurrence matrix for this toy corpus with k = 2

This is also known as a word × context matrix

You could choose the set of words and contexts to be same or different

Each row (column) of the co-occurrence matrix gives a vectorial representation of the corresponding word (context)
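A small sketch of how such a matrix can be built, assuming a symmetric window of k words on either side of the target word; the exact counts depend on the windowing convention, so individual entries may differ slightly from the table above.

import numpy as np

corpus = [
    "human machine interface for computer applications",
    "user opinion of computer system response time",
    "user interface management system",
    "system engineering for improved response time",
]
k = 2  # words within distance k of the target word count as its context

vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)))

for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - k), min(len(words), i + k + 1)):
            if j != i:
                X[idx[w], idx[words[j]]] += 1  # w co-occurs with words[j]

print(X[idx["human"], idx["machine"]])  # 1.0: they are adjacent in the first sentence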


          human  machine  system  for   ...  user
human     0      1        0       1     ...  0
machine   1      0        0       1     ...  0
system    0      0        0       1     ...  2
for       1      1        1       0     ...  0
...
user      0      0        2       0     ...  0

Some (fixable) problems

Stop words (a, the, for, etc.) are very frequent → these counts will be very high

Solution 1: Ignore very frequent words

Solution 2: Use a threshold t (say, t = 100)

    X_ij = min(count(w_i, c_j), t),

where w is a word and c is a context.
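Solution 2 is a one-line operation on the count matrix; a sketch, with a tiny made-up matrix standing in for X:

import numpy as np

# toy co-occurrence counts; the last row/column plays the role of a stop word
X = np.array([[  0.,   1., 250.],
              [  1.,   0., 180.],
              [250., 180.,   0.]])

t = 100                        # threshold from the slide's example
X_clipped = np.minimum(X, t)   # X_ij = min(count(w_i, c_j), t)
print(X_clipped)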


          human  machine  system  for   ...  user
human     0      2.944    0       2.25  ...  0
machine   2.944  0        0       2.25  ...  0
system    0      0        0       1.15  ...  1.84
for       2.25   2.25     1.15    0     ...  0
...
user      0      0        1.84    0     ...  0

Some (fixable) problems

Solution 3: Instead of count(w, c) use PMI(w, c)

    PMI(w, c) = log ( p(c|w) / p(c) )
              = log ( (count(w, c) * N) / (count(c) * count(w)) )

N is the total number of words

If count(w, c) = 0, PMI(w, c) = −∞

Instead use,

    PMI_0(w, c) = PMI(w, c)   if count(w, c) > 0
                = 0           otherwise

or

    PPMI(w, c) = PMI(w, c)   if PMI(w, c) > 0
               = 0           otherwise
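A sketch of the PPMI transformation, assuming X is a co-occurrence count matrix; here N is taken as the total of all counts in X (a simplification of "total number of words"), and zero counts are mapped to 0 rather than −∞, as on the slide.

import numpy as np

def ppmi(X):
    """Positive PMI from a co-occurrence count matrix X (rows: words, columns: contexts)."""
    N = X.sum()                              # total number of co-occurrence counts
    count_w = X.sum(axis=1, keepdims=True)   # count(w)
    count_c = X.sum(axis=0, keepdims=True)   # count(c)
    with np.errstate(divide="ignore"):
        pmi = np.log(X * N / (count_w * count_c))
    pmi[~np.isfinite(pmi)] = 0.0             # count(w, c) = 0  ->  0 instead of -inf
    return np.maximum(pmi, 0.0)              # PPMI: clamp negative PMI values to 0

# tiny symmetric count matrix, just to exercise the function
X = np.array([[0., 3., 1.],
              [3., 0., 2.],
              [1., 2., 0.]])
print(np.round(ppmi(X), 3))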


          human  machine  system  for   ...  user
human     0      2.944    0       2.25  ...  0
machine   2.944  0        0       2.25  ...  0
system    0      0        0       1.15  ...  1.84
for       2.25   2.25     1.15    0     ...  0
...
user      0      0        1.84    0     ...  0

Some (severe) problems

Very high dimensional (|V|)

Very sparse

Grows with the size of the vocabulary

Solution: Use dimensionality reduction (SVD)



Module 10.3: SVD for learning word representations


    X_{m×n} = U_{m×k} Σ_{k×k} V^T_{k×n}
    where U = [u1 · · · uk], Σ = diag(σ1, . . . , σk), and the rows of V^T are v1^T, . . . , vk^T

Singular Value Decomposition gives a rank-k approximation of the original matrix

    X = X_PPMI (m×n) = U_{m×k} Σ_{k×k} V^T_{k×n}

X_PPMI (simplifying notation to X) is the co-occurrence matrix with PPMI values

SVD gives the best rank-k approximation of the original data (X)

Discovers latent semantics in the corpus (let us examine this with the help of an example)
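A minimal numpy sketch of this rank-k approximation; rank_k_approx is our own helper name, and the matrix below is a toy stand-in for the PPMI matrix X.

import numpy as np

def rank_k_approx(X, k):
    """Best rank-k approximation of X (Eckart-Young), via numpy's SVD."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# toy symmetric PPMI-like matrix (4 words x 4 contexts)
X = np.array([[0. , 2.9, 0. , 2.3],
              [2.9, 0. , 0. , 2.3],
              [0. , 0. , 0. , 1.2],
              [2.3, 2.3, 1.2, 0. ]])

X_hat = rank_k_approx(X, k=2)   # equivalently: σ1·u1·v1^T + σ2·u2·v2^T
print(np.round(X_hat, 2))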


    X_{m×n} = U_{m×k} Σ_{k×k} V^T_{k×n}
            = σ1 u1 v1^T + σ2 u2 v2^T + · · · + σk uk vk^T

Notice that the product can be written as a sum of k rank-1 matrices

Each σi ui vi^T ∈ R^{m×n} because it is a product of an m × 1 vector with a 1 × n vector

If we truncate the sum at σ1 u1 v1^T then we get the best rank-1 approximation of X (by the SVD theorem! But what does this mean? We will see on the next slide)

If we truncate the sum at σ1 u1 v1^T + σ2 u2 v2^T then we get the best rank-2 approximation of X, and so on


    X_{m×n} = U_{m×k} Σ_{k×k} V^T_{k×n}
            = σ1 u1 v1^T + σ2 u2 v2^T + · · · + σk uk vk^T

What do we mean by approximation here?

Notice that X has m × n entries

When we use the rank-1 approximation we are using only n + m + 1 entries to reconstruct it [u ∈ R^m, v ∈ R^n, σ ∈ R]

But the SVD theorem tells us that u1, v1 and σ1 store the most information in X (akin to the principal components in X)

Each subsequent term (σ2 u2 v2^T, σ3 u3 v3^T, . . . ) stores less and less important information


very light green: 0 0 0 1   1 0 1 1
light green:      0 0 1 0   1 0 1 1
dark green:       0 1 0 0   1 0 1 1
very dark green:  1 0 0 0   1 0 1 1
(the first four bits encode the shade, the last four bits encode the color green)

As an analogy, consider the case when we are using 8 bits to represent colors

The representations of very light, light, dark and very dark green would look different

But now what if we were asked to compress this into 4 bits? (akin to compressing m × n values into m + n + 1 values on the previous slide)

We will retain the most important 4 bits, and the previously (slightly) latent similarity between the colors now becomes very obvious

Something similar is guaranteed by SVD (retain the most important information and discover the latent similarities between words)


Co-occurrence Matrix (X):

          human  machine  system  for   ...  user
human     0      2.944    0       2.25  ...  0
machine   2.944  0        0       2.25  ...  0
system    0      0        0       1.15  ...  1.84
for       2.25   2.25     1.15    0     ...  0
...
user      0      0        1.84    0     ...  0

Low-rank reconstruction of X (via SVD):

          human  machine  system  for   ...  user
human     2.01   2.01     0.23    2.14  ...  0.43
machine   2.01   2.01     0.23    2.14  ...  0.43
system    0.23   0.23     1.17    0.96  ...  1.29
for       2.14   2.14     0.96    1.87  ...  -0.13
...
user      0.43   0.43     1.29    -0.13 ...  1.71

Notice that after low-rank reconstruction with SVD, the latent co-occurrence between {system, machine} and {human, user} has become visible

X =
          human  machine  system  for   ...  user
human     0      2.944    0       2.25  ...  0
machine   2.944  0        0       2.25  ...  0
system    0      0        0       1.15  ...  1.84
for       2.25   2.25     1.15    0     ...  0
...
user      0      0        1.84    0     ...  0

XX^T =
          human  machine  system  for    ...  user
human     32.5   23.9     7.78    20.25  ...  7.01
machine   23.9   32.5     7.78    20.25  ...  7.01
system    7.78   7.78     0       17.65  ...  21.84
for       20.25  20.25    17.65   36.3   ...  11.8
...
user      7.01   7.01     21.84   11.8   ...  28.3

cosine_sim(human, user) = 0.21

Recall that earlier each row of the original matrix X served as the representation of a word

Then XX^T is a matrix whose ij-th entry is the dot product between the representation of word i (X[i, :]) and word j (X[j, :])

    [ 1 2 3 ]   [ 1 2 1 ]   [ .  .  22 ]
    [ 2 1 0 ] · [ 2 1 3 ] = [ .  .   . ]
    [ 1 3 5 ]   [ 3 0 5 ]   [ .  .   . ]
        X          X^T          XX^T

The ij-th entry of XX^T thus (roughly) captures the cosine similarity between word_i and word_j
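The same 3×3 example can be checked directly; a short numpy sketch (the cosine_sim helper is our own):

import numpy as np

# the 3x3 example from the slide: rows are word representations
X = np.array([[1., 2., 3.],
              [2., 1., 0.],
              [1., 3., 5.]])

G = X @ X.T      # G[i, j] = dot product of row i and row j of X
print(G[0, 2])   # 1*1 + 2*3 + 3*5 = 22.0, the entry shown above

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine_sim(X[0], X[2]), 3))  # the normalised version of the same quantity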


Page 71: CS7015 (Deep Learning) : Lecture 10miteshk/CS7015/Slides/... · 2018-12-27 · 2/70 Acknowledgments ‘word2vec Parameter Learning Explained’ by Xin Rong ‘word2vec Explained:

20/70

X =

            human   machine  system   for     ...   user
human        2.01    2.01     0.23    2.14    ...    0.43
machine      2.01    2.01     0.23    2.14    ...    0.43
system       0.23    0.23     1.17    0.96    ...    1.29
for          2.14    2.14     0.96    1.87    ...   -0.13
...
user         0.43    0.43     1.29   -0.13    ...    1.71

XX^T =

            human   machine  system   for     ...   user
human       25.4    25.4      7.6     21.9    ...    6.84
machine     25.4    25.4      7.6     21.9    ...    6.84
system       7.6     7.6     24.8     18.03   ...   20.6
for         21.9    21.9      0.96    24.6    ...   15.32
...
user         6.84    6.84    20.6     15.32   ...   17.11

cosine sim(human, user) = 0.33

Once we do an SVD, what is a good choice for the representation of word_i?

Obviously, taking the i-th row of the reconstructed matrix does not make sense because it is still high dimensional

But we saw that the reconstructed matrix X = UΣV^T discovers latent semantics and its word representations are more meaningful

Wishlist: We would want representations of words (i, j) to be of smaller dimensions but still have the same similarity (dot product) as the corresponding rows of X



Notice that the dot product between the rows of the matrix W_word = UΣ is the same as the dot product between the rows of X

XX^T = (UΣV^T)(UΣV^T)^T
     = (UΣV^T)(VΣ^T U^T)
     = UΣΣ^T U^T                (∵ V^T V = I)
     = (UΣ)(UΣ)^T = W_word W_word^T

Conventionally,

W_word = UΣ ∈ R^{m×k}

is taken as the representation of the m words in the vocabulary, and

W_context = V

is taken as the representation of the context words
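Under these conventions, a minimal NumPy sketch of the recipe would look as follows (the matrix X here is a random stand-in for the co-occurrence/PPMI matrix and k is an arbitrary choice of latent dimension); it also checks that the dot products between rows of W_word match those between rows of the rank-k reconstruction:

import numpy as np

X = np.random.rand(6, 8)                    # stand-in for the m x n co-occurrence (e.g., PPMI) matrix
k = 2                                       # number of latent dimensions to keep

U, s, Vt = np.linalg.svd(X, full_matrices=False)
W_word = U[:, :k] @ np.diag(s[:k])          # m x k: one row per vocabulary word (U Sigma)
W_context = Vt[:k, :].T                     # n x k: one row per context word (V)

X_hat = W_word @ W_context.T                # rank-k reconstruction U Sigma V^T
print(np.allclose(W_word @ W_word.T, X_hat @ X_hat.T))   # True: same pairwise dot products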


Module 10.4: Continuous bag of words model


The methods that we have seen so far are called count-based models because they use the co-occurrence counts of words

We will now see methods which directly learn word representations (these are called (direct) prediction-based models)


The story ahead ...

Continuous bag of words model

Skip gram model with negative sampling (the famous word2vec)

GloVe word embeddings

Evaluating word embeddings

Good old SVD does just fine!!


Sometime in the 21st century, Joseph Cooper, a widowed former engineer and former NASA pilot, runs a farm with his father-in-law Donald, son Tom, and daughter Murphy. It is a post-truth society (Cooper is reprimanded for telling Murphy that the Apollo missions did indeed happen) and a series of crop blights threatens humanity's survival. Murphy believes her bedroom is haunted by a poltergeist. When a pattern is created out of dust on the floor, Cooper realizes that gravity is behind its formation, not a "ghost". He interprets the pattern as a set of geographic coordinates formed into binary code. Cooper and Murphy follow the coordinates to a secret NASA facility, where they are met by Cooper's former professor, Dr. Brand.

Some sample 4-word windows from a corpus

Consider this task: predict the n-th word given the previous n-1 words

Example: he sat on a chair

Training data: all n-word windows in your corpus

Training data for this task is easily available (take all n-word windows from the whole of Wikipedia)

For ease of illustration, we will first focus on the case when n = 2 (i.e., predict the second word based on the first word)
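As a concrete illustration (the toy sentence below is made up, not from the lecture), here is a minimal Python sketch of how such training pairs could be harvested from a corpus; for n = 2 every consecutive pair (previous word, next word) becomes one training example:

corpus = "he sat on a chair . he sat on a bench".split()

# n = 2: each (previous word, next word) pair is one training example
pairs = [(corpus[i], corpus[i + 1]) for i in range(len(corpus) - 1)]
# [('he', 'sat'), ('sat', 'on'), ('on', 'a'), ('a', 'chair'), ...]

# General n: every n-word window contributes (first n-1 words -> n-th word)
n = 4
examples = [(corpus[i:i + n - 1], corpus[i + n - 1]) for i in range(len(corpus) - n + 1)]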


We will now try to answer these two questions:

How do you model this task?

What is the connection between this task and learning word representations?


[Figure: a feedforward network. The one-hot vector x ∈ R^{|V|} of the context word (here 'sat') is mapped to a hidden representation h ∈ R^k by W_context ∈ R^{k×|V|}, and W_word ∈ R^{k×|V|} then produces the output distribution P(he|sat), P(chair|sat), ..., P(on|sat), ... over the vocabulary.]

We will model this problem using a feedforward neural network

Input: one-hot representation of the context word

Output: there are |V| words (classes) possible and we want to predict a probability distribution over these |V| classes (a multi-class classification problem)

Parameters: W_context ∈ R^{k×|V|} and W_word ∈ R^{k×|V|} (we are assuming that the set of words and context words is the same: each of size |V|)


What is the product W_context x, given that x is a one-hot vector?

It is simply the i-th column of W_context:

[-1   0.5   2]   [0]   [ 0.5]
[ 3  -1    -2] × [1] = [-1  ]
[-2   1.7   3]   [0]   [ 1.7]

So when the i-th word is present, the i-th element in the one-hot vector is ON and the i-th column of W_context gets selected

In other words, there is a one-to-one correspondence between the words and the columns of W_context

More specifically, we can treat the i-th column of W_context as the representation of context i
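A minimal NumPy sketch of this column-selection effect, reusing the toy 3×3 matrix from the example above (the vocabulary size is illustrative):

import numpy as np

W_context = np.array([[-1.0,  0.5,  2.0],
                      [ 3.0, -1.0, -2.0],
                      [-2.0,  1.7,  3.0]])   # toy k x |V| matrix from the example
x = np.array([0.0, 1.0, 0.0])                # one-hot vector with the 2nd entry ON

h = W_context @ x
print(h)                                     # [ 0.5 -1.   1.7]
print(np.allclose(h, W_context[:, 1]))       # True: the product just picks out column 1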


P(on|sat) = exp((W_word h)[i]) / Σ_j exp((W_word h)[j])

How do we obtain P(on|sat)? For this multi-class classification problem, what is an appropriate output function? (softmax)

Therefore, P(on|sat) is proportional to the dot product between the j-th column of W_context and the i-th column of W_word

P(word = i|sat) thus depends on the i-th column of W_word

We thus treat the i-th column of W_word as the representation of word i

Hope you see an analogy with SVD! (there we had a different way of learning W_context and W_word, but we saw that the i-th column of W_word corresponded to the representation of the i-th word)

Now that we understand the interpretation of W_context and W_word, our aim is to learn these parameters


We denote the context word (sat) by the index c and the correct output word (on) by the index w

For this multiclass classification problem, what is an appropriate output function (y = f(x))? softmax

What is an appropriate loss function? cross-entropy

L(θ) = − log y_w = − log P(w|c)

h = W_context · x_c = u_c

y_w = exp(u_c · v_w) / Σ_{w′∈V} exp(u_c · v_{w′})

u_c is the column of W_context corresponding to context c and v_w is the column of W_word corresponding to word w
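Putting the forward pass and the loss together, here is a minimal NumPy sketch of one training example; the vocabulary size, embedding size and the indices c and w are toy placeholders, not values from the lecture:

import numpy as np

V_size, k = 5, 3                             # toy vocabulary and embedding sizes
rng = np.random.default_rng(0)
W_context = rng.normal(size=(k, V_size))     # u_c = W_context[:, c]
W_word = rng.normal(size=(k, V_size))        # v_w = W_word[:, w]

c, w = 2, 4                                  # indices of the context word and the target word

u_c = W_context[:, c]                        # h = W_context @ one_hot(c)
scores = W_word.T @ u_c                      # u_c . v_w' for every word w'
y = np.exp(scores) / np.exp(scores).sum()    # softmax over the vocabulary

loss = -np.log(y[w])                         # cross-entropy: -log y_w = -log P(w|c)
print(loss)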


How do we train this simple feedforward neural network? Backpropagation

Let us consider one input-output pair (c, w) and see the update rule for v_w


∇v_w = ∂L(θ)/∂v_w

L(θ) = − log y_w
     = − log [ exp(u_c · v_w) / Σ_{w′∈V} exp(u_c · v_{w′}) ]
     = − ( u_c · v_w − log Σ_{w′∈V} exp(u_c · v_{w′}) )

∇v_w = − ( u_c − [ exp(u_c · v_w) / Σ_{w′∈V} exp(u_c · v_{w′}) ] · u_c )
     = − u_c (1 − y_w)

And the update rule would be

v_w = v_w − η ∇v_w = v_w + η u_c (1 − y_w)
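Continuing in the same spirit, a minimal NumPy sketch of this stochastic gradient step for one (c, w) pair (all sizes, indices and the learning rate are toy placeholders; only the update of v_w is shown, the other columns of W_word and the column u_c receive analogous updates):

import numpy as np

V_size, k, eta = 5, 3, 0.1                   # toy sizes and learning rate
rng = np.random.default_rng(0)
W_context = rng.normal(size=(k, V_size))
W_word = rng.normal(size=(k, V_size))
c, w = 2, 4                                  # one (context, word) training pair

u_c = W_context[:, c]
y = np.exp(W_word.T @ u_c)
y /= y.sum()                                 # softmax output y_w' for every word w'

grad_v_w = -u_c * (1.0 - y[w])               # gradient of -log y_w with respect to v_w
W_word[:, w] -= eta * grad_v_w               # i.e. v_w <- v_w + eta * u_c * (1 - y_w)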


This update rule has a nice interpretation

v_w = v_w + η u_c (1 − y_w)

If y_w → 1 then we are already predicting the right word and v_w will not be updated

If y_w → 0 then v_w gets updated by adding a fraction of u_c to it

This increases the cosine similarity between v_w and u_c (How? Refer to slide 38 of Lecture 2)

The training objective ensures that the cosine similarity between the word (v_w) and the context word (u_c) is maximized
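To see the "increases the cosine similarity" claim numerically, here is a minimal NumPy sketch with randomly chosen toy vectors (eta and y_w are placeholders); adding a positive multiple of u_c to v_w can only move v_w towards u_c:

import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(1)
u_c, v_w = rng.normal(size=3), rng.normal(size=3)
eta, y_w = 0.1, 0.2                          # toy learning rate and current prediction y_w

before = cos(v_w, u_c)
v_w = v_w + eta * u_c * (1.0 - y_w)          # the update rule from above
after = cos(v_w, u_c)
print(before, after)                         # 'after' >= 'before' (equal only if already aligned)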

34/70


What happens to the representations of two words w and w′ which tend to appear in similar contexts (c)?

The training ensures that both v_w and v_{w′} have a high cosine similarity with u_c and hence transitively (intuitively) ensures that v_w and v_{w′} have a high cosine similarity with each other

This is only an intuition (reasonable)

Haven't come across a formal proof for this!

35/70

[Figure: CBOW network with two context words ('he' and 'sat') — concatenated one-hot input x ∈ R^{2|V|}, [W_context, W_context] ∈ R^{k×2|V|}, hidden layer h ∈ R^k, W_word, and output probabilities P(he|sat,he), P(chair|sat,he), P(man|sat,he), P(on|sat,he), ...]

In practice, instead of a window size of 1 it is common to use a window size of d

So now,

h = ∑_{i=1}^{d−1} u_{c_i}

[W_context, W_context] just means that we are stacking 2 copies of the W_context matrix

[W_context, W_context] =
    [ −1   0.5    2   −1   0.5    2
       3  −1     −2    3  −1     −2
      −2   1.7    3   −2   1.7    3 ]

x = [0 1 0 0 0 1]^T   (the 1 in the first block selects 'sat', the 1 in the second block selects 'he')

[W_context, W_context] x = [2.5  −3  4.7]^T

The resultant product would simply be the sum of the columns corresponding to 'sat' and 'he'

36/70

Of course in practice we will not do this expensive matrix multiplication

If 'he' is the i-th word in the vocabulary and 'sat' is the j-th word, then we will simply access the columns W[:, i] and W[:, j] of W_context and add them
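A quick numpy check of this equivalence (my own sketch), using the toy [W_context, W_context] example from the previous slide; which column is 'he' and which is 'sat' is an assumption made to match the one-hot vector shown there. The stacked matrix-vector product and the direct column lookup give the same h.

import numpy as np

W_context = np.array([[-1.0,  0.5,  2.0],
                      [ 3.0, -1.0, -2.0],
                      [-2.0,  1.7,  3.0]])       # k = 3, |V| = 3 (toy)

j, i = 1, 2                                       # suppose 'sat' is word j and 'he' is word i

# expensive version: stack two copies and multiply by the concatenated one-hot input
W_stacked = np.hstack([W_context, W_context])     # k x 2|V|
x = np.zeros(6)
x[j] = 1.0                                        # 'sat' in the first block
x[3 + i] = 1.0                                    # 'he' in the second block
h_matmul = W_stacked @ x

# cheap version: just look up the two columns and add them
h_lookup = W_context[:, i] + W_context[:, j]

print(h_matmul)                                   # [ 2.5 -3.   4.7]
print(np.allclose(h_matmul, h_lookup))            # True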

37/70

Now what happens during backpropagation?

Recall that

h = ∑_{i=1}^{d−1} u_{c_i}

and

P(on | sat, he) = exp((W_word h)[k]) / ∑_j exp((W_word h)[j])

where 'k' is the index of the word 'on'

The loss function depends on {W_word, u_{c_1}, u_{c_2}, ..., u_{c_{d−1}}} and all these parameters will get updated during backpropagation

Try deriving the update rule for v_w now and see how it differs from the one we derived before
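As a reference point for that exercise, here is a minimal forward-and-backward sketch for the d-word-window case (toy sizes and indices are my own assumptions; W_word is stored here as |V|×k so that W_word @ h yields one score per word).

import numpy as np

np.random.seed(1)
V, k = 8, 5
W_context = np.random.randn(k, V)        # columns are the context vectors u_c
W_word    = np.random.randn(V, k)        # |V| x k, so W_word @ h gives one score per word

context_idx = [2, 6]                     # say these are 'sat' and 'he'
target_idx  = 4                          # say this is 'on'

h = W_context[:, context_idx].sum(axis=1)          # h = sum_i u_{c_i}
scores = W_word @ h
y_hat = np.exp(scores - scores.max())
y_hat /= y_hat.sum()                                # y_hat[j] = P(word j | sat, he)

loss = -np.log(y_hat[target_idx])

# backpropagation touches W_word and (only) the context columns that were summed into h
grad_scores = y_hat.copy()
grad_scores[target_idx] -= 1.0
grad_W_word = np.outer(grad_scores, h)              # dL/dW_word
grad_u_c    = W_word.T @ grad_scores                # shared gradient for u_{c_1}, ..., u_{c_{d-1}}
print(loss)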

38/70


Some problems:

Notice that the softmax function at the output is computationally very expensive

ŷ_w = exp(u_c · v_w) / ∑_{w′∈V} exp(u_c · v_{w′})

The denominator requires a summation over all words in the vocabulary

We will revisit this issue soon

39/70

Module 10.5: Skip-gram model

40/70

The model that we just saw is called the continuous bag of words model (it predicts an output word given a bag of context words)

We will now see the skip-gram model (which predicts context words given an input word)

41/70

[Figure: skip-gram network — one-hot input x ∈ R^{|V|} for the given word, hidden layer h ∈ R^k, parameter matrices W_context ∈ R^{k×|V|} and W_word ∈ R^{k×|V|}, and output predictions for the context words 'he', 'sat', 'a', 'chair']

Notice that the role of context and word has changed now

In the simple case when there is only one context word, we will arrive at the same update rule for u_c as we did for v_w earlier

Notice that even when we have multiple context words the loss function would just be a summation of many cross-entropy errors

L(θ) = − ∑_{i=1}^{d−1} log ŷ_{w_i}

Typically, we predict context words on both sides of the given word
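A minimal sketch of this loss (my own code, toy sizes; the two matrices are given neutral names W_in and W_out since the lecture swaps the roles of the word and context matrices here): one softmax is computed from the input word and reused at every context position, and the total loss is the sum of the per-position cross-entropy terms.

import numpy as np

np.random.seed(2)
V, k = 10, 4
W_in  = np.random.randn(k, V)            # input-side vectors, one column per word
W_out = np.random.randn(V, k)            # output-side vectors, one row per word

w = 3                                    # index of the given word, e.g. 'on'
context_idx = [1, 5, 7, 8]               # indices of the context words on both sides

h = W_in[:, w]
scores = W_out @ h
y_hat = np.exp(scores - scores.max())
y_hat /= y_hat.sum()                     # one softmax, reused for every context position

loss = -sum(np.log(y_hat[c]) for c in context_idx)   # L(theta) = -sum_i log y_hat_{w_i}
print(loss)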

42/70


Some problems

Same as bag of words: the softmax function at the output is computationally expensive

Solution 1: Use negative sampling

Solution 2: Use contrastive estimation

Solution 3: Use hierarchical softmax

43/70

D = [ (sat, on), (sat, a), (sat, chair), (on, a), (on, chair), (a, chair), (on, sat), (a, sat), (chair, sat), (a, on), (chair, on), (chair, a) ]

D′ = [ (sat, oxygen), (sat, magic), (chair, sad), (chair, walking) ]

Let D be the set of all correct (w, c) pairs in the corpus

Let D′ be the set of all incorrect (w, r) pairs in the corpus

D′ can be constructed by randomly sampling a context word r which has never appeared with w and creating a pair (w, r)

As before let v_w be the representation of the word w and u_c be the representation of the context word c
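A small sketch of how D and D′ could be built from a toy corpus (my own construction; the window size and the extra "never co-occurring" words are assumptions, so the pairs produced here will not exactly match the illustrative D shown above):

import random

random.seed(0)
corpus = "he sat on a chair".split()
vocab = set(corpus) | {"oxygen", "magic", "sad", "walking"}
window = 2

# D: all (word, context) pairs that co-occur within the window
D = []
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            D.append((w, corpus[j]))

# D': pair a word with a randomly sampled word that never appears in its context
cooccur = {w: {c for (x, c) in D if x == w} for w in corpus}
D_neg = []
for (w, _) in D[:4]:
    candidates = [r for r in vocab if r not in cooccur[w] and r != w]
    D_neg.append((w, random.choice(candidates)))

print(D[:6])
print(D_neg)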

44/70

[Figure: v_w and u_c fed into a dot product followed by a sigmoid σ, producing P(z = 1 | w, c)]

For a given (w, c) ∈ D we are interested in maximizing

p(z = 1 | w, c)

Let us model this probability by

p(z = 1 | w, c) = σ(u_c^T v_w) = 1 / (1 + e^{−u_c^T v_w})

Considering all (w, c) ∈ D, we are interested in

maximize_θ  ∏_{(w,c)∈D} p(z = 1 | w, c)

where θ is the word representation (v_w) and context representation (u_c) for all words in our corpus

45/70

[Figure: v_w and u_r fed into a dot product followed by a sigmoid σ, producing P(z = 0 | w, r)]

For (w, r) ∈ D′ we are interested in maximizing

p(z = 0 | w, r)

Again we model this as

p(z = 0 | w, r) = 1 − σ(u_r^T v_w)
                = 1 − 1 / (1 + e^{−u_r^T v_w})
                = 1 / (1 + e^{u_r^T v_w}) = σ(−u_r^T v_w)

Considering all (w, r) ∈ D′, we are interested in

maximize_θ  ∏_{(w,r)∈D′} p(z = 0 | w, r)

46/70


Combining the two we get:

maximize_θ  ∏_{(w,c)∈D} p(z = 1 | w, c)  ∏_{(w,r)∈D′} p(z = 0 | w, r)

= maximize_θ  ∏_{(w,c)∈D} p(z = 1 | w, c)  ∏_{(w,r)∈D′} (1 − p(z = 1 | w, r))

= maximize_θ  ∑_{(w,c)∈D} log p(z = 1 | w, c) + ∑_{(w,r)∈D′} log(1 − p(z = 1 | w, r))

= maximize_θ  ∑_{(w,c)∈D} log [ 1 / (1 + e^{−u_c^T v_w}) ] + ∑_{(w,r)∈D′} log [ 1 / (1 + e^{u_r^T v_w}) ]

= maximize_θ  ∑_{(w,c)∈D} log σ(u_c^T v_w) + ∑_{(w,r)∈D′} log σ(−u_r^T v_w)

where σ(x) = 1 / (1 + e^{−x})
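This objective is easy to evaluate and differentiate directly. Below is a numpy sketch (my own, with toy random embeddings and two made-up negatives) of the negative-sampling log-likelihood for one positive pair, plus one gradient-ascent step on v_w.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

np.random.seed(3)
k = 4
v_w = np.random.randn(k)                 # word vector
u_c = np.random.randn(k)                 # context vector of the true context word
U_r = np.random.randn(2, k)              # rows are the sampled negative context vectors

# objective (to be maximized):  log sigma(u_c^T v_w) + sum_r log sigma(-u_r^T v_w)
objective = np.log(sigmoid(u_c @ v_w)) + np.log(sigmoid(-U_r @ v_w)).sum()

# one gradient-ascent step on v_w
eta = 0.1
grad_v_w = (1 - sigmoid(u_c @ v_w)) * u_c - (sigmoid(U_r @ v_w)[:, None] * U_r).sum(axis=0)
v_w += eta * grad_v_w
print(objective)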

47/70


In the original paper, Mikolov et al. sample k negative (w, r) pairs for every positive (w, c) pair

The size of D′ is thus k times the size of D

The random context word is drawn from a modified unigram distribution

r ∼ p(r)^{3/4}

i.e., r ∼ count(r)^{3/4} / N

N = total number of words in the corpus
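A sketch of drawing negative context words from this smoothed unigram distribution (my own code; the counts are made up for illustration):

import numpy as np

np.random.seed(4)
counts = {"the": 500, "chair": 40, "sat": 25, "oxygen": 3, "abracadabra": 1}
words = list(counts)

freq = np.array([counts[w] for w in words], dtype=float)
p_noise = freq ** 0.75
p_noise /= p_noise.sum()                 # raising counts to the power 3/4 flattens the
                                         # distribution, so rare words are sampled a bit more often

negatives = np.random.choice(words, size=5, p=p_noise)
print(dict(zip(words, p_noise.round(3))))
print(list(negatives))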

48/70

Module 10.6: Contrastive estimation

50/70

Positive: He sat on a chair

[Figure: the word pair ('sat', 'on') is embedded as (v_c, v_w), passed through a hidden layer with W_h ∈ R^{2d×h} and an output layer W_out ∈ R^{h×1} to produce a score s]

Negative: He sat abracadabra a chair

[Figure: the corrupted pair ('sat', 'abracadabra') is scored by the same network to produce s_c]

We would like s to be greater than s_c

Okay, so let us try to maximize s − s_c

But we would like the difference to be at least m, so we can try to maximize s − (s_c + m)

What if s > s_c + m? (don't do anything)

In other words, minimize max(0, s_c + m − s) (equivalently, maximize min(0, s − (s_c + m))); the objective is zero, and nothing gets updated, once s exceeds s_c + m
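A sketch of this ranking objective for one (positive, corrupted) pair, written as a hinge loss to minimize (my own code; the scoring network is abstracted away and s, s_c are passed in directly):

def hinge_loss(s, s_c, m=1.0):
    # zero (no update) once s exceeds s_c + m; otherwise pushes s up and s_c down
    return max(0.0, s_c + m - s)

print(hinge_loss(s=2.5, s_c=0.3))   # 0.0 -> correct pair already wins by the margin
print(hinge_loss(s=0.4, s_c=0.3))   # 0.9 -> positive loss, so the network gets updated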

51/70

Module 10.7: Hierarchical softmax

53/70

[Figure: the |V|-way softmax over e^{v_c^T u_w} is replaced by a binary tree over the output words; the one-hot input for 'sat' gives h = v_c, internal nodes carry vectors u_1, u_2, ..., u_V, and the path from the root to the leaf 'on' has π(on)_1 = 1, π(on)_2 = 0, π(on)_3 = 0]

Construct a binary tree such that there are |V| leaf nodes, each corresponding to one word in the vocabulary

There exists a unique path from the root node to a leaf node.

Let l(w_1), l(w_2), ..., l(w_p) be the nodes on the path from the root to w

Let π(w) be a binary vector such that:

π(w)_k = 1 if the path branches left at node l(w_k)
       = 0 otherwise

Finally, each internal node is associated with a vector u_i

So the parameters of the module are W_context and u_1, u_2, ..., u_V (in effect, we have the same number of parameters as before)

54/70


For a given pair (w, c) we are interested in the probability p(w|v_c)

We model this probability as

p(w|v_c) = ∏_k P(π(w)_k | v_c)

For example

P(on|v_sat) = P(π(on)_1 = 1 | v_sat) ∗ P(π(on)_2 = 0 | v_sat) ∗ P(π(on)_3 = 0 | v_sat)

In effect, we are saying that the probability of predicting a word is the same as predicting the correct unique path from the root node to that word

55/70


We model

P(π(on)_i = 1) = 1 / (1 + e^{-v_c^T u_i})

P(π(on)_i = 0) = 1 − P(π(on)_i = 1)
              = 1 / (1 + e^{v_c^T u_i})

The above model ensures that the representation of a context word v_c will have a high (low) similarity with the representation of the node u_i if u_i appears on the path to the word and the path branches to the left (right) at u_i

Again, transitively, the representations of contexts which appear with the same words will have high similarity
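A short sketch of these two per-node probabilities (v_c and u_i below are just random illustrative vectors):

import numpy as np

def p_branch_left(v_c, u_i):
    # P(pi(w)_i = 1 | v_c) = 1 / (1 + exp(-v_c . u_i))
    return 1.0 / (1.0 + np.exp(-np.dot(v_c, u_i)))

def p_branch_right(v_c, u_i):
    # P(pi(w)_i = 0 | v_c) = 1 - P(pi(w)_i = 1 | v_c) = 1 / (1 + exp(v_c . u_i))
    return 1.0 / (1.0 + np.exp(np.dot(v_c, u_i)))

v_c, u_i = np.random.randn(4), np.random.randn(4)
assert np.isclose(p_branch_left(v_c, u_i) + p_branch_right(v_c, u_i), 1.0)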


56/70


P(w|v_c) = ∏_{k=1}^{|π(w)|} P(π(w)_k | v_c)

Note that p(w|v_c) can now be computed using |π(w)| computations instead of the |V| required by softmax

How do we construct the binary tree?

Turns out that even a random arrangement of the words on the leaf nodes does well in practice
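Putting the pieces together, a sketch of p(w|v_c) using the tree codes and per-node sigmoids from the earlier illustrative snippets (so vocab, codes, U and W_context are assumed from those sketches):

import numpy as np

def p_word_given_context(word, v_c, codes, U):
    """P(w | v_c) as a product over the |pi(w)| internal nodes on w's path."""
    prob = 1.0
    for node_id, bit in codes[word]:
        s = 1.0 / (1.0 + np.exp(-np.dot(v_c, U[node_id])))   # P(branch left)
        prob *= s if bit == 1 else (1.0 - s)
    return prob

# Only |pi(w)| (~ log2 |V|) dot products are needed, versus |V| for a full softmax
v_sat = W_context[vocab.index("sat")]
print(p_word_given_context("on", v_sat, codes, U))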


57/70

Module 10.8: GloVe representations


58/70

Count based methods (SVD) rely on global co-occurrence counts from the corpus for computing word representations

Predict based methods learn word representations using co-occurrence information

Why not combine the two (count and learn)?


59/70

Corpus:

Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

X =
          human  machine  system    for   ...   user
human      2.01     2.01    0.23   2.14   ...   0.43
machine    2.01     2.01    0.23   2.14   ...   0.43
system     0.23     0.23    1.17   0.96   ...   1.29
for        2.14     2.14    0.96   1.87   ...  -0.13
...         ...      ...     ...    ...   ...    ...
user       0.43     0.43    1.29  -0.13   ...   1.71

P(j|i) = X_ij / Σ_j X_ij = X_ij / X_i

X_ij = X_ji

X_ij encodes important global information about the co-occurrence between i and j (global: because it is computed for the entire corpus)

Why not learn word vectors which are faithful to this information?

For example, enforce

v_i^T v_j = log P(j|i) = log X_ij − log X_i

Similarly,

v_j^T v_i = log X_ij − log X_j    (since X_ij = X_ji)

Essentially we are saying that we want word vectors v_i and v_j such that v_i^T v_j is faithful to the globally computed P(j|i)
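A small sketch of how such an X and P(j|i) could be computed from the toy corpus (raw counts with a symmetric window of size 2; the X shown above clearly uses some weighted/transformed variant of the counts, so the numbers will not match):

import numpy as np
from collections import defaultdict

corpus = [
    "human machine interface for computer applications",
    "user opinion of computer system response time",
    "user interface management system",
    "system engineering for improved response time",
]

window = 2
pair_counts = defaultdict(float)
for sentence in corpus:
    toks = sentence.lower().split()
    for i, w in enumerate(toks):
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if j != i:
                pair_counts[(w, toks[j])] += 1.0   # symmetric by construction

words = sorted({w for s in corpus for w in s.lower().split()})
idx = {w: t for t, w in enumerate(words)}
X = np.zeros((len(words), len(words)))
for (wi, wj), c in pair_counts.items():
    X[idx[wi], idx[wj]] = c                        # X_ij = X_ji

X_i = X.sum(axis=1, keepdims=True)                 # X_i = sum_j X_ij
P = X / np.maximum(X_i, 1e-12)                     # P(j|i) = X_ij / X_i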


60/70


Adding the two equations we get

2 v_i^T v_j = 2 log X_ij − log X_i − log X_j

v_i^T v_j = log X_ij − (1/2) log X_i − (1/2) log X_j

Note that log X_i and log X_j depend only on the words i & j and we can think of them as word-specific biases which will be learned

v_i^T v_j = log X_ij − b_i − b_j

v_i^T v_j + b_i + b_j = log X_ij

We can then formulate this as the following optimization problem

min_{v_i, v_j, b_i, b_j}  Σ_{i,j} ( v_i^T v_j + b_i + b_j − log X_ij )^2

where v_i^T v_j + b_i + b_j is the value predicted using the model parameters and log X_ij is the actual value computed from the given corpus
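A sketch of minimizing this objective with plain gradient descent over the non-zero entries of X (reusing the count matrix X from the sketch above; the dimension, learning rate and number of steps are arbitrary illustrative choices):

import numpy as np

V, d, lr = X.shape[0], 10, 0.01          # X from the co-occurrence sketch above
Wv = 0.01 * np.random.randn(V, d)        # word vectors v_i (rows)
b = np.zeros(V)                          # word-specific biases b_i

I, J = np.nonzero(X)                     # only pairs that actually co-occur
for _ in range(200):
    err = (Wv[I] * Wv[J]).sum(axis=1) + b[I] + b[J] - np.log(X[I, J])
    gW = np.zeros_like(Wv); gb = np.zeros(V)
    np.add.at(gW, I, 2 * err[:, None] * Wv[J])   # d/dv_i of the squared error
    np.add.at(gW, J, 2 * err[:, None] * Wv[I])   # d/dv_j
    np.add.at(gb, I, 2 * err)
    np.add.at(gb, J, 2 * err)
    Wv -= lr * gW
    b -= lr * gb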


61/70


min_{v_i, v_j, b_i, b_j}  Σ_{i,j} ( v_i^T v_j + b_i + b_j − log X_ij )^2

Drawback: weighs all co-occurrences equally

Solution: add a weighting function

min_{v_i, v_j, b_i, b_j}  Σ_{i,j} f(X_ij) ( v_i^T v_j + b_i + b_j − log X_ij )^2

Wishlist: f(X_ij) should be such that neither rare nor frequent words are over-weighted

f(x) = (x / x_max)^α   if x < x_max
     = 1               otherwise

where α can be tuned for a given dataset


62/70

Module 10.9: Evaluating word representations


63/70

How do we evaluate the learned word representations?


64/70

Semantic Relatedness

S_human(cat, dog) = 0.8

S_model(cat, dog) = v_cat^T v_dog / (‖v_cat‖ ‖v_dog‖) = 0.7

Ask humans to judge the relatedness between a pair of words

Compute the cosine similarity between the corresponding word vectors learned by the model

Given a large number of such word pairs, compute the correlation between S_model & S_human, and compare different models

Model 1 is better than Model 2 if

correlation(S_model1, S_human) > correlation(S_model2, S_human)
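A sketch of this protocol, assuming a dict of learned vectors and a list of human-scored pairs (both made up here) and using scipy's Spearman rank correlation, a common choice on such benchmarks (the slide just says "correlation"):

import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def relatedness_correlation(vectors, scored_pairs):
    """vectors: {word: np.ndarray}; scored_pairs: [(w1, w2, human_score), ...]"""
    s_model = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in scored_pairs]
    s_human = [score for _, _, score in scored_pairs]
    return spearmanr(s_model, s_human).correlation

# The model with the higher correlation against the human judgements wins.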


65/70

Synonym Detection

Term: levied

Candidates: {imposed, believed, requested, correlated}

Synonym := argmax_{c ∈ C} cosine(v_term, v_c)

Given: a term and four candidate synonyms

Pick the candidate which has the largest cosine similarity with the term

Compute the accuracy of different models and compare
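The selection rule itself is a one-liner (cosine as in the previous sketch; the vectors dict is again assumed):

def pick_synonym(term, candidates, vectors):
    # argmax over the candidate set C of cosine(v_term, v_c)
    return max(candidates, key=lambda c: cosine(vectors[term], vectors[c]))

# e.g. pick_synonym("levied", ["imposed", "believed", "requested", "correlated"], vectors)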


66/70

Analogy

brother : sister :: grandson : ?

work : works :: speak : ?

Semantic Analogy: Find the nearest neighbour of v_sister − v_brother + v_grandson

Syntactic Analogy: Find the nearest neighbour of v_works − v_work + v_speak
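A sketch of answering such analogy questions by a nearest-neighbour search around v_b − v_a + v_c, excluding the three query words themselves (the usual convention; cosine and numpy as in the earlier sketches):

def analogy(a, b, c, vectors):
    """a : b :: c : ?   e.g. analogy('brother', 'sister', 'grandson', vectors)"""
    target = vectors[b] - vectors[a] + vectors[c]
    best, best_sim = None, -float("inf")
    for w, v in vectors.items():
        if w in (a, b, c):               # do not return one of the query words
            continue
        sim = cosine(target, v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best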


67/70

So which algorithm gives the best result?

Baroni et al. [2014] showed that predict models consistently outperform count models in all tasks.

Levy et al. [2015] do a much more thorough analysis (IMO) and show that good old SVD does better than prediction based models on similarity tasks but not on analogy tasks.


68/70

Module 10.10: Relation between SVD & word2vec


69/70

The story ahead ...

Continuous bag of words model

Skip gram model with negative sampling (the famous word2vec)

GloVe word embeddings

Evaluating word embeddings

Good old SVD does just fine!!


70/70

[Figure: the bag-of-words architecture — one-hot inputs x ∈ R^{|V|} for the context words (he, sat, a, chair), hidden representation h ∈ R^k, with parameters W_context ∈ R^{k×|V|} and W_word ∈ R^{k×|V|}]

Recall that SVD does a matrix factorization of the co-occurrence matrix

Levy et al. [2015] show that word2vec also implicitly does a matrix factorization

What does this mean?

Recall that word2vec gives us W_context & W_word

Turns out that we can also show that

M = W_word^T W_context

where

M_ij = PMI(w_i, c_j) − log(k)

k = number of negative samples

So essentially, word2vec factorizes a matrix M which is related to the PMI-based co-occurrence matrix (very similar to what SVD does)
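A sketch of the matrix in question, built from a raw co-occurrence count matrix X such as the one in the earlier sketch (the PMI estimate below uses simple relative frequencies and clamps log 0; the shift k is the number of negative samples):

import numpy as np

def shifted_pmi(X, k=5, eps=1e-12):
    """M_ij = PMI(w_i, c_j) - log k, with PMI = log( P(i, j) / (P(i) P(j)) )."""
    P_ij = X / X.sum()
    P_i = P_ij.sum(axis=1, keepdims=True)
    P_j = P_ij.sum(axis=0, keepdims=True)
    pmi = np.log(np.maximum(P_ij, eps)) - np.log(P_i @ P_j + eps)
    return pmi - np.log(k)

M = shifted_pmi(X)
# Factorizing M (e.g. with a truncated SVD) recovers word/context matrices,
# which is what makes word2vec "very similar to what SVD does".
U_, S_, Vt_ = np.linalg.svd(M)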
