
OPTIMIZATION OF SKIP-GRAM MODEL

Chenxi Wu
Final Presentation for STA 790

Word Embedding
■ Map words to vectors of real numbers

■ The earliest word representation is the "one-hot representation"

Word Embedding
■ Distributed representation: each word is mapped to a dense, low-dimensional real-valued vector
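For illustration, a minimal Python sketch contrasting the two representations; the toy vocabulary, the 3-dimensional embedding size, and the random values are made up for this example, not taken from the slides.

import numpy as np

# Toy vocabulary (illustrative only).
vocab = ["king", "queen", "man", "woman"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

# One-hot representation: a sparse vector with a single 1.
def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# Distributed representation: a dense, low-dimensional vector
# learned from data (here just randomly initialized).
embedding_dim = 3
embeddings = np.random.randn(len(vocab), embedding_dim)

print(one_hot("king"))                   # e.g. [1. 0. 0. 0.]
print(embeddings[word_to_idx["king"]])   # a dense 3-d vector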

Word2Vec
■ An unsupervised NLP method developed by Google in 2013

■ Quantify the relationship between words

word2vec
■ Model architectures: Skip-gram and CBOW

■ Training optimizations: Hierarchical Softmax and Negative Sampling

Skip-gram
■ Input: the vector representation of a specific (center) word

■ Output: the context word vectors corresponding to this word
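As a concrete illustration of this input/output relationship, a minimal sketch of how Skip-gram training pairs can be generated from a sentence; the example sentence, the window size of 2, and the function name are illustrative assumptions.

# Generate (center word, context word) training pairs for Skip-gram.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        # Context = words within `window` positions of the center word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
print(skipgram_pairs(sentence, window=2))
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...]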

DNN (Deep Neural Network)

Huffman Tree
■ Leaf nodes denote all words in the vocabulary

■ The leaf nodes act as neurons in the output layer, and the internal nodes act as hidden neurons.

■ Input: n weights f1, f2, ..., fn (The frequency of each word in the corpus)

■ Output: The corresponding Huffman tree

■ Benefit: Common words have shorter Huffman codes

■ (1) Treat f1, f2, ..., fn as a forest with n trees (Each tree has only one node);

■ (2) In the forest, select the two trees with the smallest weights and merge them as the left and right subtrees of a new tree, whose root weight is the sum of the weights of its left and right children;

■ (3) Delete the two selected trees from the forest and add the new tree to the forest;

■ (4) Repeat steps (2) and (3) until there is only one tree left in the forest
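A minimal Python sketch of steps (1)-(4) above, using a heap to repeatedly pick the two smallest-weight trees; the toy word frequencies and the tuple-based tree representation are my own illustrative choices.

import heapq
import itertools

def build_huffman(freqs):
    # freqs: dict mapping word -> frequency.
    # Returns the Huffman tree as nested tuples: internal nodes are
    # (weight, left_subtree, right_subtree); leaves are (weight, word).
    counter = itertools.count()  # tie-breaker so heapq never compares trees directly
    forest = [(f, next(counter), (f, w)) for w, f in freqs.items()]
    heapq.heapify(forest)                    # step (1): forest of single-node trees
    while len(forest) > 1:                   # step (4): stop when one tree remains
        f1, _, t1 = heapq.heappop(forest)    # step (2): two smallest-weight trees...
        f2, _, t2 = heapq.heappop(forest)
        merged = (f1 + f2, t1, t2)           # ...merged; root weight = sum of children
        heapq.heappush(forest, (f1 + f2, next(counter), merged))   # step (3)
    return forest[0][2]

# Common words end up closer to the root, hence get shorter codes.
print(build_huffman({"the": 50, "cat": 10, "sat": 8, "zebra": 1}))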

Hierarchical Softmax

HS Details

HS Details
■ Use a sigmoid function at each internal node to decide whether to go left (+) or right (-)

■ In the example tree on the slide, the target word w is "hierarchical"
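A small sketch of this decision rule, assuming each internal node on the word's Huffman path has a parameter vector theta and a binary code digit d; the function and variable names are mine, not from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hs_probability(x, path_thetas, path_codes):
    # P(w | x): product of sigmoid decisions along w's Huffman path.
    #   x           -- input word vector
    #   path_thetas -- parameter vector of each internal node on the path
    #   path_codes  -- Huffman code digit at each node (0 = left/+, 1 = right/-)
    p = 1.0
    for theta, d in zip(path_thetas, path_codes):
        s = sigmoid(np.dot(x, theta))
        p *= s if d == 0 else (1.0 - s)
    return p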

HS Target Function
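For reference, a standard way to write the hierarchical-softmax objective for a target word w, with Huffman path length l_w, code digits d_j in {0, 1}, and internal-node vectors \theta_{j-1} (the original slide's notation may differ):

P(w \mid x_w) = \prod_{j=2}^{l_w} \big[\sigma(x_w^\top \theta_{j-1})\big]^{1-d_j}\,\big[1-\sigma(x_w^\top \theta_{j-1})\big]^{d_j},
\qquad
\mathcal{L}_w = \sum_{j=2}^{l_w} \Big[(1-d_j)\log\sigma(x_w^\top \theta_{j-1}) + d_j\log\big(1-\sigma(x_w^\top \theta_{j-1})\big)\Big].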

HS Gradient
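For reference, differentiating the log-likelihood above with respect to each internal-node vector \theta_{j-1}, and symmetrically with respect to x_w, gives the standard updates (again, the slide's own notation may differ):

\frac{\partial \mathcal{L}_w}{\partial \theta_{j-1}} = \big(1 - d_j - \sigma(x_w^\top \theta_{j-1})\big)\, x_w,
\qquad
\frac{\partial \mathcal{L}_w}{\partial x_w} = \sum_{j=2}^{l_w} \big(1 - d_j - \sigma(x_w^\top \theta_{j-1})\big)\, \theta_{j-1}.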

Negative Sampling
■ An alternative method for training the Skip-gram model

■ Subsample frequent words to decrease the number of training examples.

■ Let each training sample update only a small fraction of the model's weights.
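For reference, the subsampling rule in the cited Mikolov et al. (2013) paper discards each occurrence of a word w_i with probability

P(w_i) = 1 - \sqrt{t / f(w_i)},

where f(w_i) is the word's frequency in the corpus and t is a small threshold (around 10^{-5} in the paper).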

Negative Sample
■ For a center word w, randomly select one word u from its surrounding words; u and w compose one "positive sample".

■ For a negative sample, we keep this same u but pair it with a word randomly chosen from the dictionary that is not w.

Sampling Method
■ The unigram distribution is used to select negative words.

■ The probability of a word being selected as a negative sample is related to its frequency of occurrence: the higher the frequency, the more likely it is to be selected as a negative word.
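A minimal sketch of one way to implement this sampler, using the count^{3/4} weighting from the original word2vec implementation; the 0.75 exponent and the toy counts are assumptions not stated on this slide.

import numpy as np

def make_negative_sampler(word_counts, power=0.75):
    # Return a function that samples negative words with probability
    # proportional to count**power (0.75 is the original word2vec choice).
    words = list(word_counts)
    weights = np.array([word_counts[w] for w in words], dtype=float) ** power
    probs = weights / weights.sum()
    def sample(num_neg):
        return list(np.random.choice(words, size=num_neg, p=probs))
    return sample

sample_negatives = make_negative_sampler({"the": 50, "cat": 10, "sat": 8, "zebra": 1})
print(sample_negatives(5))   # frequent words like "the" appear more often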

NS Details
■ Still use a sigmoid function to train the model

■ Suppose that through negative sampling we get neg negative samples (context(w), w_i), i = 1, 2, …, neg. So each training sample is (context(w), w, w_1, …, w_neg).

■ We expect our positive sample to satisfy:

■ And we expect the negative samples to satisfy:
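Stated in one standard notation (the slide's own symbols may differ): with x the input (context) vector and \theta^{w_i} the output vector of word w_i, the positive sample should give \sigma(x^\top \theta^{w}) \approx 1, while each negative sample w_i, i = 1, \dots, neg, should give \sigma(x^\top \theta^{w_i}) \approx 0, i.e. \sigma(-x^\top \theta^{w_i}) \approx 1.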

NS Details
■ We want to maximize the log-likelihood of the positive and negative samples (sketched below)

■ Similarly, compute the gradient to update the parameters.
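A minimal sketch of one SGD step for this objective, treating the positive word (label 1) and the neg sampled words (label 0) uniformly; the vector names and the learning rate are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_update(x, theta_pos, theta_negs, lr=0.025):
    # One negative-sampling update for a single training sample.
    #   x          -- input (context) vector, updated in place
    #   theta_pos  -- output vector of the positive word (label 1)
    #   theta_negs -- output vectors of the sampled negative words (label 0)
    # Ascends the gradient of
    #   log sigma(x.theta_pos) + sum_i log(1 - sigma(x.theta_neg_i)).
    grad_x = np.zeros_like(x)
    for theta, label in [(theta_pos, 1.0)] + [(t, 0.0) for t in theta_negs]:
        g = label - sigmoid(np.dot(x, theta))   # gradient scale: (y - sigma)
        grad_x += g * theta
        theta += lr * g * x                     # update output vector in place
    x += lr * grad_x                            # update input vector last

Each call nudges x and the output vectors toward satisfying the conditions on the previous slide; repeating this over all training samples maximizes the log-likelihood.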

Reference
■ Mikolov et al., 2013. Distributed Representations of Words and Phrases and their Compositionality. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and

The code for this implementation can be found on my GitHub repo: https://github.com/cassie1102