Word Embedding■ Map words to vectors of real numbers
■ The earliest word representation is the "one-hot representation"
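As an illustration only, here is a minimal sketch of a one-hot encoding over a toy vocabulary (the vocabulary and helper names are made up for the example):

import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))  # [0. 1. 0. 0. 0.]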
Word2Vec■ An unsupervised NLP method developed by Google in 2013
■ Quantify the relationship between words
word2vec■ Two model architectures: Skip-gram and CBOW
■ Two training methods for either architecture: Hierarchical Softmax and Negative Sampling
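As a usage sketch only (assuming gensim 4.x is installed; the toy corpus and parameter values are placeholders), the two architectures and two training methods map onto the sg, hs, and negative parameters of gensim's Word2Vec:

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "ate", "my", "homework"]]

# Skip-gram (sg=1) trained with hierarchical softmax (hs=1, negative=0).
sg_hs = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, hs=1, negative=0)

# Skip-gram (sg=1) trained with negative sampling (hs=0, negative=5).
sg_ns = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, hs=0, negative=5)

# CBOW (sg=0) supports the same two training options.
print(sg_hs.wv["cat"].shape)  # (50,)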
Skip-gram■ Input: the vector representation of a specific (center) word
■ Output: predictions for the context words surrounding that word
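For illustration (the function name and window size are my own), a minimal sketch of how skip-gram turns a sentence into (center word, context word) training pairs:

def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within a symmetric window around each word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"]))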
Huffman Tree■ Leaf nodes denote all words in the vocabulary
■ The leaf nodes act as neurons in the output layer, and the internal nodes act as hidden neurons.
■ Input: n weights f1, f2, ..., fn (The frequency of each word in the corpus)
■ Output: The corresponding Huffman tree
■ Benefit: common words have shorter Huffman codes
■ (1) Treat f1, f2, ..., fn as a forest of n trees, each with a single node;
■ (2) Select the two trees with the smallest weights in the forest and merge them as the left and right subtrees of a new tree, whose root weight is the sum of the weights of its two children;
■ (3) Delete the two selected trees from the forest and add the new tree to the forest;
■ (4) Repeat steps (2) and (3) until only one tree remains in the forest (see the sketch after this list)
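A minimal sketch of this construction (my own illustration, not the slides' code) using a priority queue; it returns a 0/1 Huffman code per word, so frequent words end up with shorter codes:

import heapq
import itertools

def huffman_codes(freqs):
    """Build a Huffman tree from {word: frequency} and return {word: code string}."""
    counter = itertools.count()  # tie-breaker so heapq never compares tree nodes
    heap = [(f, next(counter), word) for word, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # the two smallest-weight trees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(counter), (left, right)))
    codes = {}

    def walk(node, code):
        if isinstance(node, tuple):          # internal node
            walk(node[0], code + "0")        # left child
            walk(node[1], code + "1")        # right child
        else:                                # leaf = word
            codes[node] = code or "0"

    walk(heap[0][2], "")
    return codes

print(huffman_codes({"the": 50, "cat": 10, "sat": 8, "on": 20, "mat": 5}))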
HS Details■ At each internal node, use a sigmoid function to decide whether to go left (+) or go right (-)
■ In the example figure, the target word w is "hierarchical".
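A sketch of the resulting probability (the notation below is my own shorthand following the common derivation, not the slides'): each internal node j on the Huffman path to w has a parameter vector theta_j, and the word's probability is the product of the sigmoid decisions along the path.

% Hierarchical softmax probability of word w given input vector x_w, where
% d_j in {0,1} is the Huffman code bit at the j-th internal node on a path
% of length l_w (notation assumed, not taken from the slides).
p(w \mid x_w) = \prod_{j=1}^{l_w}
  \sigma(x_w^\top \theta_j)^{\,1-d_j}
  \bigl(1 - \sigma(x_w^\top \theta_j)\bigr)^{\,d_j},
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}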
Negative Sampling■ Alternative method for training the Skip-gram model
■ Subsamples frequent words to decrease the number of training examples.
■ Lets each training sample update only a small percentage of the model's weights.
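A sketch of one common subsampling rule; the threshold t = 1e-5 and the discard formula follow the word2vec paper's heuristic, and the function name is mine:

import math
import random

def keep_word(word_freq, total_words, t=1e-5):
    """Randomly decide whether to keep one occurrence of a word.
    Frequent words are discarded more often, shrinking the training set."""
    f = word_freq / total_words            # relative frequency of the word
    p_discard = 1.0 - math.sqrt(t / f)     # discard probability from the paper's heuristic
    return random.random() > p_discard

# A very frequent word is kept far less often than a rare one.
print(keep_word(word_freq=1_000_000, total_words=10_000_000))
print(keep_word(word_freq=50, total_words=10_000_000))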
Negative sample■ For a target word w, randomly select one word u from its surrounding words, so u and w compose one "positive sample".
■ A negative sample keeps this same u but pairs it with a word randomly chosen from the dictionary that is not w.
Sampling method■ The unigram distribution is used to select negative words.
■ The probability of a word being selected as a negative sample is related to its frequency: the more frequent the word, the more likely it is to be chosen as a negative word
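A sketch of frequency-proportional negative sampling; raising the counts to the 3/4 power follows the cited paper, while the helper name and toy counts are mine:

import random

word_counts = {"the": 50, "cat": 10, "sat": 8, "on": 20, "mat": 5}

# Unigram distribution raised to the 3/4 power, as in the cited paper.
weights = {w: c ** 0.75 for w, c in word_counts.items()}
total = sum(weights.values())
probs = {w: x / total for w, x in weights.items()}

def sample_negatives(target, k=5):
    """Draw k negative words; skip draws that happen to equal the target word."""
    words = list(probs)
    negs = []
    while len(negs) < k:
        draw = random.choices(words, weights=[probs[w] for w in words])[0]
        if draw != target:
            negs.append(draw)
    return negs

print(sample_negatives("cat"))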
NS Details■ Still uses the sigmoid function to train the model
■ Suppose that through negative sampling we get neg negative samples (context(w), w_i), i = 1, 2, …, neg, so each training sample is (context(w), w, w_1, …, w_neg).
■ We expect the positive sample's predicted probability (sigmoid output) to be close to 1
■ and expect each negative sample's predicted probability to be close to 0
NS Details■ Want to maximize the per-sample log-likelihood sketched below
■ As with hierarchical softmax, compute gradients of this objective to update the parameters.
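A sketch of the per-sample objective in the negative-sampling formulation of the cited paper; here x_w stands for the input vector built from context(w) and theta^w for the output parameters of word w, notation that is my own shorthand rather than the slides':

% Per training sample (context(w), w, w_1, ..., w_neg):
% the positive term pushes the sigmoid toward 1, the negative terms toward 0.
L = \log \sigma\!\left(x_w^\top \theta^{w}\right)
  + \sum_{i=1}^{neg} \log\!\left(1 - \sigma\!\left(x_w^\top \theta^{w_i}\right)\right)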
Reference■ Mikolov et al., 2013. Distributed Representations of Words and Phrases and their Compositionality. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and
The code for this implementation can be found on my GitHub repo: https://github.com/cassie1102