Word Embeddings with Word2Vec

Word embeddings are a type of word representation that encodes general semantic relationships. Since most machine learning techniques do not accept raw text as input, text data must be transformed into numbers before it can be utilised.

Representing words with arbitrary labels such as “id100” and “id101” for “aircraft” and “airplane” gives the system no useful information about any relationship that may exist between them.

Encoding a large vocabulary with one-hot vectors produces a huge, sparse matrix (typically thousands of dimensions, if not millions). Computing the final softmax then becomes prohibitively expensive, because each training batch has to loop over the whole vocabulary to calculate the normalised probabilities.
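As a small illustration of why one-hot vectors carry no semantic signal, the sketch below (with a hypothetical three-word vocabulary) shows that any two distinct one-hot vectors are orthogonal, so related words look no more similar than unrelated ones:

```python
import numpy as np

# Hypothetical toy vocabulary; real vocabularies run to the millions.
vocab = {"aircraft": 0, "airplane": 1, "banana": 2}
V = len(vocab)

def one_hot(word):
    """Return a V-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(V)
    vec[vocab[word]] = 1.0
    return vec

a = one_hot("aircraft")
b = one_hot("airplane")

# Every pair of distinct one-hot vectors is orthogonal: the encoding
# carries no notion of similarity between related words.
print(np.dot(a, b))  # 0.0
```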

Word embedding research has a long history. Bengio et al. [1] proposed a neural network with one hidden layer that predicts the next word in a sequence. See the network diagram below.


The architecture consists of an embedding layer, a hidden layer, and a softmax layer. It later became the foundation of newer models such as Word2Vec and GloVe. However, updating the probabilities over a large vocabulary in the softmax layer is very slow, and this became the main bottleneck. Refer to the equation below.
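A minimal numpy sketch of this style of feed-forward language model is shown below; all layer sizes are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Sketch of a Bengio-style feed-forward language model.
# Sizes are illustrative: vocab size, embedding dim, context length, hidden units.
V, d, n, h = 1000, 32, 3, 64
rng = np.random.default_rng(0)

C = rng.normal(0, 0.1, (V, d))        # embedding (lookup) table
H = rng.normal(0, 0.1, (n * d, h))    # hidden-layer weights
U = rng.normal(0, 0.1, (h, V))        # output (softmax) weights

def next_word_probs(context_ids):
    """Predict a distribution over the next word from n context word ids."""
    x = C[context_ids].reshape(-1)        # concatenate the n embeddings
    hidden = np.tanh(x @ H)               # hidden layer
    logits = hidden @ U                   # one logit per vocabulary word
    exp = np.exp(logits - logits.max())   # softmax over the FULL vocabulary
    return exp / exp.sum()

p = next_word_probs([4, 7, 42])
print(p.shape)  # (1000,) — one probability per word; the softmax is the bottleneck
```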

\displaystyle p(w_o \vert w_I) = \frac{\exp({v'_{w_o}}^{\top} v_{w_I})} {\sum_{i=1}^V \exp({v'_{w_i}}^{\top} v_{w_I})}
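Evaluating this softmax directly makes the cost visible. In the sketch below, with made-up input embeddings $v$ and output embeddings $v'$, a single probability already requires one dot product per vocabulary word:

```python
import numpy as np

# Direct evaluation of the softmax above, with random stand-in embeddings.
rng = np.random.default_rng(1)
V, d = 5000, 100
v = rng.normal(0, 0.1, (V, d))        # input embeddings v_w
v_prime = rng.normal(0, 0.1, (V, d))  # output embeddings v'_w

def p_softmax(w_o, w_i):
    """p(w_o | w_I): costs O(V * d) work for a single probability."""
    scores = v_prime @ v[w_i]          # one dot product per vocabulary word
    scores = scores - scores.max()     # numerical stability
    exp = np.exp(scores)
    return exp[w_o] / exp.sum()

p = p_softmax(0, 10)
print(0.0 < p < 1.0)  # True
```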


Word2Vec addresses the slowness of the classic language model by discarding the hidden layer and using candidate sampling to approximate the normalisation term in the denominator of the softmax function.

One such method is called Noise Contrastive Estimation (NCE). Instead of summing the probability distribution over the entire vocabulary, it reinforces the weights linking the input word to both positive (in-context) and negative (out-of-context) words. The training labels become 1 and 0 rather than the context word itself. This effectively turns the original problem into a binary classification proxy problem.

The loss objective becomes

-\left[\log p(D=1 \vert w, w_I) + \sum_{i=1,\, \tilde{w}_i \sim Q}^{N} \log p(D=0 \vert \tilde{w}_i, w_I)\right]

We can calculate the probability of the D = 1 case with:

\displaystyle P(D=1|w,c)=\frac{P(w \vert c)}{P(w \vert c)+kQ(w)}

The probability of the D = 0 case is then 1 − P(D = 1).
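Plugging numbers into the formula above makes the two cases concrete. The score, k, and Q(w) values below are arbitrary stand-ins for a model's unnormalised probability P(w|c), the number of noise samples, and the noise probability:

```python
def p_true(score, k, q_w):
    """P(D=1 | w, c) = P(w|c) / (P(w|c) + k * Q(w)), with `score` standing in for P(w|c)."""
    return score / (score + k * q_w)

score, k, q_w = 0.8, 5, 0.01   # assumed illustrative values
p1 = p_true(score, k, q_w)
p0 = 1.0 - p1                  # P(D=0 | w, c)
print(round(p1, 4), round(p0, 4))  # 0.9412 0.0588
```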

Since calculating P(w|c) still requires summing over the probabilities of the entire vocabulary, Mnih and Teh (2012) and Vaswani et al. fixed the expensive denominator to 1, which they claim does not affect the model’s performance. This makes the model converge much faster.

The final logistic loss objective is then

\displaystyle L_\theta = -\sum_{w_i \in V}\left[\log\frac{\exp(h^{\top}v'_{w_i})}{\exp(h^{\top}v'_{w_i})+kQ(w_i)}+\sum_{j=1}^{k}\log\left(1-\frac{\exp(h^{\top}v'_{\tilde{w}_{ij}})}{\exp(h^{\top}v'_{\tilde{w}_{ij}})+kQ(\tilde{w}_{ij})}\right)\right]
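For one training word this objective can be evaluated numerically as below. The hidden vector h, the output vectors v', and the Q values are random stand-ins for learned parameters, not a real trained model:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 50, 5

h = rng.normal(0, 0.1, d)              # hidden/input representation h
v_pos = rng.normal(0, 0.1, d)          # output vector v' of the context word
v_neg = rng.normal(0, 0.1, (k, d))     # output vectors of the k noise words
q_pos, q_neg = 0.01, np.full(k, 0.01)  # noise probabilities Q(w)

def nce_loss(h, v_pos, v_neg, q_pos, q_neg, k):
    """-[log P(D=1 | positive) + sum_j log(1 - P(D=1 | noise_j))]."""
    s_pos = np.exp(h @ v_pos)               # exp(h^T v') for the positive word
    s_neg = np.exp(v_neg @ h)               # and for each noise word
    p_pos = s_pos / (s_pos + k * q_pos)
    p_neg = s_neg / (s_neg + k * q_neg)     # P(D=1) for each noise word
    return -(np.log(p_pos) + np.log(1.0 - p_neg).sum())

loss = nce_loss(h, v_pos, v_neg, q_pos, q_neg, k)
print(loss > 0)  # True: a sum of negative log-probabilities
```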

See the full source code on GitHub.


After the word embeddings are obtained, they can be visualised using dimensionality reduction techniques such as PCA or t-SNE.
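A minimal PCA projection can be done with plain numpy, as sketched below on random stand-in embeddings (scikit-learn's TSNE class can be used the same way for a t-SNE view):

```python
import numpy as np

rng = np.random.default_rng(4)
embeddings = rng.normal(0, 1.0, (100, 128))   # 100 words, 128-dim vectors (stand-ins)

# PCA via SVD: centre the data, then project onto the top 2 principal components.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T

print(coords.shape)  # (100, 2)
# Each row of `coords` can now be scattered and annotated with its word.
```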

In TensorBoard, both visualisations are available; see below.