CS224N Lecture 2: Word Vectors, Word Senses, and Neural Classifiers

Published 2023-01-15


Review: Main idea of word2vec

  • Start with random word vectors
  • Iterate through each word position in the whole corpus
  • Try to predict surrounding words using word vectors: $P(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)}$
  • Learning: Update vectors so they can predict actual surrounding words better
  • For each word, there are two word vectors: one used when it is an outside (context) word and another used when it is the center word.
  • $\text{softmax}(U \bold{v}_4)$ gives the predicted probability distribution: the softmax of the dot products between all outside-word vectors and the center word vector (here the center word is at position 4; a minimal sketch follows this list).
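As a concrete illustration (not from the lecture itself), here is a minimal sketch of the naive softmax prediction, assuming `U` holds the outside-word vectors as rows and `v_c` is the center word vector; all names and sizes are illustrative.

```python
import numpy as np

# Minimal sketch of the naive softmax prediction step in word2vec.
# Assumed shapes (illustrative): U is |V| x d with outside vectors as rows,
# v_c is the d-dimensional center word vector.
def predict_context_probs(U, v_c):
    scores = U @ v_c                             # u_w^T v_c for every word w in V
    exp_scores = np.exp(scores - scores.max())   # subtract max for numerical stability
    return exp_scores / exp_scores.sum()         # softmax: P(o | c) for every candidate o

# Example with small random vectors, as at the start of training
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(10, 4))          # toy vocabulary of 10 words, d = 4
v_c = rng.normal(scale=0.1, size=4)
probs = predict_context_probs(U, v_c)            # non-negative, sums to 1
```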

The skip-gram model with negative sampling

The normalization term is computationally expensive. Hence, standard word2vec implements the skip-gram model with negative sampling.

Main idea: train binary logistic regressions to differentiate a true pair (center word and a word in its context window) versus several “noise” pairs (the center word paired with a random word)

In fact, there are two variants in the word2vec algorithm family: Skip-gram (SG) and Continuous Bag of Words (CBOW). The former predicts context (“outside”) words (position independent) given the center word, and the latter predicts the center word from a (bag of) context words.
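A small hypothetical sketch of the difference, showing the training examples each variant produces at one position of a toy sentence with window size $m = 2$ (the sentence and variable names are made up for illustration):

```python
# Illustrative only: training examples produced at one position of a toy sentence
# with window size m = 2.
tokens = ["problems", "turning", "into", "banking", "crises"]
center_idx, m = 2, 2
context = tokens[center_idx - m:center_idx] + tokens[center_idx + 1:center_idx + 1 + m]

# Skip-gram (SG): predict each outside word from the center word
sg_pairs = [("into", o) for o in context]   # [('into', 'problems'), ('into', 'turning'), ...]

# CBOW: predict the center word from the bag of outside words
cbow_example = (context, "into")
```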

For the loss function, we can use naive softmax (simple, but expensive when there are many output classes) or negative sampling.

Objective function (to minimize)

$$
J_{neg-sample}(\bold{u}_o, \bold{v}_c, \bold{U}) = - \log{\sigma(\bold{u}_o^T \bold{v}_c)} - \sum_{k \in \{K \text{ sampled indices}\}} \log{\sigma(-\bold{u}_k^T \bold{v}_c)} \\
\sigma(x) = \frac{e^x}{1 + e^x}
$$

In practice, we can use $1 - \sigma(x)$ in place of $\sigma(-x)$, since $\sigma(-x) = 1 - \sigma(x)$.
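A minimal sketch of this objective in code, assuming NumPy vectors for $\bold{u}_o$, $\bold{v}_c$ and a matrix holding the $K$ sampled negative outside vectors; names and shapes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Minimal sketch of the negative-sampling loss for one (center, outside) pair.
# Assumed shapes (illustrative): u_o and v_c are d-vectors, U_neg is K x d,
# one row per sampled negative outside word.
def neg_sample_loss(u_o, v_c, U_neg):
    pos = -np.log(sigmoid(u_o @ v_c))            # true pair: push sigma(u_o^T v_c) toward 1
    neg = -np.log(sigmoid(-U_neg @ v_c)).sum()   # noise pairs: push sigma(u_k^T v_c) toward 0
    return pos + neg
```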

Sample with $P(w) = U(w)^{3/4}/Z$, the unigram distribution $U(w)$ raised to the 3/4 power. The power increases how often less frequent words are sampled, relative to their raw frequency.
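A sketch of drawing negatives from $U(w)^{3/4}/Z$, assuming raw unigram counts are available; the function and variable names are illustrative.

```python
import numpy as np

# Illustrative: draw K negative word indices from U(w)^(3/4) / Z.
def sample_negatives(unigram_counts, K, seed=0):
    rng = np.random.default_rng(seed)
    probs = np.asarray(unigram_counts, dtype=float) ** 0.75  # raise U(w) to the 3/4 power
    probs /= probs.sum()                                     # normalize by Z
    return rng.choice(len(probs), size=K, p=probs)

counts = [100, 10, 1, 1]                  # toy counts: one frequent word, three rare ones
negatives = sample_negatives(counts, K=5)
# Rare words get a larger share of samples than their raw frequencies alone would give.
```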

Stochastic gradients with negative sampling

  • We iteratively take gradients at each window for SGD
  • In each window, we have at most $2m + 1$ words that actually appear, plus $2km$ randomly sampled negative words, so the gradient is very sparse!
  • We update only the corresponding rows of $\bold{U}$ and $\bold{V}$ (rows, not columns, in actual implementations), as sketched below.
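A sketch of the sparse update for one window, under the assumption that the gradients for the touched rows (center word, actual outside words, sampled negatives) have already been computed; the names are illustrative.

```python
# Illustrative sparse SGD step: only rows touched in this window are modified;
# all other rows of U and V are left untouched.
def sparse_sgd_step(U, V, row_grads_U, row_grads_V, lr=0.05):
    # row_grads_*: dict mapping word index -> gradient vector for that row
    for idx, grad in row_grads_U.items():
        U[idx] -= lr * grad
    for idx, grad in row_grads_V.items():
        V[idx] -= lr * grad
```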

GloVe

Use a log-bilinear model to capture ratios of co-occurrence probabilities as linear meaning components in a word vector space.

$$
w_i \cdot w_j = \log{P(i|j)} \\
w_x \cdot (w_a - w_b) = \log{\frac{P(x|a)}{P(x|b)}}
$$
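Making the step between the two lines explicit, the second relation follows directly from the first:

$$
w_x \cdot (w_a - w_b) = w_x \cdot w_a - w_x \cdot w_b = \log{P(x|a)} - \log{P(x|b)} = \log{\frac{P(x|a)}{P(x|b)}}
$$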

Details are skipped here.

Evaluation of word vectors

Intrinsic

  • Evaluation on a specific/intermediate subtask
  • Fast to compute
  • Helps to understand that system
  • Not clear if really helpful unless correlation to real task is established

For example, the word analogy task (man : woman :: king : ?).
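A minimal sketch of this analogy evaluation, assuming a dict mapping words to NumPy vectors; the function name and the tiny vocabulary are illustrative.

```python
import numpy as np

# Illustrative intrinsic evaluation: solve a : b :: c : ? by vector arithmetic
# plus cosine similarity, excluding the three query words themselves.
def analogy(vectors, a, b, c):
    target = vectors[b] - vectors[a] + vectors[c]
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# With good embeddings, analogy(vectors, "man", "woman", "king") should return "queen".
```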

Extrinsic

  • Evaluation on a real task
  • Can take a long time to compute accuracy
  • Unclear whether the subsystem itself is the problem, or its interaction with other subsystems
  • If replacing exactly one subsystem with another improves accuracy → winning!

Word senses and word sense ambiguity

One word can have many different meanings; does one vector capture all of these meanings?

  • Different senses of a word reside in a linear superposition (weighted sum) in standard word embeddings like word2vec
  • $\bold{v}_{pike} = \alpha_1 \bold{v}_{pike_1} + \alpha_2 \bold{v}_{pike_2} + \alpha_3 \bold{v}_{pike_3}$
  • $\alpha_1 = \frac{f_1}{f_1 + f_2 + f_3}$, where $f_i$ is the frequency of sense $i$ (a sketch of this weighted sum follows below)
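A sketch of the superposition above, assuming the separate sense vectors and their frequencies were available (in practice they are not given directly, which is the point of the observation); names and values are illustrative.

```python
import numpy as np

# Illustrative: one "pike" vector as the frequency-weighted sum of its sense vectors.
def superpose(sense_vectors, frequencies):
    freqs = np.asarray(frequencies, dtype=float)
    alphas = freqs / freqs.sum()                  # alpha_i = f_i / (f_1 + f_2 + f_3)
    return sum(a * v for a, v in zip(alphas, sense_vectors))

senses = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]  # toy sense vectors
v_pike = superpose(senses, frequencies=[50, 30, 20])
```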

Deep Learning Classification: Named Entity Recognition (NER)

The task: find and classify names in text, by labeling word tokens.
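A small illustrative example of the token-labeling formulation (the sentence and the label set are made up; common label sets include PER, LOC, ORG, and O for "not a name"):

```python
# Illustrative only: NER framed as per-token classification.
tokens = ["Paris", "Hilton", "visited", "Paris", "last", "week"]
labels = ["PER",   "PER",    "O",       "LOC",   "O",    "O"]
# The same surface form ("Paris") gets different labels depending on context,
# which is why the surrounding words matter for this task.
```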