Review: Main idea of word2vec
- Start with random word vectors
- Iterate through each word position in the whole corpus
- Try to predict surrounding words using word vectors: $P(o|c) = \frac{\exp(u_o^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)}$
- Learning: Update vectors so they can predict actual surrounding words better
- For each word, there are two word vectors: one ($u_w$) used when it is an outside word and another ($v_w$) used when it is the center word.
- $\text{softmax}(U v_4)$ gives the predicted probability distribution: a softmax over the dot products between each outside word vector (the rows of $U$) and the center word vector $v_4$ (see the sketch below).
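A minimal sketch of this prediction step, assuming the outside vectors are stacked as rows of a matrix `U` and the center word's vector is `v_c`; the function names are illustrative, not from any particular library:

```python
import numpy as np

def softmax(scores):
    # Shift by the max for numerical stability before exponentiating
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

def predict_context_probs(U, v_c):
    # Dot product of every outside word vector (rows of U) with the center
    # vector, then softmax to get P(o | c) for every word o in the vocabulary
    return softmax(U @ v_c)

# Toy example: vocabulary of 5 words, 3-dimensional vectors
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))
v_c = rng.normal(size=3)
print(predict_context_probs(U, v_c))  # 5 probabilities summing to 1
```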
The skip-gram model with negative sampling
The normalization term in the denominator is computationally expensive, so in standard word2vec we implement the skip-gram model with negative sampling.
Main idea: train binary logistic regressions to differentiate a true pair (center word and a word in its context window) versus several “noise” pairs (the center word paired with a random word)
In fact, there are two variants in the word2vec algorithm family: Skip-gram (SG) and Continuous Bag of Words (CBOW). The former predicts context ("outside") words (position independent) given the center word; the latter predicts the center word from a (bag of) context words.
For the loss function, we have naive softmax (simple, but expensive when there are many output classes) and negative sampling.
Objective function (to minimize)
$$
J_{\text{neg-sample}}(\bold{u}_o, \bold{v}_c, \bold{U}) = - \log{\sigma(\bold{u}_o^T \bold{v}_c)} - \sum_{k \in \{K \text{ sampled indices}\}} \log{\sigma(-\bold{u}_k^T \bold{v}_c)} \\
\sigma(x) = \frac{e^x}{1 + e^x}
$$
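A minimal sketch of computing this loss, assuming `u_o` is the true outside word's vector, `v_c` the center vector, and `U_neg` the $K$ sampled negative outside vectors stacked as rows (all names illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_loss(u_o, v_c, U_neg):
    # First term: reward a high score for the true (outside, center) pair
    true_term = -np.log(sigmoid(u_o @ v_c))
    # Second term: penalize high scores for the K sampled noise pairs
    noise_term = -np.sum(np.log(sigmoid(-U_neg @ v_c)))
    return true_term + noise_term
```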
In practice, we can use $1 - \sigma(x)$ in place of $\sigma(-x)$, since $\sigma(-x) = 1 - \sigma(x)$.
Sample negatives with $P(w) = U(w)^{3/4}/Z$, the unigram distribution $U(w)$ raised to the 3/4 power ($Z$ is the normalizing constant). The power makes less frequent words be sampled more often than under the raw unigram distribution (see the sketch below).
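A minimal sketch of building this noise distribution and sampling from it, assuming `counts` holds raw unigram counts indexed by word id (names are illustrative):

```python
import numpy as np

def make_noise_dist(counts):
    probs = np.asarray(counts, dtype=float) ** 0.75  # U(w)^{3/4}
    return probs / probs.sum()                        # divide by Z to normalize

def sample_negatives(noise_dist, k, rng):
    return rng.choice(len(noise_dist), size=k, p=noise_dist)

counts = [100, 10, 1, 1]              # toy unigram counts for a 4-word vocabulary
noise_dist = make_noise_dist(counts)
print(noise_dist)                      # rare words get a larger share than under raw U(w)
print(sample_negatives(noise_dist, k=5, rng=np.random.default_rng(0)))
```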
Stochastic gradients with negative sampling
- We iteratively take gradients at each window for SGD
- In each window, we have at most $2m + 1$ words that actually appear, plus $2km$ randomly sampled negative words with negative sampling, so the gradient is very sparse!
- We update only the corresponding rows of $\bold{U}$ and $\bold{V}$ (rows, not columns, in actual code, since word vectors are typically stored as rows), as sketched below.
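A minimal sketch of one such sparse update for a single (center, outside) pair with $K$ negatives, assuming word vectors are stored as rows of `U` (outside) and `V` (center); the gradient expressions follow from differentiating $J_{\text{neg-sample}}$, and the names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(U, V, c, o, neg_ids, lr=0.05):
    v_c, u_o, U_neg = V[c], U[o], U[neg_ids]
    s_pos = sigmoid(u_o @ v_c)      # score of the true pair
    s_neg = sigmoid(U_neg @ v_c)    # scores of the K noise pairs

    grad_v_c = (s_pos - 1.0) * u_o + s_neg @ U_neg
    grad_u_o = (s_pos - 1.0) * v_c
    grad_U_neg = np.outer(s_neg, v_c)

    # Only these few rows of U and V are touched -- the update is very sparse
    V[c] -= lr * grad_v_c
    U[o] -= lr * grad_u_o
    np.subtract.at(U, neg_ids, lr * grad_U_neg)  # handles repeated negative ids
```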
GloVe
Use a log-bilinear model to capture ratios of co-occurrence probabilities as linear meaning components in a word vector space.
$$
w_i \cdot w_j = \log{P(i|j)} \\
w_x \cdot (w_a - w_b) = \log{\frac{P(x|a)}{P(x|b)}}
$$
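The second identity follows from the first by linearity of the dot product:

$$
w_x \cdot (w_a - w_b) = w_x \cdot w_a - w_x \cdot w_b = \log{P(x|a)} - \log{P(x|b)} = \log{\frac{P(x|a)}{P(x|b)}}
$$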
Skip details.
Evaluation of word vectors
Intrinsic
- Evaluation on a specific/intermediate subtask
- Fast to compute
- Helps to understand that system
- Not clear if really helpful unless correlation to real task is established
Example: word analogy tasks, as sketched below.
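A minimal sketch of the analogy evaluation ("a is to b as c is to ?"), assuming `vectors` maps words to unit-normalized NumPy arrays (names and the toy usage are illustrative):

```python
import numpy as np

def analogy(vectors, a, b, c):
    # e.g. king - man + woman ~= queen: find the word closest (by cosine) to b - a + c
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the query words themselves are excluded
        sim = vec @ target  # cosine similarity, since all vectors are unit length
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Usage: analogy(vectors, "man", "king", "woman")  ->  ideally "queen"
```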
Extrinsic
- Evaluation on a real task
- Can take a long time to compute accuracy
- Unclear whether the subsystem is the problem, or its interaction with other subsystems
- If replacing exactly one subsystem with another improves accuracy → winning!
Word senses and word sense ambiguity
One word can have many different meanings; does one vector capture all of them?
- Different senses of a word reside in a linear superposition (weighted sum) in standard word embeddings like word2vec
- $\bold{v}_{\text{pike}} = \alpha_1 \bold{v}_{\text{pike}_1} + \alpha_2 \bold{v}_{\text{pike}_2} + \alpha_3 \bold{v}_{\text{pike}_3}$
- $\alpha_1 = \frac{f_1}{f_1 + f_2 + f_3}$, where $f_i$ is the frequency of sense $i$ (see the sketch below).
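A minimal numeric sketch of this weighted sum; the sense vectors and frequencies below are made up for illustration:

```python
import numpy as np

sense_vecs = np.array([[1.0, 0.0],    # v_pike_1 (toy values)
                       [0.0, 1.0],    # v_pike_2
                       [0.5, 0.5]])   # v_pike_3
freqs = np.array([30.0, 15.0, 5.0])   # f_1, f_2, f_3

alphas = freqs / freqs.sum()           # alpha_i = f_i / (f_1 + f_2 + f_3)
v_pike = alphas @ sense_vecs           # weighted sum (superposition) of sense vectors
print(alphas, v_pike)
```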
Deep Learning Classification: Named Entity Recognition (NER)
The task: find and classify names in text by labeling word tokens.
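For illustration, a token-level labeling might look like this (the PER/ORG/O tag set is a common convention, assumed here rather than taken from these notes):

```python
tokens = ["Chris", "Manning", "teaches", "at", "Stanford"]
labels = ["PER",   "PER",     "O",       "O",  "ORG"]   # O = token is not part of a name
```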