Intro & Word Vectors
Even though most of deep learning, and even word vector learning, seems like magic, it's not really magic. It's really just doing math. —— Chris Manning
Meaning of a word
Definition of meaning:
- the idea that is represented by a word, phrase, etc.
- the idea that a person wants to express by using words, signs, etc.
- the idea that is expressed in a work of writing, art, etc.
The most common linguistic way of thinking of meaning
signifier (symbol) $\Leftrightarrow$ signified (idea or thing)
That's denotational semantics. For example, the word "chair" denotes an actual chair.
But this model is hard to implement computationally: we don't know how to construct a connection from the word "chair" to an object in the real world.
Common NLP solution
Use dictionaries like WordNet, which contain lists of synonym sets and hypernyms ("is a" relationships).
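For instance, a small sketch of querying WordNet through NLTK (this assumes the `nltk` package is installed and the WordNet corpus has been downloaded; it is only an illustration, not part of the lecture):

```python
# Assumes: pip install nltk, then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

for synset in wn.synsets("chair"):
    # Each synset is one sense of "chair", with its synonym lemmas...
    print(synset.name(), synset.lemma_names())
    # ...and its hypernyms ("is a" parents), e.g. seat -> furniture.
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])
```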
The problem:
- Great as a resource but missing nuance
- Missing new meaning of words
- Subjective
- Requires human labor to create and adapt
- Can’t compute accurate word similarity
Representing words as discrete symbols
Each word is encoded as a one-hot vector (a vector with a single 1 and 0s everywhere else), e.g. in a toy four-word vocabulary:
motel = [0, 0, 0, 1]
hotel = [0, 1, 0, 0]
The problem:
- Any two distinct one-hot vectors are orthogonal, so they encode no relationship or similarity between words (see the sketch below). Even if we use dictionaries to add such connections, the result is still incomplete.
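A minimal NumPy sketch of this problem (the vocabulary is made up for illustration): every pair of distinct one-hot vectors has dot product zero, so "motel" looks exactly as unrelated to "hotel" as to "banking".

```python
import numpy as np

vocab = ["motel", "hotel", "chair", "banking"]  # toy vocabulary (illustrative)
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Dot products between distinct one-hot vectors are always 0.
print(one_hot["motel"] @ one_hot["hotel"])    # 0.0
print(one_hot["motel"] @ one_hot["banking"])  # 0.0
```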
Modern deep learning method: representing words by their context
You shall know a word by the company it keeps. ——J. R. Firth
Distributional semantics: a word's meaning is given by the words that frequently appear close by.
When a word $w$ appears in a text, its context is the set of words that appear nearby (within a fixed-size window).
Word vectors
Build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts.
$$
banking =
\left[
\begin{array}{c}
0.286 \\
0.792 \\
-0.177 \\
-0.107 \\
0.109 \\
-0.542 \\
0.349 \\
0.271 \\
\end{array}
\right]
$$
Note: word vectors are also called word embeddings or (neural) word representations. They are a distributed representation.
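One common way to compare such dense vectors is the dot product, or its normalized form, cosine similarity. A small sketch with made-up numbers (the "banking" values above plus a fabricated neighbor; nothing here is a trained embedding):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = a.b / (|a| |b|); close to 1 when the directions are similar.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Illustrative (untrained) vectors only.
banking = np.array([0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271])
monetary = banking + 0.05 * np.random.default_rng(0).standard_normal(8)

print(cosine_similarity(banking, monetary))   # close to 1: similar directions
print(cosine_similarity(banking, -banking))   # -1: opposite directions
```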
Word2vec
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors.
Ideas:
- We have a large corpus of text.
- Every word in a fixed vocabulary is represented by a vector.
- Go through each position $t$ in the text, which has a center word $c$ and context ("outside") words $o$.
- Use the similarity of vectors for $c$ and $o$ to calculate the probability of $o$ given $c$ .
- Keep adjusting the word vectors to maximize this probability.
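A small sketch of the third idea, collecting (center, context) pairs within a fixed window of size $m$ (the example sentence and helper name are only illustrative):

```python
def center_context_pairs(tokens, m=2):
    # For each position t, pair the center word with every word within
    # m positions on either side (skipping the center itself).
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

print(center_context_pairs("problems turning into banking crises".split(), m=2))
```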
Objective function
For each position $t = 1, ..., T$ , predict context words within a window of fixed size $m$ , given center word $w_t$ . Data likelihood:
$$
\text{Likelihood} = L(\theta) = \prod_{t = 1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(w_{t+j} | w_t; \theta)
$$
The objective function $J(\theta)$ is the (average) negative log likelihood:
$$
J(\theta) = - \frac{1}{T} \sum_{t = 1}^{T} \sum_{-m \leq j \leq m,\ j \neq 0} \log P(w_{t+j}|w_t; \theta)
$$
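The connection between the two is just the logarithm, which turns the products in $L(\theta)$ into the sums above:
$$
J(\theta) = -\frac{1}{T} \log L(\theta)
$$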
Minimizing objective function $\Leftrightarrow$ maximizing predictive accuracy
Now we want to minimize the objective function.
Q: How to calculate $P(w_{t+j}|w_t; \theta)$ ?
A: We will use two vectors per word $w$:
- $v_w$ when $w$ is a center word
- $u_w$ when $w$ is a context word
Then for a center word $c$ and a context word $o$ :
$$
P(o|c) = \frac{\exp(u_o^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)}
$$
In fact, this is an example of the softmax function:
$$
\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j = 1}^{n} \exp(x_j)}
$$
The softmax function maps arbitrary values $x_i$ to a probability distribution $p_i$ . This is frequently used in Deep Learning.
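A minimal NumPy sketch of this probability (the matrix `U` and vector `v_c` below are random placeholders, not learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                     # toy vocabulary size and vector dimension
U = rng.normal(size=(V, d))      # context ("outside") vectors u_w, one row per word
v_c = rng.normal(size=d)         # center word vector

scores = U @ v_c                             # dot products u_w^T v_c for every word w
p = np.exp(scores) / np.exp(scores).sum()    # softmax: P(o | c) for each candidate o
print(p.sum())                               # 1.0, a valid probability distribution
```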
Optimize
The objective function $J(\theta)$ is also called the cost or loss function. To train a model, we gradually adjust its parameters to minimize this loss.
Recall that $\theta$ represents all the model parameters in one long vector. With $d$-dimensional vectors and $V$ words, each word having two vectors, $\theta$ is a vector of length $2dV$.
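For concreteness, a small sketch (the sizes are made up) of packing the two sets of vectors into one long parameter vector of length $2dV$:

```python
import numpy as np

V, d = 10_000, 100                    # illustrative vocabulary size and dimension
V_center = np.random.randn(V, d)      # v_w: vectors used when w is the center word
U_context = np.random.randn(V, d)     # u_w: vectors used when w is a context word

theta = np.concatenate([V_center.ravel(), U_context.ravel()])
print(theta.shape)                    # (2 * d * V,) = (2000000,)
```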
We optimize these parameters by walking down the gradient, which means we need to compute the gradients for all of the word vectors. To minimize the objective function, we take its derivative, which comes down to computing the gradient of $\log{P(o|c)}$.
Word2vec derivations of gradient
Useful basic fact:
$$
\frac{\partial x^T a}{\partial x} = \frac{\partial a^T x}{\partial x} = a
$$
Chain Rule:
$$
\frac{d y}{d x} = \frac{d y}{d u} \frac{d u}{d x} = \frac{d f(u)}{d u} \frac{d g(x)}{d x}, \quad \text{where } y = f(u),\ u = g(x)
$$
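Writing out $\log P(o|c)$ with the softmax definition above and applying these two facts gives the intermediate steps:
$$
\frac{\partial}{\partial v_c} \log P(o|c)
= \frac{\partial}{\partial v_c} \left( u_o^T v_c - \log \sum_{w \in V} \exp(u_w^T v_c) \right)
= u_o - \frac{\sum_{x \in V} \exp(u_x^T v_c)\, u_x}{\sum_{w \in V} \exp(u_w^T v_c)}
$$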
Recognizing each ratio in the last term as the softmax probability $P(x|c)$, we finally get:
$$
\frac{\partial}{\partial v_c} \log P(o|c) = u_o - \sum_{x = 1}^{V} P(x|c)\, u_x
$$
This just means $observed - expected$: the vector of the context word we actually observed, minus the expected context word vector under the model's current distribution $P(x|c)$.
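As a quick check, a NumPy sketch of this gradient (the matrix `U`, vector `v_c`, and observed index `o` are random or arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4
U = rng.normal(size=(V, d))           # context vectors u_w, one row per word
v_c = rng.normal(size=d)              # center word vector
o = 3                                 # index of the observed context word (arbitrary)

p = np.exp(U @ v_c)
p /= p.sum()                          # P(x | c) for every x in the vocabulary

grad_v_c = U[o] - p @ U               # "observed minus expected": u_o - sum_x P(x|c) u_x
print(grad_v_c.shape)                 # (4,)
```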
Gradient Descent
Gradient Descent is an algorithm to minimize $J(\theta)$ .
Idea: for the current value of $\theta$, calculate the gradient of $J(\theta)$, then take a small step in the direction of the negative gradient. Repeat.
Update equation (in matrix notation)
$$
\theta^{new} = \theta^{old} - \alpha \nabla_{\theta} J(\theta)
$$
Update equation (for single parameter)
$$
\theta^{new}_j = \theta^{old}_j - \alpha \frac{\partial}{\partial \theta^{old}_j} J(\theta)
$$
Algorithm code:
while True:
    # full-batch gradient of J over the entire corpus
    theta_grad = evaluate_gradient(J, corpus, theta)
    # step in the negative gradient direction, scaled by learning rate alpha
    theta = theta - alpha * theta_grad
Stochastic Gradient Descent
Problem: $J(\theta)$ is a function of all windows in the corpus (potentially billions of them), so $\nabla_{\theta}J(\theta)$ is very expensive to compute. You would wait a very long time before making a single update, which is a bad idea for pretty much all neural nets.
Solution: Stochastic Gradient Descent (SGD), repeatedly sample windows, and update after each one.
Algorithm code:
while True:
    # sample a single window from the corpus
    window = sample_window(corpus)
    # gradient estimated from just that window
    theta_grad = evaluate_gradient(J, window, theta)
    theta = theta - alpha * theta_grad