CS224N Lecture 1: Word Vectors

Published on 2023-01-07


Intro & Word Vectors

Even though most of deep learning and even word vector learning seems like magic, it’s not really magic. It’s really just doing math. —— Chris Manning

Meaning of a word

Definition of meaning:

  • the idea that is represented by a word, phrase, etc.
  • the idea that a person wants to express by using words, signs, etc.
  • the idea that is expressed in a work of writing, art, etc.

Commonest linguistic way of thinking of meaning

signifier (symbol) $\Leftrightarrow$ signified (idea or thing)

That’s denotational semantics. For example, the word “chair” denotes an actual chair.

But this model is hard to implement computationally: we don’t know how to build a connection from the word “chair” to an object in the real world.

Common NLP solution

Use dictionaries like WordNet, which contain lists of synonym sets and hypernyms (“is a” relationships).

The problem:

  • Great as a resource but missing nuance
  • Missing new meanings of words
  • Subjective
  • Requires human labor to create and adapt
  • Can’t compute accurate word similarity

Representing words as discrete symbols

Each word is encoded as a one-hot vector (a vector with a single 1 and 0s everywhere else), for example:

motel = [0, 0, 0, 1]
hotel = [0, 1, 0, 0]

The problem:

  • Any two one-hot vectors are orthogonal, so they encode no relationship or similarity between words.

Even if we use dictionaries like WordNet to add these connections, the coverage remains incomplete.
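
A minimal NumPy sketch of this problem (the tiny vocabulary here is made up for illustration): the dot product between any two distinct one-hot vectors is always zero, so the encoding carries no similarity information.

import numpy as np

# Toy vocabulary; the words and their indices are arbitrary.
vocab = ["motel", "hotel", "chair", "banking"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # All zeros except a single 1 at the word's index.
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

motel = one_hot("motel")
hotel = one_hot("hotel")

# Related words still get orthogonal vectors: the dot product is 0.
print(np.dot(motel, hotel))  # 0.0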

Modern deep learning method: representing words by their context

You shall know a word by the company it keeps. —— J. R. Firth

Distributional semantics: a word’s meaning is given by the words that frequently appear close by.

When a word $w$ appears in a text, its context is the set of words that appear nearby (within a fixed-size window).

Word vectors

Build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts.

$$
\text{banking} =
\left[
\begin{array}{c}
0.286 \\
0.792 \\
-0.177 \\
-0.107 \\
0.109 \\
-0.542 \\
0.349 \\
0.271 \\
\end{array}
\right]
$$

Note: word vectors are also called word embeddings or (neural) word representations. They are a distributed representation.

Word2vec

Word2vec (Mikolov et al. 2013) is a framework for learning word vectors.

Ideas:

  • We have a large corpus of text.
  • Every word in a fixed vocabulary is represented by a vector.
  • Go through each position $t$ in the text, which has a center word $c$ and context (“outside”) words $o$ (see the sketch after this list).
  • Use the similarity of vectors for $c$ and $o$ to calculate the probability of $o$ given $c$ .
  • Keep adjusting the word vectors to maximize this probability.
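
As a minimal sketch of the window extraction in the ideas above (the toy corpus and window size are illustrative):

corpus = "the quick brown fox jumps over the lazy dog".split()
m = 2  # fixed window size

# For every position t, pair the center word with each context word
# within m positions on either side.
pairs = []
for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        if j != 0 and 0 <= t + j < len(corpus):
            pairs.append((center, corpus[t + j]))

print(pairs[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]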

Objective function

For each position $t = 1, ..., T$ , predict context words within a window of fixed size $m$ , given center word $w_t$ . Data likelihood:

$$
\text{Likelihood} = L(\theta) = \prod_{t = 1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(w_{t+j}|w_t; \theta)
$$
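
As a concrete check of the notation: for a tiny corpus with $T = 3$ and window size $m = 1$ (dropping out-of-range positions at the boundaries), the product unrolls to

$$
L(\theta) = P(w_2|w_1; \theta) \cdot P(w_1|w_2; \theta)\, P(w_3|w_2; \theta) \cdot P(w_2|w_3; \theta)
$$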

The objective function $J(\theta)$ is the (average) negative log likelihood:

$$
J(\theta) = - \frac{1}{T} \sum_{t = 1}^{T} \sum_{-m \leq j \leq m,\ j \neq 0} \log P(w_{t+j}|w_t; \theta)
$$

Minimizing objective function $\Leftrightarrow$ maximizing predictive accuracy
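
Explicitly, since the log of a product is a sum of logs, the objective and the likelihood are related by

$$
J(\theta) = -\frac{1}{T} \log L(\theta)
$$

so minimizing $J(\theta)$ is the same as maximizing $L(\theta)$.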

Now we want to minimize the objective function.

Q: How to calculate $P(w_{t+j}|w_t; \theta)$ ?

A: We will use two vectors per word $w$:

  • $v_w$ when $w$ is a center word
  • $u_w$ when $w$ is a context word

Then for a center word $c$ and a context word $o$ :

$$
P(o|c) = \frac{\exp(u_o^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)}
$$

In fact, this is an example of the softmax function:

$$
\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j = 1}^{n} \exp(x_j)}
$$

The softmax function maps arbitrary values $x_i$ to a probability distribution $p_i$ . This is frequently used in Deep Learning.
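
A minimal NumPy sketch of the two formulas above (the vectors here are random placeholders, not trained parameters):

import numpy as np

def softmax(x):
    # Map arbitrary scores to a probability distribution.
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
V, d = 5, 4                     # toy vocabulary size and vector dimension
U = rng.normal(size=(V, d))     # context ("outside") vectors u_w, one row per word
Vecs = rng.normal(size=(V, d))  # center vectors v_w, one row per word

c, o = 2, 3                     # indices of the center word and one context word
scores = U @ Vecs[c]            # u_w^T v_c for every word w in the vocabulary
p = softmax(scores)             # P(w|c) for every w; p[o] is P(o|c)
print(p.sum(), p[o])            # the probabilities sum to 1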

Optimize

The objective function $J(\theta)$ is also called the cost or loss function. To train a model, we gradually adjust the parameters to minimize the loss.

Recall that $\theta$ represents all the model parameters in one long vector. With $d$-dimensional vectors, $V$ words, and two vectors per word, $\theta$ is a vector of length $2dV$. For example, with $d = 100$ and a vocabulary of $400{,}000$ words, that is 80 million parameters.

We optimize these parameters by walking down the gradient; that means computing the gradients for all of the word vectors.

To minimize the objective function, we need its gradient. Since $J(\theta)$ is an average of $-\log P(w_{t+j}|w_t; \theta)$ terms, this comes down to computing the gradient of $\log P(o|c)$.

Word2vec derivations of gradient

Useful basic fact:

$$
\frac{\partial x^T a}{\partial x} = \frac{\partial a^T x}{\partial x} = a
$$
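
This follows by writing the dot product component-wise:

$$
x^T a = \sum_i x_i a_i \quad \Rightarrow \quad \frac{\partial}{\partial x_i} x^T a = a_i
$$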

Chain Rule: if $y = f(u)$ and $u = g(x)$, then
$$
\frac{d y}{d x} = \frac{d y}{d u} \frac{d u}{d x} = \frac{d f(u)}{d u} \frac{d g(x)}{d x}
$$
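
Filling in the intermediate step between the chain rule and the result below: write $\log P(o|c) = u_o^T v_c - \log \sum_{w \in V} \exp(u_w^T v_c)$ and differentiate each term with respect to $v_c$:

$$
\frac{\partial}{\partial v_c} u_o^T v_c = u_o,
\qquad
\frac{\partial}{\partial v_c} \log \sum_{w \in V} \exp(u_w^T v_c)
= \frac{\sum_{x \in V} \exp(u_x^T v_c)\, u_x}{\sum_{w \in V} \exp(u_w^T v_c)}
= \sum_{x \in V} P(x|c)\, u_x
$$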

And finally we could get:

$$
\frac{\partial}{\partial v_c} \log P(o|c) = u_o - \sum_{x = 1}^{V} P(x|c)\, u_x
$$

This has the form $\text{observed} - \text{expected}$: the observed context word vector $u_o$ minus the context vector the model expects, $\sum_{x} P(x|c)\, u_x$.
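
A minimal self-contained NumPy check of this formula with random placeholder vectors, comparing the analytic gradient against a numerical estimate:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
V, d = 5, 4
U = rng.normal(size=(V, d))        # context vectors u_w, one row per word
v_c = rng.normal(size=d)           # center vector v_c
o = 3                              # index of the observed context word

p = softmax(U @ v_c)               # P(x|c) for every x in the vocabulary
grad = U[o] - p @ U                # "observed" minus "expected" context vector

# Numerical gradient of log P(o|c) with respect to v_c, by central differences.
log_p_o = lambda v: np.log(softmax(U @ v)[o])
eps = 1e-6
num_grad = np.array([
    (log_p_o(v_c + eps * e) - log_p_o(v_c - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(grad, num_grad, atol=1e-5))  # True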

Gradient Descent

Gradient Descent is an algorithm to minimize $J(\theta)$ .

Idea: for the current value of $\theta$, calculate the gradient of $J(\theta)$, then take a small step in the direction of the negative gradient. Repeat.

Update equation (in matrix notation)

$$
\theta^{new} = \theta^{old} - \alpha \nabla_{\theta} J(\theta)
$$

Update equation (for single parameter)

$$
\theta^{new}_j = \theta^{old}_j - \alpha \frac{\partial}{\partial \theta^{old}_j} J(\theta)
$$

Algorithm code:

while True:
    # Gradient of J over the entire corpus -- expensive for a large corpus.
    theta_grad = evaluate_gradient(J, corpus, theta)
    theta = theta - alpha * theta_grad  # alpha is the step size (learning rate)

Stochastic Gradient Descent

Problem: $J(\theta)$ is a function of all windows in the corpus (potentially billions), so $\nabla_{\theta}J(\theta)$ is very expensive to compute. You would wait a very long time before making a single update. That’s a bad idea for pretty much all neural nets.

Solution: Stochastic Gradient Descent (SGD): repeatedly sample windows and update after each one.

Algorithm code:

while True:
    window = sample_window(corpus)                    # one sampled window
    theta_grad = evaluate_gradient(J, window, theta)  # gradient on that window only
    theta = theta - alpha * theta_grad
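
Putting the pieces together, here is a minimal self-contained sketch of one SGD step on a single sampled window; the toy corpus, window size, learning rate, and initialization are all illustrative choices rather than the lecture’s exact setup:

import numpy as np

rng = np.random.default_rng(0)
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, d, m, alpha = len(vocab), 8, 2, 0.05  # vocab size, dimension, window, learning rate

# Two vectors per word: center vectors (rows of Vecs) and context vectors (rows of U).
Vecs = rng.normal(scale=0.1, size=(V, d))
U = rng.normal(scale=0.1, size=(V, d))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Sample one window, then update on each (center, context) pair inside it.
t = rng.integers(len(corpus))
c = idx[corpus[t]]
for j in range(-m, m + 1):
    if j == 0 or not (0 <= t + j < len(corpus)):
        continue
    o = idx[corpus[t + j]]
    p = softmax(U @ Vecs[c])           # P(x|c) for every x in the vocabulary
    grad_vc = -(U[o] - p @ U)          # d/dv_c of -log P(o|c)
    grad_U = np.outer(p, Vecs[c])      # d/du_x of -log P(o|c) = (P(x|c) - [x == o]) v_c
    grad_U[o] -= Vecs[c]
    Vecs[c] -= alpha * grad_vc         # theta_new = theta_old - alpha * gradient
    U -= alpha * grad_U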