Assignment 1 Word Vectors
Part 1 Count-Based Word Vectors
Construction Steps
- Download and read the corpus.
- Collect every unique word in the corpus into a list named `words`, and build a mapping from each word to its index in that list named `word2ind`.
- With the corpus, `words`, and `word2ind`, construct a co-occurrence matrix (a sketch of these first steps follows this list).
- Reduce the co-occurrence matrix from $n * n$ dimensions to $n * k$ dimensions using SVD.
- Plot the reduced co-occurrence matrix, which is $n * 2$ dimensional.
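A minimal sketch of the first two steps, assuming the corpus is a list of tokenized sentences (each a list of strings); the function name `distinct_words` is illustrative, not the assignment's official solution:

```python
def distinct_words(corpus):
    """Collect the sorted list of unique words and a word-to-index mapping.

    Params:
        corpus (list of list of str): tokenized sentences
    Return:
        words (list of str), word2ind (dict mapping word -> index)
    """
    # flatten all sentences, deduplicate with a set, then sort for a stable order
    words = sorted({w for sentence in corpus for w in sentence})
    word2ind = {w: i for i, w in enumerate(words)}
    return words, word2ind
```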
Co-occurrence Matrix
Assume we have $n$ unique words in the corpus. The co-occurrence matrix will then be $n * n$ dimensional, and the element $X_{ij}$ means that `words[j]` appears $X_{ij}$ times within the window whose center word is `words[i]`.
Note that every sentence in the corpus has a marker `<START>` inserted at its beginning and `<END>` at its end, and these markers are also counted in the co-occurrence matrix.
For example:
Sentence 1: “all that glitters is not gold”
Sentence 2: “all is well that ends well”
Then, with a window size of 1, the matrix will look like this:
| * | `<START>` | all | that | glitters | is | not | gold | well | ends | `<END>` |
|---|---|---|---|---|---|---|---|---|---|---|
| `<START>` | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| all | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| that | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| glitters | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| is | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| not | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| gold | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| well | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| ends | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| `<END>` | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
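A minimal sketch of how such a matrix can be built, assuming the corpus is a list of tokenized sentences that already include the `<START>` and `<END>` markers, and reusing the `distinct_words` helper sketched above; `compute_co_occurrence_matrix` is an illustrative name:

```python
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=4):
    """Build the n * n co-occurrence count matrix for the given corpus."""
    words, word2ind = distinct_words(corpus)
    n = len(words)
    M = np.zeros((n, n))
    for sentence in corpus:
        for i, center in enumerate(sentence):
            # look at neighbors within `window_size` positions on each side
            lo, hi = max(0, i - window_size), min(len(sentence), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    M[word2ind[center], word2ind[sentence[j]]] += 1
    return M, word2ind
```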
Singular Value Decomposition (SVD)
In simple terms, given a matrix $M$ with $n * m$ dimensions, we can write $M = U \Sigma V^T$, where $U$ is an $n * n$ matrix, $V$ is an $m * m$ matrix, and $\Sigma$ is a diagonal matrix whose diagonal entries are the singular values $\sigma_1, ..., \sigma_t$ with $t = \min(n, m)$.
We can cut off some dimensions to compress the information: keeping only the $k$ largest singular values, $U$ becomes $n * k$, $\Sigma$ keeps $k$ values, and $V^T$ becomes $k * m$. The new $M$ still retains much of the information in the original matrix, and the larger $k$ is, the more information is retained.
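A small numpy illustration of this truncation (a sketch on random data, using `np.linalg.svd` directly rather than scikit-learn):

```python
import numpy as np

n, m, k = 6, 4, 2
M = np.random.rand(n, m)

# full SVD: U is n*n, s holds min(n, m) singular values, Vt is m*m
U, s, Vt = np.linalg.svd(M, full_matrices=True)

# keep only the k largest singular values (they come sorted in descending order)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# rank-k approximation of M
M_k = U_k @ np.diag(s_k) @ Vt_k

# the reduced word embeddings correspond to U * S
M_reduced = U_k * s_k  # broadcasting multiplies each column of U_k by the matching singular value
print(M_reduced.shape)  # (n, k)
```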
In practice, we can use `sklearn.decomposition.TruncatedSVD`. For example:
```python
from sklearn.decomposition import TruncatedSVD

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

        Params:
            M (numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10  # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))

    # ------------------
    # Write your implementation here.
    # n_components specifies the output (reduced) dimension;
    # n_iter is the number of iterations for the randomized SVD solver.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)
    # ------------------

    print("Done.")
    return M_reduced
```
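A hedged usage sketch tying the pieces together and plotting the 2-dimensional embeddings with matplotlib; the corpus here is just the two example sentences and the variable names are illustrative:

```python
import matplotlib.pyplot as plt

corpus = [
    ["<START>"] + "all that glitters is not gold".split() + ["<END>"],
    ["<START>"] + "all is well that ends well".split() + ["<END>"],
]

M, word2ind = compute_co_occurrence_matrix(corpus, window_size=1)
M_reduced = reduce_to_k_dim(M, k=2)

# scatter-plot each word at its 2-D coordinates
for word, i in word2ind.items():
    x, y = M_reduced[i, 0], M_reduced[i, 1]
    plt.scatter(x, y, marker="x", color="red")
    plt.text(x + 0.01, y + 0.01, word)
plt.show()
```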
Part 2 Prediction-Based Word Vectors
Typical models of this kind are word2vec and GloVe. In this assignment, we use GloVe.
Construction Steps
Considering PC performance, we first draw 10,000 words and their corresponding vectors at random from the pretrained model, and then top the list up so that it still contains all of the originally required words (a sketch of this step follows below).
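A minimal sketch of this sampling step, assuming the GloVe vectors are loaded as a gensim `KeyedVectors` model (for example via `gensim.downloader`); the dataset name, the `get_matrix_of_vectors` helper, and the required-word list are assumptions for illustration only:

```python
import random
import numpy as np
import gensim.downloader as api

def get_matrix_of_vectors(wv, required_words, n_samples=10000):
    """Stack vectors for a random sample of words plus all required words into a matrix M."""
    words = random.sample(list(wv.index_to_key), n_samples)
    # top the list up with the required words so none of them is missing
    words += [w for w in required_words if w in wv and w not in words]
    word2ind = {w: i for i, w in enumerate(words)}
    M = np.stack([wv[w] for w in words])
    return M, word2ind

wv = api.load("glove-wiki-gigaword-200")  # pretrained GloVe vectors (assumed dataset name)
M, word2ind = get_matrix_of_vectors(wv, required_words=["gold", "silver"])
```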
Now we have a matrix $M$, a words list, and word2ind again. We reduce the dimensionality of $M$ in the same way as before. Furthermore, we normalize each row of the reduced matrix to unit length, so that the vectors can be compared by direction alone. The technique used here is called broadcasting, which is not hard to understand and you can learn by yourself. The code will look like this:
```python
M_lengths = np.linalg.norm(M_reduced, axis=1)                 # L2 norm of each row
M_reduced_normalized = M_reduced / M_lengths[:, np.newaxis]   # broadcasting: divide each row by its norm
```
Now we can plot again.
Cosine Similarity
For two vectors, we could compare the angle between them to determine similarity. In practice, we calculate:
$$
s = \frac{p \cdot q}{||p||||q||}
$$
as their similarity, which is a cosine value between -1 and 1.
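As a quick numpy illustration of the formula (a sketch with made-up vectors):

```python
import numpy as np

def cosine_similarity(p, q):
    # s = (p . q) / (||p|| * ||q||)
    return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(p, q))  # 1.0, since q points in the same direction as p
```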
With the `most_similar()` function, we can get a list of words that are most similar to a given one, or we can specify the `positive=[]` and `negative=[]` parameters to do analogies. With the `distance()` function, we can get the cosine distance (one minus the cosine similarity) between two given words.
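For example, with a gensim `KeyedVectors` model `wv` (continuing the assumption above), these calls look like:

```python
# words most similar to "gold"
print(wv.most_similar("gold", topn=5))

# analogy: king - man + woman ~ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# cosine distance (1 - cosine similarity) between two words
print(wv.distance("gold", "silver"))
```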