CS224N Assignment 1: Exploring Word Vectors

Published on 2023-01-07


Assignment 1 Word Vectors

Part 1 Count-Based Word Vectors

Construct Steps

  1. Download and read the corpus.
  2. Find every unique word in the corpus, put them all in one list named words, and construct a mapping from each word to its index in that list, named word2ind (see the sketch after this list).
  3. With the corpus, words, and word2ind, construct a co-occurrence matrix.
  4. Reduce the co-occurrence matrix from $n \times n$ dimensions to $n \times k$ dimensions using SVD.
  5. Plot the reduced co-occurrence matrix, which is $n \times 2$ dimensional.
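
A minimal sketch of steps 1–2, assuming corpus is a list of tokenized sentences (each a list of word strings), as in the assignment notebook:

def distinct_words(corpus):
    """ Return a sorted list of the distinct words in the corpus and how many there are. """
    # flatten the corpus into a set of unique words, then sort for a stable ordering
    corpus_words = sorted({w for sentence in corpus for w in sentence})
    return corpus_words, len(corpus_words)

words, num_words = distinct_words(corpus)
word2ind = {w: i for i, w in enumerate(words)}   # word -> row/column index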

Co-occurrence matrix

Assume we have $n$ words in the corpus. The co-occurrence matrix will then be $n \times n$ dimensional, and element $X_{ij}$ means that words[j] appears $X_{ij}$ times within the window whose center word is words[i].

We should also note that every sentence in the corpus gets a <START> token inserted at its beginning and an <END> token at its end, and these tokens are counted in the co-occurrence matrix as well.

For example:

Sentence 1: “all that glitters is not gold”

Sentence 2: “all is well that ends well”

Then the matrix (computed with a window size of 1) will be:

| *        | <START> | all | that | glitters | is | not | gold | well | ends | <END> |
|----------|---------|-----|------|----------|----|-----|------|------|------|-------|
| <START>  | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| all      | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| that     | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| glitters | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| is       | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| not      | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| gold     | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| well     | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| ends     | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| <END>    | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
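
A minimal sketch of step 3, reusing distinct_words from the sketch above; the table above corresponds to window_size=1, and window_size is a free parameter here:

import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=1):
    """ Count how often each word appears within window_size tokens of each center word. """
    words, num_words = distinct_words(corpus)
    word2ind = {w: i for i, w in enumerate(words)}
    M = np.zeros((num_words, num_words))

    for sentence in corpus:
        for i, center in enumerate(sentence):
            # indices of the context words surrounding the center word
            lo, hi = max(0, i - window_size), min(len(sentence), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    M[word2ind[center], word2ind[sentence[j]]] += 1
    return M, word2ind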

Singular Value Decomposition (SVD)

In simple terms, given a matrix $M$ with $n \times m$ dimensions, we can write $M = U \Sigma V^T$, where $U$ is an $n \times n$ matrix, $V$ is an $m \times m$ matrix, and $\Sigma$ is a diagonal matrix whose entries are the singular values $\sigma_1, ..., \sigma_t$, where $t$ is at most the minimum of $n$ and $m$.

We can cut off some dimensions to compress the information. If we keep only the $k$ largest singular values (with $k \leq \min(n, m)$), then $U$ has $n \times k$ dimensions, $\Sigma$ has $k$ values, and $V^T$ has $k \times m$ dimensions. The reconstructed $M$ still retains much of the information of the original matrix, and the larger $k$ is, the more information is retained.

In practice, we can use sklearn.decomposition.TruncatedSVD .

For example:

from sklearn.decomposition import TruncatedSVD

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

        Params:
            M (numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))

    # ------------------
    # Write your implementation here.

    # n_components is the output dimensionality k;
    # n_iter is the number of iterations used by the randomized SVD solver.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)

    # ------------------

    print("Done.")
    return M_reduced
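
A possible way to call it, assuming M_co_occurrence comes from the compute_co_occurrence_matrix sketch above:

M_co_occurrence, word2ind = compute_co_occurrence_matrix(corpus, window_size=4)
M_reduced = reduce_to_k_dim(M_co_occurrence, k=2)   # each word is now a 2-dimensional vector
print(M_reduced.shape)                              # (number of unique words, 2)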

Part 2 Prediction-Based Word Vectors

Typical models include word2vec and GloVe. In this assignment, we use GloVe.

Construct Steps

Considering PC performance, we first draw 10,000 words and their corresponding vectors at random from the model. Then we complete the list so that it still contains all of the words used in Part 1.
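
A rough sketch of this setup, assuming the vectors are loaded through gensim's downloader; the model name glove-wiki-gigaword-200 and the sampling details are assumptions for illustration, not necessarily the exact notebook code:

import random
import numpy as np
import gensim.downloader as api

wv_from_bin = api.load("glove-wiki-gigaword-200")   # pretrained GloVe vectors (assumed model name)

# Randomly sample 10,000 words, then add back the Part 1 vocabulary (`words`) so it stays covered.
all_words = list(wv_from_bin.index_to_key)
random.shuffle(all_words)
sampled = all_words[:10000]
sampled.extend(w for w in words if w in wv_from_bin and w not in sampled)

M = np.stack([wv_from_bin[w] for w in sampled])     # shape: (len(sampled), 200)
word2ind = {w: i for i, w in enumerate(sampled)}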

For now, we get a matrix $M$, a words list, and word2ind again. We reduce the dimensionality of $M$ in the same way. Furthermore, we normalize each row of the matrix by its length, so that every word vector becomes a unit vector. The technique used is called broadcasting; it is not hard to understand, and you can learn it on your own. The code looks like:

M_lengths = np.linalg.norm(M_reduced, axis=1)                 # L2 norm of each row
M_reduced_normalized = M_reduced / M_lengths[:, np.newaxis]   # broadcasting: divide each row by its length

Now we can plot again.

Cosine Similarity

For two vectors, we could compare the angle between them to determine similarity. In practice, we calculate:

$$
s = \frac{p \cdot q}{||p||||q||}
$$

as their similarity, which is a cosine value between -1 and 1.
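
In numpy this is just a dot product divided by the product of the two norms, for example:

import numpy as np

def cosine_similarity(p, q):
    # cosine of the angle between p and q, always in [-1, 1]
    return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))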

With the most_similar() function, we can get a list of the words most similar to a given one. Or we can specify the positive=[] and negative=[] parameters to solve analogies.

With the distance() function, we can get the cosine distance (1 minus the cosine similarity) between two given words.
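
For example, with the wv_from_bin KeyedVectors object loaded earlier (the word choices are just illustrative):

# ten nearest neighbours of a word by cosine similarity
print(wv_from_bin.most_similar("leaves", topn=10))

# analogy "man : king :: woman : ?" via positive/negative word lists
print(wv_from_bin.most_similar(positive=["woman", "king"], negative=["man"]))

# cosine distance = 1 - cosine similarity
print(wv_from_bin.distance("happy", "sad"))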