Module 6 — NLP & Large Language Models · Intermediate · 25 min
Word Embeddings
The Problem with Bag-of-Words
TF-IDF treats every word as an independent symbol:
- “king” and “queen” are completely unrelated (to the model)
- “Paris” and “London” are unrelated
- “happy” and “joyful” are unrelated
But these words clearly have relationships! We need a way to encode meaning.
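To make the problem concrete, here is a minimal sketch (with a made-up three-word vocabulary) showing why bag-of-words can never see these relationships: one-hot word vectors are all orthogonal, so every pair of distinct words scores exactly 0 similarity.

```python
import numpy as np

# Tiny made-up vocabulary: each word is a one-hot vector
vocab = ["king", "queen", "pizza"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Every distinct pair scores 0 -- the model sees no relationship at all
print(cosine(one_hot["king"], one_hot["queen"]))  # 0.0
print(cosine(one_hot["king"], one_hot["pizza"]))  # 0.0
```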
What is a Word Embedding?
A word embedding maps each word to a dense vector of numbers that captures its meaning. Similar words end up with similar vectors.
```python
# Conceptual example (not actual values)
king   = [0.3, 0.9, -0.2, 0.7, ...]   # 300-dimensional vector
queen  = [0.2, 0.8, -0.1, 0.7, ...]   # similar to king!
paris  = [0.9, -0.1, 0.1, 0.3, ...]
london = [0.8, -0.2, 0.1, 0.4, ...]   # similar to paris!
pizza  = [-0.3, 0.1, 0.7, -0.5, ...]  # very different from king
```
The Famous Analogy Property
Word embeddings capture relationships:
king - man + woman ≈ queen
Paris - France + Germany ≈ Berlin
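Why does this arithmetic work? Because directions in the embedding space come to encode relationships. A toy sketch with made-up 2-D vectors (one axis loosely standing in for "royalty", the other for "gender" — purely for illustration) shows the mechanics:

```python
import numpy as np

# Made-up 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (illustration only)
vecs = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
}

# king - man removes the "male" direction; + woman adds the "female" direction
target = vecs["king"] - vecs["man"] + vecs["woman"]

# Nearest word (by Euclidean distance) to the result of the arithmetic
nearest = min(vecs, key=lambda w: np.linalg.norm(vecs[w] - target))
print(nearest)  # queen
```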
Word2Vec
Word2Vec (Google, 2013) learns embeddings by training a shallow network to predict a word from its surrounding context (CBOW) or the context from a word (skip-gram):
```python
# Using Gensim for Word2Vec
from gensim.models import Word2Vec

# Toy corpus (real training uses millions of sentences)
sentences = [
    ["machine", "learning", "is", "powerful"],
    ["deep", "learning", "uses", "neural", "networks"],
    ["neural", "networks", "are", "powerful"],
    ["natural", "language", "processing", "understands", "text"],
    ["word", "embeddings", "capture", "semantic", "meaning"],
    ["king", "and", "queen", "are", "royalty"],
    ["man", "and", "woman", "are", "humans"],
]

# Train Word2Vec
model = Word2Vec(
    sentences=sentences,
    vector_size=50,  # embedding dimension
    window=3,        # context window size
    min_count=1,     # ignore words with freq < min_count
    workers=4,
    epochs=100,
)

# Access embedding for a word
print("Vector for 'learning':", model.wv["learning"][:5], "...")

# Most similar words
print("\nMost similar to 'neural':")
for word, score in model.wv.most_similar("neural", topn=3):
    print(f"  {word}: {score:.4f}")

# Similarity between two words
sim = model.wv.similarity("machine", "deep")
print(f"\nSimilarity(machine, deep): {sim:.4f}")
```
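Under the hood, `model.wv.similarity` is just cosine similarity between the two word vectors. A minimal numpy sketch of the formula (with made-up stand-in vectors, since it works the same on any embedding):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); in [-1, 1] for real vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in 4-d "embeddings" (real ones would come from model.wv)
machine = np.array([0.5, 0.8, -0.1, 0.3])
deep    = np.array([0.4, 0.7,  0.0, 0.2])

print(f"{cosine_similarity(machine, deep):.4f}")      # close to 1: similar directions
print(f"{cosine_similarity(machine, -machine):.4f}")  # -1: opposite directions
```

Note that cosine similarity measures the angle between vectors, not their lengths, which is why it is the standard choice for comparing embeddings.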
Using Pre-trained Embeddings
Training Word2Vec from scratch needs a huge corpus. Instead, use pre-trained vectors:
```python
# Using Gensim's downloader to get pre-trained GloVe embeddings
import gensim.downloader as api

# Download GloVe (100-dimensional, trained on Wikipedia + Gigaword)
# First run downloads ~130MB
glove = api.load("glove-wiki-gigaword-100")

# Word similarity
print(f"cat vs dog:   {glove.similarity('cat', 'dog'):.4f}")    # high
print(f"cat vs table: {glove.similarity('cat', 'table'):.4f}")  # much lower

# Most similar
print("\nMost similar to 'python':")
for w, s in glove.most_similar("python", topn=5):
    print(f"  {w}: {s:.3f}")

# Analogy: king - man + woman ≈ ?
# most_similar with positive/negative excludes the input words themselves,
# so the nearest remaining word is "queen"
result = glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(f"\nking - man + woman ≈ {result[0][0]}")  # queen
```
Using Embeddings in PyTorch
```python
import torch
import torch.nn as nn

# Vocabulary size and embedding dimension
vocab_size = 10000
embed_dim = 128

# Embedding layer: a trainable lookup table of shape (vocab_size, embed_dim)
embedding = nn.Embedding(vocab_size, embed_dim)

# Word indices for a batch of sentences
# (batch_size=2, sequence_length=5)
word_indices = torch.tensor([
    [5, 23, 7, 42, 0],   # sentence 1: "the cat sat on mat"
    [12, 3, 89, 1, 0],   # sentence 2: "dogs are loyal animals"
])

# Get embeddings
embedded = embedding(word_indices)
print(embedded.shape)  # torch.Size([2, 5, 128])

# Load pretrained GloVe into a PyTorch embedding layer
def load_glove_weights(glove_model, vocab, embed_dim):
    """Create an embedding matrix from GloVe vectors."""
    weights = torch.zeros(len(vocab), embed_dim)
    found = 0
    for i, word in enumerate(vocab):
        if word in glove_model:
            weights[i] = torch.tensor(glove_model[word])
            found += 1
    print(f"Loaded {found}/{len(vocab)} words from GloVe")
    return weights
```
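Once you have a weight matrix like the one `load_glove_weights` builds, `nn.Embedding.from_pretrained` turns it into a layer. A sketch using random stand-in weights in place of the real GloVe matrix, so it runs without downloading anything:

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained matrix (50 words, 100-dim); in practice this
# would be the tensor returned by a loader like load_glove_weights
weights = torch.randn(50, 100)

# freeze=True keeps the pretrained vectors fixed during training;
# pass freeze=False to fine-tune them with the rest of the model
embedding = nn.Embedding.from_pretrained(weights, freeze=True)

batch = torch.tensor([[0, 3, 7], [1, 4, 9]])  # (batch_size=2, seq_len=3)
out = embedding(batch)
print(out.shape)                        # torch.Size([2, 3, 100])
print(embedding.weight.requires_grad)   # False (frozen)
```

Freezing is a common default when the training set is small, since fine-tuning pretrained vectors on little data can overfit.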
Contextual Embeddings (Modern Approach)
A limitation of Word2Vec: a word has only one embedding regardless of context.
- “bank” (financial) vs “bank” (river bank) → same vector!
Modern models like BERT solve this with contextual embeddings — the vector depends on the surrounding words.
```python
# Using HuggingFace transformers (covered in detail next lesson)
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token embedding as the sentence representation
    return outputs.last_hidden_state[:, 0, :].squeeze().numpy()

e1 = get_embedding("I went to the bank to deposit money")
e2 = get_embedding("The river bank was beautiful")
e3 = get_embedding("She deposited her savings at the bank")

print("Same 'bank' (financial):", cosine_similarity([e1], [e3])[0][0].round(3))
print("Different 'bank':       ", cosine_similarity([e1], [e2])[0][0].round(3))
# The financial pair scores higher -- the same word gets
# different vectors in different contexts!
```
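The `[CLS]` vector is one choice of sentence representation; mean-pooling the token embeddings (skipping padding) is a common alternative that often works better for similarity tasks. The pooling step itself is simple — a numpy sketch with made-up stand-in token embeddings in place of BERT's `last_hidden_state`:

```python
import numpy as np

# Stand-in token embeddings for one sentence: (seq_len=4, hidden=3),
# where the last position is padding (attention_mask = 0)
hidden = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],   # padding -- must not affect the mean
])
mask = np.array([1, 1, 1, 0])

# Masked mean: sum only the real tokens, divide by how many there are
pooled = (hidden * mask[:, None]).sum(axis=0) / mask.sum()
print(pooled)  # [2. 2. 2.]
```

The same masked-mean applies unchanged to a real `last_hidden_state` tensor with its `attention_mask`.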
Word2Vec captures that "king - man + woman ≈ queen". What property of word embeddings does this demonstrate?