Module 6 — NLP & Large Language Models · Intermediate · 25 min
Word Embeddings
The Problem with Bag-of-Words
TF-IDF treats every word as an independent symbol:
- “king” and “queen” are completely unrelated (to the model)
- “Paris” and “London” are unrelated
- “happy” and “joyful” are unrelated
But these words clearly have relationships! We need a way to encode meaning.
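To make the problem concrete, here is a minimal sketch (with a made-up three-word vocabulary) showing why bag-of-words can never see these relationships: one-hot word vectors are all orthogonal, so every pair of distinct words scores exactly 0 similarity.

```python
import numpy as np

# Tiny made-up vocabulary: each word is a one-hot vector
vocab = ["king", "queen", "pizza"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Every distinct pair scores 0 -- the model sees no relationship at all
print(cosine(one_hot["king"], one_hot["queen"]))  # 0.0
print(cosine(one_hot["king"], one_hot["pizza"]))  # 0.0
```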
What is a Word Embedding?
A word embedding maps each word to a dense vector of numbers that captures its meaning. Similar words end up with similar vectors.
```python
# Conceptual example (not actual values)
king   = [0.3, 0.9, -0.2, 0.7, ...]   # 300-dimensional vector
queen  = [0.2, 0.8, -0.1, 0.7, ...]   # similar to king!
paris  = [0.9, -0.1, 0.1, 0.3, ...]
london = [0.8, -0.2, 0.1, 0.4, ...]   # similar to paris!
pizza  = [-0.3, 0.1, 0.7, -0.5, ...]  # very different from king
```
The Famous Analogy Property
Word embeddings capture relationships:
king - man + woman ≈ queen
Paris - France + Germany ≈ Berlin
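Why does this arithmetic work? Because directions in the embedding space come to encode relationships. A toy sketch with made-up 2-D vectors (one axis loosely standing in for "royalty", the other for "gender" — purely for illustration) shows the mechanics:

```python
import numpy as np

# Made-up 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (illustration only)
vecs = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
}

# king - man removes the "male" direction; + woman adds the "female" direction
target = vecs["king"] - vecs["man"] + vecs["woman"]

# Nearest word (by Euclidean distance) to the result of the arithmetic
nearest = min(vecs, key=lambda w: np.linalg.norm(vecs[w] - target))
print(nearest)  # queen
```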
Word2Vec
Word2Vec (Google, 2013) learns embeddings by training a shallow network to predict a word from its surrounding context (CBOW) or the context from a word (skip-gram):
```python
# Using Gensim for Word2Vec
from gensim.models import Word2Vec

# Toy corpus (real training uses millions of sentences)
sentences = [
    ["machine", "learning", "is", "powerful"],
    ["deep", "learning", "uses", "neural", "networks"],
    ["neural", "networks", "are", "powerful"],
    ["natural", "language", "processing", "understands", "text"],
    ["word", "embeddings", "capture", "semantic", "meaning"],
    ["king", "and", "queen", "are", "royalty"],
    ["man", "and", "woman", "are", "humans"],
]

# Train Word2Vec
model = Word2Vec(
    sentences=sentences,
    vector_size=50,  # embedding dimension
    window=3,        # context window size
    min_count=1,     # ignore words with freq < min_count
    workers=4,
    epochs=100,
)

# Access embedding for a word
print("Vector for 'learning':", model.wv["learning"][:5], "...")

# Most similar words
print("\nMost similar to 'neural':")
for word, score in model.wv.most_similar("neural", topn=3):
    print(f"  {word}: {score:.4f}")

# Similarity between two words
sim = model.wv.similarity("machine", "deep")
print(f"\nSimilarity(machine, deep): {sim:.4f}")
```
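Under the hood, `model.wv.similarity` is just cosine similarity between the two word vectors. A minimal numpy sketch of the formula (with made-up stand-in vectors, since it works the same on any embedding):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); in [-1, 1] for real vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in 4-d "embeddings" (real ones would come from model.wv)
machine = np.array([0.5, 0.8, -0.1, 0.3])
deep    = np.array([0.4, 0.7,  0.0, 0.2])

print(f"{cosine_similarity(machine, deep):.4f}")      # close to 1: similar directions
print(f"{cosine_similarity(machine, -machine):.4f}")  # -1: opposite directions
```

Note that cosine similarity measures the angle between vectors, not their lengths, which is why it is the standard choice for comparing embeddings.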
Using Pre-trained Embeddings
Training Word2Vec from scratch needs a huge corpus. Instead, use pre-trained vectors:
```python
# Using Gensim's downloader to get pre-trained GloVe embeddings
import gensim.downloader as api

# Download GloVe (100-dimensional, trained on Wikipedia + Gigaword)
# First run downloads ~130MB
glove = api.load("glove-wiki-gigaword-100")

# Word similarity
print(f"cat vs dog:   {glove.similarity('cat', 'dog'):.4f}")    # high
print(f"cat vs table: {glove.similarity('cat', 'table'):.4f}")  # much lower

# Most similar
print("\nMost similar to 'python':")
for w, s in glove.most_similar("python", topn=5):
    print(f"  {w}: {s:.3f}")

# Analogy: king - man + woman ≈ ?
# most_similar with positive/negative excludes the input words themselves,
# so the nearest remaining word is "queen"
result = glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(f"\nking - man + woman ≈ {result[0][0]}")  # queen
```
Using Embeddings in PyTorch
```python
import torch
import torch.nn as nn

# Vocabulary size and embedding dimension
vocab_size = 10000
embed_dim = 128

# Embedding layer: a trainable lookup table of shape (vocab_size, embed_dim)
embedding = nn.Embedding(vocab_size, embed_dim)

# Word indices for a batch of sentences
# (batch_size=2, sequence_length=5)
word_indices = torch.tensor([
    [5, 23, 7, 42, 0],   # sentence 1: "the cat sat on mat"
    [12, 3, 89, 1, 0],   # sentence 2: "dogs are loyal animals"
])

# Get embeddings
embedded = embedding(word_indices)
print(embedded.shape)  # torch.Size([2, 5, 128])

# Load pretrained GloVe into a PyTorch embedding layer
def load_glove_weights(glove_model, vocab, embed_dim):
    """Create an embedding matrix from GloVe vectors."""
    weights = torch.zeros(len(vocab), embed_dim)
    found = 0
    for i, word in enumerate(vocab):
        if word in glove_model:
            weights[i] = torch.tensor(glove_model[word])
            found += 1
    print(f"Loaded {found}/{len(vocab)} words from GloVe")
    return weights
```
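Once you have a weight matrix like the one `load_glove_weights` builds, `nn.Embedding.from_pretrained` turns it into a layer. A sketch using random stand-in weights in place of the real GloVe matrix, so it runs without downloading anything:

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained matrix (50 words, 100-dim); in practice this
# would be the tensor returned by a loader like load_glove_weights
weights = torch.randn(50, 100)

# freeze=True keeps the pretrained vectors fixed during training;
# pass freeze=False to fine-tune them with the rest of the model
embedding = nn.Embedding.from_pretrained(weights, freeze=True)

batch = torch.tensor([[0, 3, 7], [1, 4, 9]])  # (batch_size=2, seq_len=3)
out = embedding(batch)
print(out.shape)                        # torch.Size([2, 3, 100])
print(embedding.weight.requires_grad)   # False (frozen)
```

Freezing is a common default when the training set is small, since fine-tuning pretrained vectors on little data can overfit.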
Contextual Embeddings (Modern Approach)
A limitation of Word2Vec: a word has only one embedding regardless of context.
- “bank” (financial) vs “bank” (river bank) → same vector!
Modern models like BERT solve this with contextual embeddings — the vector depends on the surrounding words.
```python
# Using HuggingFace transformers (covered in detail next lesson)
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token embedding as the sentence representation
    return outputs.last_hidden_state[:, 0, :].squeeze().numpy()

e1 = get_embedding("I went to the bank to deposit money")
e2 = get_embedding("The river bank was beautiful")
e3 = get_embedding("She deposited her savings at the bank")

print("Same 'bank' (financial):", cosine_similarity([e1], [e3])[0][0].round(3))
print("Different 'bank':       ", cosine_similarity([e1], [e2])[0][0].round(3))
# The financial pair scores higher -- the same word gets
# different vectors in different contexts!
```
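The `[CLS]` vector is one choice of sentence representation; mean-pooling the token embeddings (skipping padding) is a common alternative that often works better for similarity tasks. The pooling step itself is simple — a numpy sketch with made-up stand-in token embeddings in place of BERT's `last_hidden_state`:

```python
import numpy as np

# Stand-in token embeddings for one sentence: (seq_len=4, hidden=3),
# where the last position is padding (attention_mask = 0)
hidden = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],   # padding -- must not affect the mean
])
mask = np.array([1, 1, 1, 0])

# Masked mean: sum only the real tokens, divide by how many there are
pooled = (hidden * mask[:, None]).sum(axis=0) / mask.sum()
print(pooled)  # [2. 2. 2.]
```

The same masked-mean applies unchanged to a real `last_hidden_state` tensor with its `attention_mask`.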
Word2Vec captures that "king - man + woman ≈ queen". What property of word embeddings does this demonstrate?