Natural Language Processing with RNNs and LSTMs
Before BERT and GPT changed everything, Recurrent Neural Networks (RNNs) and their evolved form, Long Short-Term Memory networks (LSTMs), were the dominant architectures for NLP. Understanding them provides crucial context for modern approaches.
Why Sequences Need Special Treatment
Traditional neural networks assume inputs are independent. But language is inherently sequential—the meaning of a word depends on context.
“The bank was flooded” could mean:
- A financial institution had a plumbing issue
- A riverbank overflowed
The preceding words determine the meaning. We need models that understand sequence and context.
Recurrent Neural Networks (RNNs)
RNNs process sequences by maintaining a hidden state that gets updated at each step.
# Pseudocode for RNN
hidden_state = initial_state
for word in sentence:
    hidden_state = update(hidden_state, word)
output = compute_output(hidden_state)
The hidden state is the network’s “memory”—it carries information from previous steps.
The Math
At each time step t:
h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b)
y_t = W_hy * h_t
Where:
- h_t is the hidden state
- x_t is the input (word embedding)
- W_* are weight matrices
- y_t is the output
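As a concrete illustration, here is the recurrence for a single hidden unit in plain Python. The scalar weights and the toy "embedded" inputs are assumptions chosen for the sketch, not values from the text:

```python
import math

def rnn_step(h_prev, x_t, w_hh, w_xh, b):
    # One update of the recurrence: h_t = tanh(w_hh * h_{t-1} + w_xh * x_t + b)
    return math.tanh(w_hh * h_prev + w_xh * x_t + b)

# Run the recurrence over a toy three-word sentence (scalar embeddings)
h = 0.0
for x in [0.5, -1.0, 0.25]:
    h = rnn_step(h, x, w_hh=0.8, w_xh=0.5, b=0.0)
```

Note how tanh keeps the hidden state bounded in (-1, 1) no matter how long the sequence gets.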
The Vanishing Gradient Problem
RNNs struggle with long sequences. During backpropagation, gradients either:
- Vanish: Multiply many small numbers → gradients approach zero
- Explode: Multiply many large numbers → gradients overflow
This makes it hard to learn long-range dependencies. The network “forgets” early tokens.
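The effect is easy to see numerically. Backpropagating through T identical steps multiplies the gradient by roughly (w · tanh′)ᵀ, so the signal shrinks or blows up exponentially. The weight values and preactivation below are illustrative assumptions:

```python
import math

def grad_scale(w, steps, a=0.5):
    # Each backprop step multiplies the gradient by w * tanh'(a),
    # where tanh'(a) = 1 - tanh(a)**2 is at most 1.
    d = 1 - math.tanh(a) ** 2
    return (w * d) ** steps

vanishing = grad_scale(0.9, 50)   # shrinks toward zero
exploding = grad_scale(1.5, 50)   # blows up
```

With 50 steps, even modest per-step factors compound into vanishingly small or astronomically large gradients.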
Long Short-Term Memory (LSTM)
LSTMs solve the vanishing gradient problem with a more sophisticated cell structure.
The Key Innovation: Gates
LSTMs have three gates that control information flow:
- Forget Gate: What to discard from memory
- Input Gate: What new information to add
- Output Gate: What to output from memory
Plus a cell state that carries information across long distances.
# LSTM pseudocode
def lstm_cell(x_t, h_prev, c_prev):
    # Forget gate - what to forget from cell state
    f_t = sigmoid(W_f @ [h_prev, x_t] + b_f)
    # Input gate - what new info to add
    i_t = sigmoid(W_i @ [h_prev, x_t] + b_i)
    # Candidate values to add
    c_candidate = tanh(W_c @ [h_prev, x_t] + b_c)
    # Update cell state
    c_t = f_t * c_prev + i_t * c_candidate
    # Output gate - what to output
    o_t = sigmoid(W_o @ [h_prev, x_t] + b_o)
    # Hidden state
    h_t = o_t * tanh(c_t)
    return h_t, c_t
Why This Works
The cell state c_t acts like a conveyor belt. Information can flow unchanged across many steps (if forget gate ≈ 1 and input gate ≈ 0).
This creates a gradient highway, solving the vanishing gradient problem.
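A tiny numeric sketch of the conveyor-belt behaviour, using scalar gates instead of vectors (the gate and candidate values are chosen purely for illustration): with the forget gate at 1 and the input gate at 0, the cell state passes through each step unchanged, so the gradient along that path is multiplied by 1 rather than by a shrinking factor.

```python
def cell_update(c_prev, f_t, i_t, c_candidate):
    # LSTM cell-state update: c_t = f_t * c_prev + i_t * c_candidate
    return f_t * c_prev + i_t * c_candidate

c = 3.0
for _ in range(100):
    c = cell_update(c, f_t=1.0, i_t=0.0, c_candidate=0.7)
# c is still exactly 3.0 after 100 steps
```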
Practical Implementation
Using PyTorch:
import torch
import torch.nn as nn
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            batch_first=True,
            bidirectional=True,
        )
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        # x: (batch, seq_len)
        embedded = self.embedding(x)  # (batch, seq_len, embed_dim)
        lstm_out, (hidden, cell) = self.lstm(embedded)
        # lstm_out: (batch, seq_len, hidden*2)
        # Use last hidden state from both directions
        hidden_cat = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return self.fc(hidden_cat)
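If the shapes in the comments above are unfamiliar, a quick check against `nn.LSTM` directly makes them concrete. The sizes here are arbitrary assumptions for the sketch:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16,
               batch_first=True, bidirectional=True)
x = torch.randn(4, 12, 8)   # (batch, seq_len, input_size)
out, (h, c) = lstm(x)
# out: (4, 12, 32) - both directions concatenated per time step
# h:   (2, 4, 16)  - final hidden state for each direction
```

This is why the classifier's `nn.Linear` takes `hidden_dim * 2` inputs: the two directions each contribute `hidden_dim` features.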
Common NLP Tasks with LSTMs
Sentiment Analysis
# Input: "This movie was great"
# Output: Positive (0.92)
model = TextClassifier(vocab_size=10000, embedding_dim=128,
                       hidden_dim=256, num_classes=2)
Named Entity Recognition
# Input: "John works at Google in California"
# Output: [B-PER, O, O, B-ORG, O, B-LOC]
class NERModel(nn.Module):
    def __init__(self, ...):
        self.lstm = nn.LSTM(..., bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_tags)  # Per-token output
Machine Translation (Seq2Seq)
# Encoder-decoder architecture
class Encoder(nn.Module):
    def __init__(self, ...):
        self.lstm = nn.LSTM(...)

    def forward(self, x):
        outputs, (hidden, cell) = self.lstm(x)
        return hidden, cell

class Decoder(nn.Module):
    def __init__(self, ...):
        self.lstm = nn.LSTM(...)

    def forward(self, x, hidden, cell):
        output, (hidden, cell) = self.lstm(x, (hidden, cell))
        return output, hidden, cell
GRUs: A Simpler Alternative
Gated Recurrent Units (GRUs) simplify LSTMs:
- Combine forget and input gates into a single “update gate”
- Merge cell state and hidden state
# Two gates instead of three
def gru_cell(x_t, h_prev):
    z_t = sigmoid(W_z @ [h_prev, x_t])  # Update gate
    r_t = sigmoid(W_r @ [h_prev, x_t])  # Reset gate
    h_candidate = tanh(W @ [r_t * h_prev, x_t])
    h_t = (1 - z_t) * h_prev + z_t * h_candidate
    return h_t
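The pseudocode above runs as-is if you shrink everything to scalars. In this toy version each weight stands in for a whole matrix (the weight values and inputs are assumptions for illustration):

```python
import math

def sigmoid(a):
    return 1 / (1 + math.exp(-a))

def gru_step(x_t, h_prev, w_z=1.0, w_r=1.0, w_h=1.0):
    # Scalar toy GRU: each w_* stands in for a weight matrix
    z = sigmoid(w_z * (h_prev + x_t))           # update gate
    r = sigmoid(w_r * (h_prev + x_t))           # reset gate
    h_cand = math.tanh(w_h * (r * h_prev + x_t))
    # New state interpolates between the old state and the candidate
    return (1 - z) * h_prev + z * h_cand

h = gru_step(0.5, 0.2)
```

Because the update is a convex combination of `h_prev` and a tanh-bounded candidate, the state stays in (-1, 1), and setting z ≈ 0 carries the old state forward unchanged, playing the same role as the LSTM's forget gate.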
GRUs have fewer parameters, train faster, and often match LSTM accuracy. Reach for them when an LSTM feels like overkill.
Limitations
Despite their power, RNNs/LSTMs have inherent limitations:
- Sequential processing: Can’t parallelize across time steps
- Still struggle with very long sequences: 100+ tokens remain challenging
- Fixed representation: Single hidden vector must capture all context
These limitations led to attention mechanisms and eventually Transformers/BERT.
When to Use RNNs/LSTMs Today
With Transformers dominating, when are RNNs still useful?
- Streaming data: Real-time processing where you can’t wait for full sequence
- Resource constraints: Smaller models, less memory
- Simple sequence problems: Short sequences, limited data
- Understanding fundamentals: Foundation for learning attention/Transformers
Final Thoughts
RNNs and LSTMs were revolutionary. They enabled machine translation, speech recognition, and text generation that seemed impossible before.
Understanding them illuminates why Transformers work—they solve LSTM limitations with parallel attention over all positions.
The principles of gating and memory management influence modern architectures. Even if you never implement an LSTM from scratch, the intuitions transfer.
Know where we’ve been to understand where we’re going.