Natural Language Processing with RNNs and LSTMs

ai machine-learning nlp

Before BERT and GPT changed everything, Recurrent Neural Networks (RNNs) and their evolved form, Long Short-Term Memory networks (LSTMs), were the dominant architectures for NLP. Understanding them provides crucial context for modern approaches.

Why Sequences Need Special Treatment

Traditional neural networks assume inputs are independent. But language is inherently sequential—the meaning of a word depends on context.

“The bank was flooded” could mean:

  - a financial institution’s building was under water, or
  - the side of a river overflowed.

The preceding words determine the meaning. We need models that understand sequence and context.

Recurrent Neural Networks (RNNs)

RNNs process sequences by maintaining a hidden state that gets updated at each step.

# Pseudocode for RNN
hidden_state = initial_state

for word in sentence:
    hidden_state = update(hidden_state, word)
    output = compute_output(hidden_state)

The hidden state is the network’s “memory”—it carries information from previous steps.
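A runnable sketch of this loop in plain numpy (the sizes and random weights are illustrative assumptions; in practice the weights are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size, output_size = 4, 3, 2  # illustrative sizes

W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_step(h_prev, x_t):
    """One recurrent update: mix the previous state with the new input."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b)
    y_t = W_hy @ h_t
    return h_t, y_t

h = np.zeros(hidden_size)                    # initial_state
sentence = rng.normal(size=(5, input_size))  # five stand-in "word vectors"
for x_t in sentence:
    h, y = rnn_step(h, x_t)                  # hidden state carries the memory

print(h.shape, y.shape)
```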

The Math

At each time step t:

h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b)
y_t = W_hy * h_t

Where:

  - h_t is the hidden state at time step t
  - x_t is the input (e.g., a word embedding) at time step t
  - y_t is the output at time step t
  - W_hh, W_xh, W_hy are learned weight matrices
  - b is a learned bias vector

The Vanishing Gradient Problem

RNNs struggle with long sequences. During backpropagation through time, gradients either:

  - vanish: shrink exponentially toward zero, so early time steps receive almost no learning signal, or
  - explode: grow exponentially and destabilize training.

This makes it hard to learn long-range dependencies. The network “forgets” early tokens.
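A small numeric illustration of the vanishing case (toy sizes, random stand-in activations): backpropagation through time multiplies the gradient by the recurrent Jacobian at every step, and when that Jacobian is small the gradient shrinks geometrically:

```python
import numpy as np

rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.15, size=(4, 4))  # small recurrent weights

grad = np.ones(4)        # gradient arriving at the last time step
norms = []
for t in range(50):
    h = np.tanh(rng.normal(size=4))          # stand-in hidden activations
    grad = W_hh.T @ (grad * (1 - h ** 2))    # chain rule through one step
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])   # the norm collapses toward zero
```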

Long Short-Term Memory (LSTM)

LSTMs solve the vanishing gradient problem with a more sophisticated cell structure.

The Key Innovation: Gates

LSTMs have three gates that control information flow:

  1. Forget Gate: What to discard from memory
  2. Input Gate: What new information to add
  3. Output Gate: What to output from memory

Plus a cell state that carries information across long distances.

# LSTM pseudocode
def lstm_cell(x_t, h_prev, c_prev):
    # Forget gate - what to forget from cell state
    f_t = sigmoid(W_f @ [h_prev, x_t] + b_f)
    
    # Input gate - what new info to add
    i_t = sigmoid(W_i @ [h_prev, x_t] + b_i)
    
    # Candidate values to add
    c_candidate = tanh(W_c @ [h_prev, x_t] + b_c)
    
    # Update cell state
    c_t = f_t * c_prev + i_t * c_candidate
    
    # Output gate - what to output
    o_t = sigmoid(W_o @ [h_prev, x_t] + b_o)
    
    # Hidden state
    h_t = o_t * tanh(c_t)
    
    return h_t, c_t

Why This Works

The cell state c_t acts like a conveyor belt. Information can flow unchanged across many steps (if forget gate ≈ 1 and input gate ≈ 0).

This creates a gradient highway, solving the vanishing gradient problem.
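A toy demonstration of the conveyor belt: with the forget gate saturated at 1 and the input gate at 0, the update c_t = f_t * c_prev + i_t * candidate leaves the cell state untouched no matter how many steps pass:

```python
import numpy as np

rng = np.random.default_rng(0)
c = np.array([0.7, -1.2, 3.0])      # cell state carrying information
for _ in range(100):
    f_t, i_t = 1.0, 0.0             # forget gate open, input gate closed
    candidate = rng.normal(size=3)  # whatever the current input proposes
    c = f_t * c + i_t * candidate   # state flows through unchanged

print(c)  # still [0.7, -1.2, 3.0]
```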

Practical Implementation

Using PyTorch:

import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(
            embedding_dim, 
            hidden_dim, 
            batch_first=True,
            bidirectional=True
        )
        self.fc = nn.Linear(hidden_dim * 2, num_classes)
    
    def forward(self, x):
        # x: (batch, seq_len)
        embedded = self.embedding(x)  # (batch, seq_len, embed_dim)
        
        lstm_out, (hidden, cell) = self.lstm(embedded)
        # lstm_out: (batch, seq_len, hidden*2)
        
        # Use last hidden state from both directions
        hidden_cat = torch.cat([hidden[-2], hidden[-1]], dim=1)
        
        return self.fc(hidden_cat)
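A standalone shape check mirroring the classifier above (the vocabulary and layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(1000, 32)
lstm = nn.LSTM(32, 64, batch_first=True, bidirectional=True)
fc = nn.Linear(64 * 2, 2)

tokens = torch.randint(0, 1000, (8, 20))  # batch of 8 sequences, 20 tokens each
out, (hidden, cell) = lstm(emb(tokens))
# hidden: (num_layers * num_directions, batch, hidden) = (2, 8, 64)
hidden_cat = torch.cat([hidden[-2], hidden[-1]], dim=1)  # (8, 128)
logits = fc(hidden_cat)
print(logits.shape)  # torch.Size([8, 2])
```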

Common NLP Tasks with LSTMs

Sentiment Analysis

# Input: "This movie was great"
# Output: Positive (0.92)
model = TextClassifier(vocab_size=10000, embedding_dim=128, 
                       hidden_dim=256, num_classes=2)

Named Entity Recognition

# Input: "John works at Google in California"
# Output: [B-PER, O, O, B-ORG, O, B-LOC]
class NERModel(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.lstm = nn.LSTM(..., bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_tags)  # Per-token output
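Filled out, a minimal per-token tagger might look like this (the class name and sizes are my own; the key point is that the linear head is applied at every position of lstm_out, not just the final state):

```python
import torch
import torch.nn as nn

class NERTagger(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_tags):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_tags)

    def forward(self, x):
        lstm_out, _ = self.lstm(self.embedding(x))  # (batch, seq_len, hidden*2)
        return self.fc(lstm_out)                    # one tag score vector per token

tagger = NERTagger(vocab_size=1000, embedding_dim=32, hidden_dim=64, num_tags=6)
tags = tagger(torch.randint(0, 1000, (4, 10)))
print(tags.shape)  # torch.Size([4, 10, 6])
```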

Machine Translation (Seq2Seq)

# Encoder-decoder architecture
class Encoder(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.lstm = nn.LSTM(...)
    
    def forward(self, x):
        outputs, (hidden, cell) = self.lstm(x)
        return hidden, cell

class Decoder(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.lstm = nn.LSTM(...)
    
    def forward(self, x, hidden, cell):
        output, (hidden, cell) = self.lstm(x, (hidden, cell))
        return output, hidden, cell
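Wiring the two together, the encoder’s final (hidden, cell) pair seeds the decoder. A standalone sketch with embeddings already applied and arbitrary sizes:

```python
import torch
import torch.nn as nn

enc = nn.LSTM(16, 32, batch_first=True)   # encoder
dec = nn.LSTM(16, 32, batch_first=True)   # decoder

src = torch.randn(2, 7, 16)               # source sequence, already embedded
_, (hidden, cell) = enc(src)              # compress the source into one state

tgt = torch.randn(2, 5, 16)               # target sequence (teacher forcing)
out, _ = dec(tgt, (hidden, cell))         # decoder starts from the encoder state
print(out.shape)  # torch.Size([2, 5, 32])
```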

GRUs: A Simpler Alternative

Gated Recurrent Units (GRUs) simplify LSTMs:

# Two gates instead of three
def gru_cell(x_t, h_prev):
    z_t = sigmoid(W_z @ [h_prev, x_t] + b_z)  # Update gate
    r_t = sigmoid(W_r @ [h_prev, x_t] + b_r)  # Reset gate
    
    h_candidate = tanh(W @ [r_t * h_prev, x_t] + b)
    h_t = (1 - z_t) * h_prev + z_t * h_candidate
    
    return h_t

GRUs train faster and often match LSTM performance. Use them when an LSTM seems like overkill.
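The saving is easy to verify by counting parameters: an LSTM layer learns four weight/bias sets (three gates plus the candidate), a GRU learns three (two gates plus the candidate), so a GRU layer is exactly three quarters the size:

```python
import torch.nn as nn

lstm = nn.LSTM(128, 256)
gru = nn.GRU(128, 256)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(gru) / count(lstm))  # 0.75
```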

Limitations

Despite their power, RNNs/LSTMs have inherent limitations:

  1. Sequential processing: Can’t parallelize across time steps
  2. Still struggle with very long sequences: 100+ tokens remain challenging
  3. Fixed representation: Single hidden vector must capture all context

These limitations led to attention mechanisms and eventually Transformers/BERT.

When to Use RNNs/LSTMs Today

With Transformers dominating, when are RNNs still useful?

  1. Streaming and online inference: they consume tokens one at a time with constant memory, which suits real-time audio and sensor streams
  2. Resource-constrained deployment: compact recurrent models run well on edge devices
  3. Small datasets and short sequences: a small LSTM can be competitive when there is too little data to train a large Transformer

Final Thoughts

RNNs and LSTMs were revolutionary. They enabled machine translation, speech recognition, and text generation that seemed impossible before.

Understanding them illuminates why Transformers work—they solve LSTM limitations with parallel attention over all positions.

The principles of gating and memory management influence modern architectures. Even if you never implement an LSTM from scratch, the intuitions transfer.


Know where we’ve been to understand where we’re going.
