BERT: Pre-trained Language Models Change NLP
Google just dropped a bombshell. BERT (Bidirectional Encoder Representations from Transformers) achieved state-of-the-art results on 11 NLP tasks, often by a significant margin.
This isn’t just another incremental improvement. It’s a paradigm shift.
The Old Way: Task-Specific Models
Traditional NLP required training models from scratch for each task:
- Sentiment analysis? Train a model.
- Question answering? Train another model.
- Named entity recognition? Another one.
Each model needed labeled data, extensive tuning, and task-specific architectures.
The New Way: Pre-train, Then Fine-tune
BERT introduces a two-phase approach:
Phase 1: Pre-training
Train once on massive unlabeled text (Wikipedia, BooksCorpus). Learn general language understanding.
Phase 2: Fine-tuning
Take the pre-trained model, add a simple output layer, fine-tune on your specific task with much less data.
This is transfer learning for NLP, and it works spectacularly.
What Makes BERT Special
Bidirectional Context
Previous models read text left-to-right or right-to-left (ELMo concatenated two such passes, but each pass was still one-directional). BERT conditions on both left and right context simultaneously in every layer.
Consider: “The bank of the river was flooded.”
- Left-to-right sees: “The bank” → Could be financial or river
- Bidirectional sees: “The bank of the river” → Clearly river bank
Masked Language Modeling
To train bidirectionally, BERT uses a clever trick: mask random words and predict them.
```
Input:  "The [MASK] sat on the mat"
Output: "cat" (predicted)
```
This forces the model to understand context from both directions.
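The masking procedure can be sketched in plain Python. Of the tokens selected for prediction (~15%), 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged; this 80/10/10 split is the rule from the BERT paper, while the function name and toy vocabulary here are illustrative.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Apply BERT-style masking: select ~15% of positions; of those,
    80% -> [MASK], 10% -> a random token, 10% -> left unchanged.
    Returns the corrupted sequence and the prediction targets."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}  # position -> original token the model must predict
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        targets[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)
        # else: keep the original token, but still predict it
    return corrupted, targets

tokens = "the cat sat on the mat".split()
corrupted, targets = mask_tokens(tokens, vocab=["dog", "ran", "blue"], seed=1)
```

Keeping 10% of selected tokens unchanged matters: it forces the model to produce a useful representation for every input token, since any position might be a prediction target.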
Next Sentence Prediction
BERT also learns relationships between sentences:
```
Sentence A: "The man went to the store."
Sentence B: "He bought some milk."
Label: IsNext (these sentences follow each other)
```
This helps tasks like question answering where understanding sentence pairs matters.
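The training pairs for this objective come from the corpus itself: 50% of the time sentence B really follows A (IsNext), 50% of the time it is drawn from a different document (NotNext). A simplified sketch (real BERT samples text spans rather than single sentences, and the helper name is illustrative):

```python
import random

def make_nsp_pair(doc, index, corpus, rng=random):
    """Build one next-sentence-prediction example from sentence `index`
    of `doc`: 50% of the time B is the true next sentence (IsNext),
    50% of the time a sentence from a different document (NotNext)."""
    a = doc[index]
    if rng.random() < 0.5 and index + 1 < len(doc):
        return a, doc[index + 1], "IsNext"
    # Sample the distractor from a *different* document, as BERT does.
    other = rng.choice([d for d in corpus if d is not doc])
    return a, rng.choice(other), "NotNext"

doc = ["The man went to the store.", "He bought some milk."]
corpus = [doc, ["Penguins cannot fly.", "They swim instead."]]
a, b, label = make_nsp_pair(doc, 0, corpus, rng=random.Random(3))
```

Excluding the source document from the distractor pool keeps the NotNext label honest: a sentence from the same document might in fact be a plausible continuation.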
The Architecture
BERT uses the Transformer encoder (the left half of “Attention Is All You Need”):
- BERT-Base: 12 layers, 768 hidden, 12 attention heads, 110M parameters
- BERT-Large: 24 layers, 1024 hidden, 16 attention heads, 340M parameters
No decoder needed—BERT is for understanding, not generation.
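The 110M figure for BERT-Base can be reproduced from its configuration (WordPiece vocab 30,522; 512 positions; 2 segment types; hidden size 768; feed-forward size 3,072; 12 layers). A back-of-the-envelope count of the encoder plus pooler, omitting the pre-training heads:

```python
# Back-of-the-envelope parameter count for BERT-Base.
V, P, S = 30522, 512, 2   # vocab size, max positions, segment types
H, F, L = 768, 3072, 12   # hidden size, feed-forward size, layers

embeddings = V * H + P * H + S * H + 2 * H  # token/pos/segment + LayerNorm
attention  = 4 * (H * H + H)                # Q, K, V, output projections
ffn        = (H * F + F) + (F * H + H)      # two dense layers
layer      = attention + ffn + 2 * (2 * H)  # plus two LayerNorms
pooler     = H * H + H

total = embeddings + L * layer + pooler
print(f"{total / 1e6:.1f}M parameters")  # 109.5M, matching the quoted 110M
```

The same arithmetic with H=1024, F=4096, L=24 gives roughly 335M, close to the quoted BERT-Large figure.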
Fine-Tuning for Your Task
The magic of BERT is how simple fine-tuning becomes:
Sentence Classification (Sentiment)
```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# One training step (wrap this in your training loop)
inputs = tokenizer("I love this movie!", return_tensors="pt")
outputs = model(**inputs, labels=torch.tensor([1]))  # 1 = positive
loss = outputs.loss
```
Question Answering
```python
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

question = "What is the capital of France?"
context = "France is a country in Europe. Paris is the capital of France."
# Fine-tune to extract "Paris" from the context
```
Named Entity Recognition
```python
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=9)
# Fine-tune to tag: [O, B-PER, I-PER, B-ORG, I-ORG, ...]
```
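After fine-tuning, the per-token BIO tags still have to be grouped into entity spans. A minimal decoder in plain Python (the function name is illustrative, and this sketch ignores WordPiece subword merging):

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, text) spans.
    A B-X tag opens a span; following I-X tags extend it."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)
        else:  # "O", or an I- tag with no matching open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(toks)) for etype, toks in spans]

tokens = ["Tim", "Cook", "leads", "Apple"]
tags   = ["B-PER", "I-PER", "O", "B-ORG"]
print(bio_to_spans(tokens, tags))  # [('PER', 'Tim Cook'), ('ORG', 'Apple')]
```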
Results That Shocked NLP
BERT crushed previous state-of-the-art:
- SQuAD 1.1: 93.2 F1 (human: 91.2)
- GLUE Benchmark: 80.5 average (previous: 72.8)
- SWAG: 86.3% (previous: 59.5%)
The SWAG improvement is particularly striking—nearly 27 percentage points.
Using Pre-trained BERT
Hugging Face makes it easy:
```shell
pip install transformers
```

```python
from transformers import pipeline

# Sentiment analysis
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Question answering
qa = pipeline("question-answering")
result = qa(
    question="What is BERT?",
    context="BERT is a language model from Google.",
)
# {'answer': 'a language model from Google', ...}
```
Training Cost
Pre-training BERT-Large took:
- 4 days
- 64 TPU chips
- Estimated cost: $6,000-$50,000
This is why pre-training once and fine-tuning many times is so powerful. You amortize the cost across many downstream tasks.
Limitations
BERT isn’t perfect:
- Size: 340M parameters is large for edge devices
- Inference speed: Slower than simpler models
- Max length: 512 tokens limit
- Not generative: BERT understands but doesn’t generate text
For generation tasks, look to GPT-style models.
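One common workaround for the 512-token limit is to slide an overlapping window over a long input and run BERT on each chunk separately. A pure-Python sketch of the chunking step (the window and stride values are illustrative, not prescribed by the paper):

```python
def sliding_windows(token_ids, max_len=512, stride=128):
    """Split a long token sequence into overlapping chunks of at most
    max_len tokens; each chunk starts max_len - stride tokens past the
    previous one, so every token appears in at least one chunk."""
    step = max_len - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
    return chunks

chunks = sliding_windows(list(range(1000)), max_len=512, stride=128)
# 3 chunks covering [0:512], [384:896], [768:1000];
# consecutive chunks share 128 tokens of overlap
```

The overlap matters for span tasks like question answering: an answer that straddles a chunk boundary still falls entirely inside at least one window.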
What This Means for NLP
BERT represents a new era:
- Less task-specific engineering: Focus on data, not architecture
- Lower data requirements: Fine-tuning works with smaller datasets
- Accessible state-of-the-art: Pre-trained models are freely available
- Commoditization of NLP: High-quality understanding is becoming standard
Getting Started
- Start with Hugging Face pipelines for quick prototypes
- Fine-tune on your task with the Trainer API
- Explore specialized models (DistilBERT for speed, RoBERTa for accuracy)
- Consider task-specific heads and training strategies
The era of pre-training has begun. Don't train from scratch.