# Hugging Face Transformers: The New Standard Library

Tags: ai, machine-learning, nlp, transformers

Before Hugging Face, using BERT meant reading papers, implementing architectures, and debugging for days. Now it's `pip install` and three lines of code.
## The Hugging Face Ecosystem

- **Transformers**: Pre-trained models for NLP, vision, and audio
- **Datasets**: One-line access to hundreds of datasets
- **Tokenizers**: Fast tokenization backed by Rust
- **Hub**: Share and discover models
## Getting Started

```bash
pip install transformers
```

```python
from transformers import pipeline

# Sentiment analysis
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Question answering
qa = pipeline("question-answering")
result = qa(
    question="What is the capital of France?",
    context="France is a country in Europe. Paris is its capital."
)
# {'answer': 'Paris', 'score': 0.98, ...}
```
That’s it. State-of-the-art NLP in three lines.
## Available Pipelines

```python
from transformers import pipeline

# Text classification
classifier = pipeline("text-classification")
classifier("I hate Mondays")

# Named entity recognition
ner = pipeline("ner", grouped_entities=True)
ner("Hugging Face is based in New York City")

# Text generation
generator = pipeline("text-generation")
generator("Once upon a time,", max_length=50)

# Summarization
summarizer = pipeline("summarization")
summarizer(long_text, max_length=130, min_length=30)

# Translation
translator = pipeline("translation_en_to_fr")
translator("Hello, how are you?")

# Fill-mask (BERT-style checkpoints use the [MASK] token)
unmasker = pipeline("fill-mask", model="bert-base-uncased")
unmasker("Paris is the [MASK] of France.")

# Zero-shot classification
classifier = pipeline("zero-shot-classification")
classifier(
    "This is about cooking pasta",
    candidate_labels=["sports", "cooking", "technology"]
)
```
## Using Specific Models

```python
from transformers import AutoTokenizer, AutoModel

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize
inputs = tokenizer("Hello, world!", return_tensors="pt")

# Forward pass
outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
```
### For Classification Tasks

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer("I love transformers!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)  # [negative_prob, positive_prob]
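To turn those probabilities into a readable label, the model config carries an `id2label` mapping. A small sketch with the same SST-2 checkpoint:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

inputs = tokenizer("I love transformers!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The config stores human-readable class names; map the argmax index to one
pred_id = logits.argmax(dim=-1).item()
label = model.config.id2label[pred_id]
print(label)  # POSITIVE
```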
Fine-Tuning Your Own Model
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
Trainer,
TrainingArguments
)
from datasets import load_dataset
# Load dataset
dataset = load_dataset("imdb")
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased",
num_labels=2
)
# Tokenize dataset
def tokenize(batch):
return tokenizer(batch["text"], padding=True, truncation=True)
tokenized = dataset.map(tokenize, batched=True)
# Training arguments
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=16,
evaluation_strategy="epoch",
)
# Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["test"],
)
trainer.train()
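Once training finishes, the model can be saved locally and reloaded like any Hub checkpoint. A minimal sketch of the save/reload round trip, using an un-tuned DistilBERT as a stand-in for your trainer's model and a hypothetical `./my-imdb-classifier` path:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Stand-in for a freshly fine-tuned checkpoint; substitute your trained model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

save_dir = "./my-imdb-classifier"  # hypothetical local path
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# Reload the saved directory as a ready-to-use pipeline
classifier = pipeline("text-classification", model=save_dir)
result = classifier("This movie was fantastic!")
```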
## Model Hub

Browse and download from 100,000+ models:

```python
from transformers import AutoModel

# Specific model from the Hub
model = AutoModel.from_pretrained("microsoft/deberta-v3-base")

# Community model
model = AutoModel.from_pretrained("facebook/bart-large-cnn")

# Your uploaded model
model = AutoModel.from_pretrained("username/my-fine-tuned-model")
```
### Sharing Models

```python
from huggingface_hub import login

# Login (prompts for an access token)
login()

# Push to the Hub
model.push_to_hub("my-awesome-model")
tokenizer.push_to_hub("my-awesome-model")
```
## Working with Datasets

```python
from datasets import load_dataset

# Load common datasets
dataset = load_dataset("squad")
dataset = load_dataset("glue", "mrpc")
dataset = load_dataset("csv", data_files="data.csv")

# Explore
print(dataset["train"][0])
print(dataset["train"].features)

# Split
train_test = dataset["train"].train_test_split(test_size=0.2)
```
## Performance Tips

### Use Smaller Models

```python
# Instead of bert-base (110M params)
model = AutoModel.from_pretrained("distilbert-base-uncased")  # 66M

# Even smaller
model = AutoModel.from_pretrained("prajjwal1/bert-tiny")  # 4M
```

### Enable Fast Tokenizers

```python
# Fast (Rust-backed) tokenizers are the default when available
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
```

### Use GPU

```python
model = AutoModel.from_pretrained("bert-base-uncased").to("cuda")
inputs = tokenizer(text, return_tensors="pt").to("cuda")
```

### Batch Processing

```python
texts = ["First text", "Second text", "Third text"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
```
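The padded batch can then go through the model in a single forward pass, one prediction per input. A sketch using the SST-2 checkpoint from earlier:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

texts = ["First text", "Second text", "Third text"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (3, 2): one row per text

labels = [model.config.id2label[i] for i in logits.argmax(dim=-1).tolist()]
```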
## Common Model Architectures
| Task | Recommended Models |
|---|---|
| Text Classification | DistilBERT, RoBERTa, DeBERTa |
| Question Answering | RoBERTa, DeBERTa, ELECTRA |
| Summarization | BART, T5, Pegasus |
| Translation | mBART, MarianMT, T5 |
| Generation | GPT-2, T5, BLOOM |
| Named Entity Recognition | RoBERTa, LUKE, DeBERTa |
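Any of these recommendations can be used by passing a concrete Hub checkpoint name to a pipeline. A sketch with BART for summarization (the input text here is just a placeholder):

```python
from transformers import pipeline

# Pick a concrete checkpoint from the recommended family
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = (
    "Hugging Face Transformers provides pre-trained models for NLP, vision, "
    "and audio. Pipelines wrap tokenization, inference, and post-processing "
    "so state-of-the-art models can be used in a few lines of code."
)
summary = summarizer(text, max_length=40, min_length=10)[0]["summary_text"]
```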
## Error Handling

```python
from transformers import AutoTokenizer

# Fall back to a known-good checkpoint if loading fails
try:
    tokenizer = AutoTokenizer.from_pretrained("nonexistent-model")
except OSError:
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```
## Final Thoughts

Hugging Face democratized NLP. What took weeks now takes hours. What required expertise now requires `pip install`.

Start with pipelines for quick prototypes. Move to fine-tuning when you need customization. The Hub has a model for almost anything.

NLP for everyone.