# Introduction to MLOps: DevOps for AI

*Tags: ai, machine-learning, mlops, devops*
A model that works in a notebook is a prototype. A model that works in production is a product. The gap between them is MLOps.
## The ML Lifecycle

```
Data → Prepare → Train → Evaluate → Deploy → Monitor → Repeat
  ↑                                             │
  └─────────────────────────────────────────────┘
```

Every step needs automation, versioning, and monitoring.
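The loop above can be sketched as a minimal orchestration driver. The stage functions here are illustrative stand-ins, not a real training routine:

```python
# Minimal sketch of the ML lifecycle as an automated loop.
# Each stage is a plain function; a real pipeline swaps these
# for containers or orchestrator tasks.

def prepare(raw):
    # Clean the data (illustrative: drop missing records)
    return [r for r in raw if r is not None]

def train(data):
    # Stand-in "model": the mean of the training data
    return sum(data) / len(data)

def evaluate(model, test):
    # Stand-in metric: mean absolute error against a test set
    mae = sum(abs(model - t) for t in test) / len(test)
    return {"mae": mae}

def run_pipeline(raw, test):
    data = prepare(raw)
    model = train(data)
    metrics = evaluate(model, test)
    return model, metrics

model, metrics = run_pipeline([1, 2, None, 3], [2, 2])
```

Real orchestrators (Kubeflow, Airflow) run the same shape of graph, but with each stage versioned, containerized, and monitored.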
## Why MLOps is Different
### Code vs Model Versioning

Traditional:

```bash
git commit -m "Updated payment logic"
```

ML:

- Model version
- Training data version
- Feature engineering code
- Hyperparameters
- Training environment

All must be reproducible.
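One way to make "all of it" concrete is to fingerprint every input to a training run, so any model can be traced back to exactly what produced it. A minimal sketch (fields and hashing scheme are illustrative):

```python
# Sketch: hash every input to a training run into a single
# reproducibility fingerprint. Fields are illustrative.
import hashlib
import json

def fingerprint(data_bytes: bytes, feature_code: str,
                hyperparams: dict, environment: str) -> str:
    record = {
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "feature_code_sha256": hashlib.sha256(feature_code.encode()).hexdigest(),
        "hyperparams": hyperparams,
        "environment": environment,
    }
    # Canonical JSON (sorted keys) so the same inputs always hash the same
    blob = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

run_id = fingerprint(
    b"age,amount\n34,120\n",
    "def featurize(x): ...",
    {"learning_rate": 0.01, "epochs": 100},
    "python3.11-sklearn1.4",
)
```

Change any one input and the fingerprint changes, which is exactly the property reproducibility needs.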
### Testing is Harder

Traditional: unit tests, integration tests.

ML adds:

- Data quality tests
- Model performance tests
- Bias/fairness tests
- A/B tests in production
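A data quality test looks much like a unit test, but runs against incoming records instead of code. A minimal sketch with an illustrative schema and thresholds:

```python
# Sketch of a data quality gate: validate records before training.
# Field names and plausibility ranges are illustrative.
def check_data_quality(rows):
    errors = []
    for i, row in enumerate(rows):
        if row.get("amount") is None:
            errors.append(f"row {i}: missing amount")
        elif row["amount"] < 0:
            errors.append(f"row {i}: negative amount")
        if row.get("age") is not None and not (0 <= row["age"] <= 120):
            errors.append(f"row {i}: implausible age")
    return errors

good = [{"amount": 10.0, "age": 34}]
bad = [{"amount": -5.0, "age": 200}]
```

In practice this runs as a pipeline stage that fails the run (rather than silently training on bad data) when errors are found.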
### Monitoring is Different

Traditional: uptime, latency, errors.

ML adds model-specific signals:

- Prediction distribution (prediction drift)
- Feature distribution (data drift)
- Model performance (degradation)
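One common drift signal is the Population Stability Index (PSI), which compares the binned distribution of a feature in production against the training reference. A self-contained sketch (bin count and the usual 0.2 alert threshold are conventions, not fixed rules):

```python
# Sketch: Population Stability Index (PSI) as a simple drift signal
# between a reference (training) sample and a current (production) one.
import math

def psi(ref, cur, bins=10):
    lo, hi = min(ref), max(ref)
    width = (hi - lo) / bins or 1.0

    def proportions(xs):
        counts = [0] * bins
        for x in xs:
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Small epsilon avoids log(0) for empty bins
        return [(c + 1e-6) / (len(xs) + 1e-6 * bins) for c in counts]

    p, q = proportions(ref), proportions(cur)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical distributions -> PSI near 0; a shifted one -> PSI grows.
same = psi([1, 2, 3, 4, 5] * 20, [1, 2, 3, 4, 5] * 20)
shifted = psi([1, 2, 3, 4, 5] * 20, [4, 5, 6, 7, 8] * 20)
```

A common rule of thumb treats PSI above roughly 0.2 as drift worth investigating; monitoring tools like Evidently compute richer versions of the same idea.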
## Core MLOps Components
### Data Versioning

```bash
# DVC (Data Version Control)
dvc init
dvc add data/training.csv
git add data/training.csv.dvc
git commit -m "Add training data v1"

# Track data alongside code: push the data itself to remote storage
dvc push
```
### Experiment Tracking

```python
import mlflow

mlflow.set_experiment("fraud_detection")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("epochs", 100)

    model = train(...)

    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_metric("f1_score", 0.87)
    mlflow.sklearn.log_model(model, "model")
```
### Feature Stores

A central repository for features, so training and serving use the same definitions:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Get point-in-time-correct features for training
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_features:age", "user_features:purchase_count"],
).to_df()

# Get fresh features for real-time inference
online_features = store.get_online_features(
    features=["user_features:age", "user_features:purchase_count"],
    entity_rows=[{"user_id": 123}],
).to_dict()
```
### Model Registry

```python
import mlflow

# Register the model from a completed run
mlflow.register_model(
    "runs:/abc123/model",
    "fraud_detector",
)

# Promote a version through lifecycle stages
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="fraud_detector",
    version=3,
    stage="Production",
)
```
## The Pipeline

```yaml
# Example: Kubeflow-style pipeline (illustrative, not literal syntax)
stages:
  - name: data-prep
    container: preprocess:v1
    inputs: [raw_data]
    outputs: [cleaned_data]
  - name: train
    container: train:v1
    inputs: [cleaned_data, params]
    outputs: [model]
  - name: evaluate
    container: evaluate:v1
    inputs: [model, test_data]
    outputs: [metrics]
  - name: deploy
    container: deploy:v1
    inputs: [model]
    condition: metrics.accuracy > 0.90
```
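The `condition` on the deploy stage is a quality gate: the model ships only if evaluation metrics clear their thresholds. The gate logic itself is simple, as this sketch shows (metric names and floors are illustrative):

```python
# Sketch of a deployment gate: promote only when every metric
# clears its configured floor; report what blocked it otherwise.
def should_deploy(metrics, thresholds):
    failures = {
        name: metrics.get(name)
        for name, floor in thresholds.items()
        if metrics.get(name, float("-inf")) < floor
    }
    return len(failures) == 0, failures

ok, why = should_deploy(
    {"accuracy": 0.93, "f1": 0.88},
    {"accuracy": 0.90, "f1": 0.85},
)
blocked, why_not = should_deploy(
    {"accuracy": 0.87, "f1": 0.88},
    {"accuracy": 0.90, "f1": 0.85},
)
```

Returning the failing metrics, not just a boolean, matters in practice: the pipeline log should say *why* a deployment was blocked.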
## Monitoring in Production

### Data Drift

```python
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

report = Report(metrics=[DataDriftPreset()])
report.run(
    reference_data=training_data,
    current_data=production_data,
)

# Alert if drift was detected across the dataset
if report.as_dict()["metrics"][0]["result"]["dataset_drift"]:
    alert("Data drift detected!")
```
### Model Performance

Ground truth usually arrives later than predictions, so log both and evaluate on a schedule:

```python
# Log predictions now; compare against ground truth later
def predict_and_log(features, model):
    prediction = model.predict(features)
    mlflow.log_metric("prediction_count", 1)
    store_for_ground_truth_comparison(features, prediction)
    return prediction

# Scheduled job: evaluate once ground truth is available
def evaluate_recent_predictions():
    predictions = get_recent_predictions()
    ground_truth = get_ground_truth(predictions)
    accuracy = calculate_accuracy(predictions, ground_truth)
    if accuracy < threshold:
        trigger_retraining()
```
## Tool Landscape

### End-to-End Platforms

- **Vertex AI** (Google): full managed MLOps
- **SageMaker** (AWS): training to deployment
- **Azure ML**: Microsoft ecosystem
### Open Source Stack
| Component | Tools |
|---|---|
| Data versioning | DVC, LakeFS |
| Experiment tracking | MLflow, Weights & Biases |
| Feature store | Feast, Tecton |
| Pipeline orchestration | Kubeflow, Airflow |
| Model serving | Seldon, KServe, MLflow |
| Monitoring | Evidently, Arize, WhyLabs |
## Getting Started

### Level 0: Manual

- Training in notebooks
- Manual deployment
- No versioning

### Level 1: ML Pipeline

- Automated training pipeline
- Model versioning
- Basic monitoring

### Level 2: CI/CD for ML

- Automated retraining on new data
- Automated deployment with gates
- A/B testing

### Level 3: Full Automation

- Continuous training
- Continuous monitoring
- Automatic drift detection and retraining
## Best Practices

- **Version everything**: data, code, models, configs
- **Reproducible training**: same inputs → same model
- **Test data quality**: garbage in, garbage out
- **Monitor relentlessly**: models degrade silently
- **Automate carefully**: understand before automating
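"Same inputs → same model" starts with pinning every source of randomness. Real frameworks have their own seeds to set (NumPy, PyTorch, CUDA determinism flags); this toy sketch shows the principle with just the standard library:

```python
# Sketch: seeded, isolated RNG makes a toy "training" run repeatable.
import random

def train_toy_model(data, seed=42):
    rng = random.Random(seed)  # isolated RNG, not the global one
    # Random init, as real training would have
    weights = [rng.gauss(0, 1) for _ in data]
    # Stand-in "training": nudge weights toward the data
    return [w + 0.1 * x for w, x in zip(weights, data)]

a = train_toy_model([1.0, 2.0, 3.0])
b = train_toy_model([1.0, 2.0, 3.0])  # identical, by construction
```

Using `random.Random(seed)` rather than seeding the global RNG keeps the run reproducible even when other code also draws random numbers.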
## Final Thoughts
MLOps bridges the gap between data science and production. Without it, models rot in notebooks.
Start simple: version data, track experiments, monitor predictions. Add automation as you mature.
The goal is reliable, maintainable ML systems. Not just working models.
Models don’t ship themselves. MLOps does.