# Stable Diffusion: Open Sourcing Creativity

*ai · machine-learning*
Stable Diffusion changed everything. High-quality AI image generation, running on consumer hardware, with open weights. Here’s what it means.
## What’s Different
| Model | Access | Hardware | License |
|---|---|---|---|
| DALL-E 2 | API only | OpenAI servers | Closed |
| Midjourney | Discord bot | Their servers | Closed |
| Stable Diffusion | Open weights | Your GPU | Open |
You can run Stable Diffusion on a gaming GPU. You own the outputs. You can modify the model.
## How It Works

### Latent Diffusion
```text
Text prompt: "A sunset over mountains, oil painting"
        ↓
Text encoder (CLIP)
        ↓
Encoded text embedding
        ↓
Diffusion process (U-Net)
[Start with noise] → [Iteratively denoise] → [Latent image]
        ↓
VAE Decoder
        ↓
Final image (512x512)
```
**Key insight:** work in a compressed “latent space” instead of pixel space. Much more efficient.
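A quick back-of-the-envelope comparison makes the savings concrete (my arithmetic, using the 1×4×64×64 latent shape from the sampling sketch in the next section):

```python
# Pixel space: a 512x512 RGB image
pixel_values = 512 * 512 * 3   # 786,432 values

# Latent space: the 4-channel 64x64 tensor the U-Net actually denoises
latent_values = 64 * 64 * 4    # 16,384 values

ratio = pixel_values / latent_values
print(f"Latent space is {ratio:.0f}x smaller")  # prints: Latent space is 48x smaller
```

Every denoising step operates on 48× fewer values, which is a big part of why this fits on a gaming GPU.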
### The Diffusion Process
```python
import torch

# Simplified diffusion sampling. encode_text, denoise_step, and
# decode_latent stand in for the real CLIP, scheduler, and VAE calls.
def sample(model, prompt, steps=50):
    # Start with pure noise in latent space (4 channels, 64x64)
    latent = torch.randn(1, 4, 64, 64)

    # Get the text embedding for conditioning
    text_embedding = encode_text(prompt)

    # Iteratively denoise, from the noisiest timestep down to 0
    for t in reversed(range(steps)):
        # Predict the noise present at this timestep
        noise_pred = model(latent, t, text_embedding)
        # Remove a portion of the predicted noise
        latent = denoise_step(latent, noise_pred, t)

    # Decode the latent back into a 512x512 image with the VAE
    image = decode_latent(latent)
    return image
```
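To build intuition for why the loop converges, here is a toy scalar version with a "perfect" noise predictor. Everything in it is illustrative, not part of Stable Diffusion:

```python
def toy_sample(x0, noise, steps=50):
    """Toy scalar diffusion: start from a noisy value and iteratively
    remove the (perfectly predicted) noise, mirroring the loop above."""
    x = x0 + noise  # fully noised starting point
    for t in reversed(range(steps)):
        noise_pred = x - x0           # a real model would estimate this
        x = x - noise_pred / (t + 1)  # remove one "slice" of the noise
    return x

print(toy_sample(3.0, 5.0))  # prints 3.0: the original value recovered
```

Each iteration removes a fraction of the remaining noise, and the final step (t = 0) removes all of it, which is the same shape of computation the real sampler performs in latent space.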
## Running Locally

### Requirements
- GPU with 8GB+ VRAM (4GB with optimizations)
- Python 3.8+
- CUDA or MPS (Mac)
### Setup
```shell
# Clone the repository
git clone https://github.com/CompVis/stable-diffusion
cd stable-diffusion

# Create environment
conda env create -f environment.yaml
conda activate ldm

# Download weights (requires a Hugging Face account) and place them at:
#   models/ldm/stable-diffusion-v1/model.ckpt
```
### Basic Generation
```python
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "A serene Japanese garden with cherry blossoms, digital art"
image = pipe(prompt).images[0]
image.save("garden.png")
```
### With Parameters
```python
image = pipe(
    prompt="A cyberpunk city at night, neon lights, rain",
    negative_prompt="blurry, low quality, distorted",
    num_inference_steps=50,
    guidance_scale=7.5,
    height=768,
    width=512,
).images[0]
```
## Key Parameters

### Guidance Scale (CFG)
How closely to follow the prompt:
| Value | Effect |
|---|---|
| 1-3 | Creative, may ignore prompt |
| 7-8 | Balanced (recommended) |
| 15+ | Literal, may look artificial |
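Under the hood, classifier-free guidance runs two noise predictions per step, one conditioned on the prompt and one unconditional, then extrapolates between them. A minimal sketch of that combination (the numbers are made up for illustration):

```python
def apply_cfg(uncond_pred, cond_pred, guidance_scale):
    """Classifier-free guidance: push the prediction away from the
    unconditional output, toward the prompt-conditioned one."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(uncond_pred, cond_pred)]

# Made-up per-element noise predictions
uncond = [0.2, -0.1]
cond = [0.4, 0.3]

# Scale 1.0 reproduces the conditional prediction; larger scales
# exaggerate the difference, which is why high CFG looks "literal"
print(apply_cfg(uncond, cond, 7.5))
```

This is why very high values overshoot: the formula amplifies whatever direction the prompt pushes the prediction in.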
### Steps
More steps = more refined, but diminishing returns:
| Steps | Quality | Time |
|---|---|---|
| 20 | Decent | Fast |
| 50 | Good | Moderate |
| 100+ | Marginal improvement | Slow |
### Seed
```python
# Reproducible results: the same seed with the same prompt and
# settings regenerates the same image
generator = torch.Generator("cuda").manual_seed(42)
image = pipe(prompt, generator=generator).images[0]
```
## Advanced Techniques

### Img2Img
```python
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(...)

init_image = Image.open("sketch.png").convert("RGB")
image = pipe(
    prompt="A detailed oil painting of a landscape",
    image=init_image,
    strength=0.75,  # how much to change (0 = keep input, 1 = replace entirely)
).images[0]
```
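`strength` also determines how many denoising steps actually run: the input is noised partway into the schedule, and denoising resumes from there. Roughly (a simplification of the library's timestep logic, not its exact code):

```python
def img2img_steps(num_inference_steps, strength):
    """Approximate number of denoising steps img2img actually runs:
    the input is noised to `strength` of the schedule, then denoised
    from that point back to a clean image."""
    return min(int(num_inference_steps * strength), num_inference_steps)

print(img2img_steps(50, 0.75))  # prints 37
print(img2img_steps(50, 0.2))   # prints 10
```

So low strength is both gentler and faster, since most of the schedule is skipped.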
### Inpainting
```python
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(...)

image = Image.open("photo.png")
mask = Image.open("mask.png")  # white = area to regenerate

result = pipe(
    prompt="A cat sitting on the chair",
    image=image,
    mask_image=mask,
).images[0]
```
### Prompt Engineering
```python
# Basic
"a mountain landscape"

# Better
("a majestic mountain landscape at sunset, dramatic lighting, "
 "photorealistic, 8k, trending on artstation")

# Even better (with style references)
("a majestic mountain landscape at sunset, dramatic lighting, "
 "in the style of Albert Bierstadt, oil painting, "
 "high detail, museum quality")

# Negative prompts help too
negative_prompt = ("blurry, low quality, cartoon, anime, "
                   "watermark, text, signature")
```
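When iterating on prompts, a tiny helper to assemble them from parts keeps experiments organized. This is a hypothetical convenience function, not part of diffusers or any library:

```python
def build_prompt(subject, *modifiers):
    """Join a subject with style/quality modifiers into one prompt string."""
    return ", ".join([subject, *modifiers])

prompt = build_prompt(
    "a majestic mountain landscape at sunset",
    "dramatic lighting",
    "oil painting",
    "high detail",
)
print(prompt)
# prints: a majestic mountain landscape at sunset, dramatic lighting, oil painting, high detail
```

Keeping subject and modifiers separate makes it easy to swap styles while holding the scene constant.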
## Hardware Optimization

### Low VRAM Mode
```python
pipe.enable_attention_slicing()   # trade speed for memory
pipe.enable_vae_slicing()         # decode the image in slices
pipe.enable_model_cpu_offload()   # move unused components to CPU
```
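To see why half precision and offloading matter, here is a rough weight-memory estimate. The parameter counts are approximate, commonly cited figures for SD v1, and weights are only part of the VRAM budget (activations add more):

```python
# Approximate parameter counts for Stable Diffusion v1 components
params = {
    "unet": 860_000_000,
    "text_encoder": 123_000_000,
    "vae": 83_000_000,
}

total = sum(params.values())
fp32_gb = total * 4 / 1e9  # 4 bytes per parameter
fp16_gb = total * 2 / 1e9  # 2 bytes per parameter

print(f"fp32 weights: ~{fp32_gb:.1f} GB, fp16: ~{fp16_gb:.1f} GB")
# prints: fp32 weights: ~4.3 GB, fp16: ~2.1 GB
```

Loading in fp16 alone halves the weight footprint, which is why it is the default advice for 8GB cards.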
### Half Precision
```python
# Load weights in fp16 (halves memory use)
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    revision="fp16",
)
```
### On Mac (MPS)
```python
pipe = pipe.to("mps")
# May need specific settings on M1/M2
```
## The Ecosystem

### UIs
- Automatic1111 WebUI: Feature-rich, community favorite
- ComfyUI: Node-based, for power users
- InvokeAI: Clean, professional feel
### Models and Fine-tunes
- Stable Diffusion 2.x: Improved, different style
- SDXL: Higher resolution, better quality
- DreamBooth: Fine-tune on your own images
- LoRAs: Lightweight adaptation layers
## Ethical Considerations

### The Good
- Democratizes access to creative tools
- Anyone can experiment and learn
- Enables new forms of expression
### The Concerning
- Training data consent unclear
- Can mimic specific artists’ styles
- Potential for misuse (deepfakes, etc.)
### Best Practices
- Don’t claim AI art as hand-made
- Consider artists whose styles you’re using
- Add watermarks/metadata for AI-generated content
- Follow platform guidelines
## Final Thoughts
Stable Diffusion proved that state-of-the-art AI can be open and accessible. Running on your own hardware means:
- Privacy (images stay local)
- Control (no content policies)
- Learning (study and modify the model)
This is how AI tools should be distributed.
The best creative tool is the one you control.