# Stable Diffusion: Open Sourcing Creativity

*ai · machine-learning*
Stable Diffusion changed everything. High-quality AI image generation, running on consumer hardware, with open weights. Here’s what it means.
## What’s Different
| Model | Access | Hardware | License |
|---|---|---|---|
| DALL-E 2 | API only | OpenAI servers | Closed |
| Midjourney | Discord bot | Their servers | Closed |
| Stable Diffusion | Open weights | Your GPU | Open |
You can run Stable Diffusion on a gaming GPU. You own the outputs. You can modify the model.
## How It Works

### Latent Diffusion
```text
Text prompt: "A sunset over mountains, oil painting"
        ↓
Text encoder (CLIP)
        ↓
Encoded text embedding
        ↓
Diffusion process (U-Net)
[Start with noise] → [Iteratively denoise] → [Latent image]
        ↓
VAE Decoder
        ↓
Final image (512x512)
```
**Key insight:** work in a compressed “latent space” instead of pixel space. Much more efficient.
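A quick back-of-the-envelope comparison makes the savings concrete (my arithmetic, using the 1×4×64×64 latent shape from the sampling sketch in the next section):

```python
# Pixel space: a 512x512 RGB image
pixel_values = 512 * 512 * 3   # 786,432 values

# Latent space: the 4-channel 64x64 tensor the U-Net actually denoises
latent_values = 64 * 64 * 4    # 16,384 values

ratio = pixel_values / latent_values
print(f"Latent space is {ratio:.0f}x smaller")  # prints: Latent space is 48x smaller
```

Every denoising step operates on 48× fewer values, which is a big part of why this fits on a gaming GPU.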
### The Diffusion Process
```python
import torch

# Simplified diffusion sampling. encode_text, denoise_step, and
# decode_latent stand in for the real CLIP, scheduler, and VAE calls.
def sample(model, prompt, steps=50):
    # Start with pure noise in latent space (4 channels, 64x64)
    latent = torch.randn(1, 4, 64, 64)

    # Get the text embedding for conditioning
    text_embedding = encode_text(prompt)

    # Iteratively denoise, from the noisiest timestep down to 0
    for t in reversed(range(steps)):
        # Predict the noise present at this timestep
        noise_pred = model(latent, t, text_embedding)
        # Remove a portion of the predicted noise
        latent = denoise_step(latent, noise_pred, t)

    # Decode the latent back into a 512x512 image with the VAE
    image = decode_latent(latent)
    return image
```
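To build intuition for why the loop converges, here is a toy scalar version with a "perfect" noise predictor. Everything in it is illustrative, not part of Stable Diffusion:

```python
def toy_sample(x0, noise, steps=50):
    """Toy scalar diffusion: start from a noisy value and iteratively
    remove the (perfectly predicted) noise, mirroring the loop above."""
    x = x0 + noise  # fully noised starting point
    for t in reversed(range(steps)):
        noise_pred = x - x0           # a real model would estimate this
        x = x - noise_pred / (t + 1)  # remove one "slice" of the noise
    return x

print(toy_sample(3.0, 5.0))  # prints 3.0: the original value recovered
```

Each iteration removes a fraction of the remaining noise, and the final step (t = 0) removes all of it, which is the same shape of computation the real sampler performs in latent space.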
## Running Locally

### Requirements
- GPU with 8GB+ VRAM (4GB with optimizations)
- Python 3.8+
- CUDA or MPS (Mac)
### Setup
```shell
# Clone the repository
git clone https://github.com/CompVis/stable-diffusion
cd stable-diffusion

# Create environment
conda env create -f environment.yaml
conda activate ldm

# Download weights (requires a Hugging Face account) and place them at:
#   models/ldm/stable-diffusion-v1/model.ckpt
```
### Basic Generation
```python
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "A serene Japanese garden with cherry blossoms, digital art"
image = pipe(prompt).images[0]
image.save("garden.png")
```
### With Parameters
```python
image = pipe(
    prompt="A cyberpunk city at night, neon lights, rain",
    negative_prompt="blurry, low quality, distorted",
    num_inference_steps=50,
    guidance_scale=7.5,
    height=768,
    width=512,
).images[0]
```
## Key Parameters

### Guidance Scale (CFG)
How closely to follow the prompt:
| Value | Effect |
|---|---|
| 1-3 | Creative, may ignore prompt |
| 7-8 | Balanced (recommended) |
| 15+ | Literal, may look artificial |
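Under the hood, classifier-free guidance runs two noise predictions per step, one conditioned on the prompt and one unconditional, then extrapolates between them. A minimal sketch of that combination (the numbers are made up for illustration):

```python
def apply_cfg(uncond_pred, cond_pred, guidance_scale):
    """Classifier-free guidance: push the prediction away from the
    unconditional output, toward the prompt-conditioned one."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(uncond_pred, cond_pred)]

# Made-up per-element noise predictions
uncond = [0.2, -0.1]
cond = [0.4, 0.3]

# Scale 1.0 reproduces the conditional prediction; larger scales
# exaggerate the difference, which is why high CFG looks "literal"
print(apply_cfg(uncond, cond, 7.5))
```

This is why very high values overshoot: the formula amplifies whatever direction the prompt pushes the prediction in.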
### Steps
More steps = more refined, but diminishing returns:
| Steps | Quality | Time |
|---|---|---|
| 20 | Decent | Fast |
| 50 | Good | Moderate |
| 100+ | Marginal improvement | Slow |
### Seed
```python
# Reproducible results: the same seed with the same prompt and
# settings regenerates the same image
generator = torch.Generator("cuda").manual_seed(42)
image = pipe(prompt, generator=generator).images[0]
```
## Advanced Techniques

### Img2Img
```python
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(...)

init_image = Image.open("sketch.png").convert("RGB")
image = pipe(
    prompt="A detailed oil painting of a landscape",
    image=init_image,
    strength=0.75,  # how much to change (0 = keep input, 1 = replace entirely)
).images[0]
```
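`strength` also determines how many denoising steps actually run: the input is noised partway into the schedule, and denoising resumes from there. Roughly (a simplification of the library's timestep logic, not its exact code):

```python
def img2img_steps(num_inference_steps, strength):
    """Approximate number of denoising steps img2img actually runs:
    the input is noised to `strength` of the schedule, then denoised
    from that point back to a clean image."""
    return min(int(num_inference_steps * strength), num_inference_steps)

print(img2img_steps(50, 0.75))  # prints 37
print(img2img_steps(50, 0.2))   # prints 10
```

So low strength is both gentler and faster, since most of the schedule is skipped.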
### Inpainting
```python
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(...)

image = Image.open("photo.png")
mask = Image.open("mask.png")  # white = area to regenerate

result = pipe(
    prompt="A cat sitting on the chair",
    image=image,
    mask_image=mask,
).images[0]
```
### Prompt Engineering
```python
# Basic
"a mountain landscape"

# Better
("a majestic mountain landscape at sunset, dramatic lighting, "
 "photorealistic, 8k, trending on artstation")

# Even better (with style references)
("a majestic mountain landscape at sunset, dramatic lighting, "
 "in the style of Albert Bierstadt, oil painting, "
 "high detail, museum quality")

# Negative prompts help too
negative_prompt = ("blurry, low quality, cartoon, anime, "
                   "watermark, text, signature")
```
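When iterating on prompts, a tiny helper to assemble them from parts keeps experiments organized. This is a hypothetical convenience function, not part of diffusers or any library:

```python
def build_prompt(subject, *modifiers):
    """Join a subject with style/quality modifiers into one prompt string."""
    return ", ".join([subject, *modifiers])

prompt = build_prompt(
    "a majestic mountain landscape at sunset",
    "dramatic lighting",
    "oil painting",
    "high detail",
)
print(prompt)
# prints: a majestic mountain landscape at sunset, dramatic lighting, oil painting, high detail
```

Keeping subject and modifiers separate makes it easy to swap styles while holding the scene constant.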
## Hardware Optimization

### Low VRAM Mode
```python
pipe.enable_attention_slicing()   # trade speed for memory
pipe.enable_vae_slicing()         # decode the image in slices
pipe.enable_model_cpu_offload()   # move unused components to CPU
```
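To see why half precision and offloading matter, here is a rough weight-memory estimate. The parameter counts are approximate, commonly cited figures for SD v1, and weights are only part of the VRAM budget (activations add more):

```python
# Approximate parameter counts for Stable Diffusion v1 components
params = {
    "unet": 860_000_000,
    "text_encoder": 123_000_000,
    "vae": 83_000_000,
}

total = sum(params.values())
fp32_gb = total * 4 / 1e9  # 4 bytes per parameter
fp16_gb = total * 2 / 1e9  # 2 bytes per parameter

print(f"fp32 weights: ~{fp32_gb:.1f} GB, fp16: ~{fp16_gb:.1f} GB")
# prints: fp32 weights: ~4.3 GB, fp16: ~2.1 GB
```

Loading in fp16 alone halves the weight footprint, which is why it is the default advice for 8GB cards.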
### Half Precision
```python
# Load weights in fp16 (halves memory use)
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    revision="fp16",
)
```
### On Mac (MPS)
```python
pipe = pipe.to("mps")
# May need specific settings on M1/M2
```
## The Ecosystem

### UIs
- Automatic1111 WebUI: Feature-rich, community favorite
- ComfyUI: Node-based, for power users
- InvokeAI: Clean, professional feel
### Models and Fine-tunes
- Stable Diffusion 2.x: Improved, different style
- SDXL: Higher resolution, better quality
- DreamBooth: Fine-tune on your own images
- LoRAs: Lightweight adaptation layers
## Ethical Considerations

### The Good
- Democratizes access to creative tools
- Anyone can experiment and learn
- Enables new forms of expression
### The Concerning
- Training data consent unclear
- Can mimic specific artists’ styles
- Potential for misuse (deepfakes, etc.)
### Best Practices
- Don’t claim AI art as hand-made
- Consider artists whose styles you’re using
- Add watermarks/metadata for AI-generated content
- Follow platform guidelines
## Final Thoughts
Stable Diffusion proved that state-of-the-art AI can be open and accessible. Running on your own hardware means:
- Privacy (images stay local)
- Control (no content policies)
- Learning (study and modify the model)
This is how AI tools should be distributed.
The best creative tool is the one you control.