# FLUX.2 Klein

FLUX.2 Klein by Black Forest Labs is the successor to FLUX.1, delivering the same image quality at **20–30× the speed** on the same GPU. Where FLUX.1 dev took 10–30 seconds per image, FLUX.2 Klein generates in **under 0.5 seconds** on an RTX 4090. It's a 32B Diffusion Transformer (DiT) model released under the Apache 2.0 license, and as of January 2026 it's experimentally supported in Ollama.

## Key Features

* **< 0.5 second generation**: 20–30× faster than FLUX.1 dev on the same GPU
* **32B DiT architecture**: Same quality as FLUX.1 dev
* **Apache 2.0 license**: Full commercial use
* **Ollama support**: Experimental image generation via Ollama (Jan 2026)
* **ComfyUI compatible**: Drop-in replacement for FLUX.1 workflows
* **LoRA + ControlNet**: Community adapters available

## Requirements

| Component | Minimum                | Recommended   |
| --------- | ---------------------- | ------------- |
| GPU       | RTX 3090 24GB          | RTX 4090 24GB |
| VRAM      | 16GB (with offloading) | 24GB          |
| RAM       | 32GB                   | 64GB          |
| Disk      | 40GB                   | 60GB          |
| CUDA      | 12.0+                  | 12.1+         |

**Recommended Clore.ai GPU**: RTX 4090 24GB (\~$0.5–2/day) — sub-second generation
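
Before renting, a quick preflight check against the table's minimums takes only the stdlib. This is a minimal sketch: the 40 GB threshold comes from the table above, and the function name and path are illustrative.

```python
import shutil

# Disk minimum from the requirements table above (GB).
MIN_DISK_GB = 40

def enough_disk(path: str = "/", min_gb: int = MIN_DISK_GB) -> bool:
    """Check free space at `path` against the minimum disk requirement."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= min_gb

print(enough_disk("/"))
```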

### Speed Comparison: FLUX.1 vs FLUX.2 Klein

| GPU      | FLUX.1 dev (20 steps) | FLUX.2 Klein | Speedup |
| -------- | --------------------- | ------------ | ------- |
| RTX 3090 | \~25 sec              | \~1.2 sec    | 20×     |
| RTX 4090 | \~12 sec              | \~0.4 sec    | 30×     |
| RTX 5090 | \~8 sec               | \~0.25 sec   | 32×     |
| H100     | \~5 sec               | \~0.15 sec   | 33×     |
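
The speedup column is simply the ratio of the two timing columns; a quick pure-Python sanity check of the table's figures (timings are the ones quoted above, rounded ratios may differ by ±1×):

```python
# Per-image generation times from the table above: (FLUX.1 dev s, Klein s).
TIMINGS = {
    "RTX 3090": (25.0, 1.2),
    "RTX 4090": (12.0, 0.4),
    "RTX 5090": (8.0, 0.25),
    "H100": (5.0, 0.15),
}

def speedup(flux1_s: float, klein_s: float) -> float:
    """Ratio of FLUX.1 dev time to FLUX.2 Klein time."""
    return flux1_s / klein_s

def images_per_hour(seconds_per_image: float) -> int:
    """Sustained throughput, ignoring model-load and I/O time."""
    return int(3600 / seconds_per_image)

for gpu, (flux1_s, klein_s) in TIMINGS.items():
    print(f"{gpu}: {speedup(flux1_s, klein_s):.1f}x, "
          f"{images_per_hour(klein_s)} img/h")
```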

## Quick Start with diffusers

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# Generate image in < 0.5 seconds!
image = pipe(
    prompt="a cyberpunk GPU mining rig in a neon-lit server room, photorealistic",
    height=1024,
    width=1024,
    num_inference_steps=4,  # Klein needs only 4 steps!
    guidance_scale=3.5,
).images[0]

image.save("output.png")
```

### Memory-Efficient Mode (16GB GPUs)

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein",
    torch_dtype=torch.bfloat16
)
# Do NOT call pipe.to("cuda") here — offloading manages device placement.
pipe.enable_model_cpu_offload()  # Streams weights to the GPU as needed; fits on 16GB
pipe.vae.enable_tiling()         # Decodes the latent in tiles; saves ~2GB VRAM

image = pipe("a mountain landscape at sunset", num_inference_steps=4).images[0]
image.save("output.png")
```

## ComfyUI Workflow

FLUX.2 Klein works as a drop-in replacement in existing FLUX.1 ComfyUI workflows:

1. Download the FLUX.2 Klein checkpoint to `ComfyUI/models/diffusion_models/`
2. In your workflow, change the checkpoint node to point to FLUX.2 Klein
3. Reduce steps to 4 (instead of 20–50 for FLUX.1)
4. Set guidance scale to 3.0–4.0

```bash
# Download model for ComfyUI
cd ComfyUI/models/diffusion_models/
wget https://huggingface.co/black-forest-labs/FLUX.2-klein/resolve/main/flux2-klein.safetensors
```
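
An interrupted download is a common cause of checkpoint errors. A `.safetensors` file begins with an 8-byte little-endian header length followed by a JSON header, so a cheap integrity check needs only the stdlib (the function name is mine; this validates the header, not the tensor data):

```python
import json
import struct

def looks_like_safetensors(path: str) -> bool:
    """Cheap sanity check: parse the safetensors JSON header.

    The format starts with an unsigned 64-bit little-endian integer
    giving the byte length of the JSON header that follows it.
    """
    with open(path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) != 8:
            return False  # file shorter than the length prefix
        (header_len,) = struct.unpack("<Q", prefix)
        header = f.read(header_len)
        if len(header) != header_len:
            return False  # truncated download
        try:
            json.loads(header)
        except ValueError:
            return False  # corrupted header
        return True

# Example:
# looks_like_safetensors("flux2-klein.safetensors")
```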

## Batch Generation

With sub-second generation, FLUX.2 Klein enables massive batch processing:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein", torch_dtype=torch.bfloat16
).to("cuda")

prompts = [
    "a red sports car on a mountain road, cinematic",
    "a cozy coffee shop interior, warm lighting",
    "an astronaut floating above Earth, hyperrealistic",
    "a medieval castle in autumn, fantasy art",
    # ... add hundreds more
]

for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=4, guidance_scale=3.5).images[0]
    image.save(f"batch_{i:04d}.png")
    print(f"Generated {i+1}/{len(prompts)}")

# On RTX 4090: ~100 images in under 1 minute!
```
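
For long runs it helps to make the loop above resumable, so a crashed or interrupted job can pick up where it left off. A minimal stdlib sketch (the helper name is mine; it reuses the `batch_{i:04d}.png` naming from the snippet above):

```python
import os

def pending(prompts, out_dir="."):
    """Yield (index, prompt, path) for images not yet generated."""
    for i, prompt in enumerate(prompts):
        path = os.path.join(out_dir, f"batch_{i:04d}.png")
        if not os.path.exists(path):
            yield i, prompt, path

# Usage with the pipeline from the snippet above:
# for i, prompt, path in pending(prompts, "out"):
#     pipe(prompt, num_inference_steps=4).images[0].save(path)
```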

## LoRA Support

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein", torch_dtype=torch.bfloat16
).to("cuda")

# Load a LoRA trained on the FLUX architecture (replace with your repo/file)
pipe.load_lora_weights("your-lora/flux2-style-lora", weight_name="lora.safetensors")
pipe.fuse_lora(lora_scale=0.8)  # Bake the LoRA into the base weights

image = pipe("a portrait in the trained style", num_inference_steps=4).images[0]
# pipe.unfuse_lora()  # Restore the base weights before loading another LoRA
```

## Tips for Clore.ai Users

* **Batch processing king**: at ~0.4 sec/image, an RTX 4090 sustains roughly 9,000 images per hour
* **4 steps only**: Don't use more — Klein is optimized for 4 steps (more doesn't improve quality)
* **Same LoRAs as FLUX.1**: Most FLUX.1 LoRAs are compatible with Klein
* **ComfyUI drop-in**: Just swap the checkpoint, change steps to 4
* **RTX 3090 is viable**: 1.2 sec/image is still great at $0.3/day
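
Plugging the figures quoted above into a rough cost estimate (the timing and daily rental price are the document's numbers; real marketplace prices vary):

```python
def cost_per_1k_images(sec_per_image: float, usd_per_day: float) -> float:
    """Rental cost of generating 1,000 images at a steady rate."""
    usd_per_second = usd_per_day / 86_400  # seconds in a day
    return 1_000 * sec_per_image * usd_per_second

# RTX 4090 at 0.4 s/image and $2/day (upper end of the quoted range):
print(f"${cost_per_1k_images(0.4, 2.0):.4f} per 1k images")
```

Even at the top of the quoted price range, the per-image rental cost is a fraction of a cent, which is why batch workloads dominate Klein usage.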

## Troubleshooting

| Issue                    | Solution                                                                |
| ------------------------ | ----------------------------------------------------------------------- |
| OOM on 24GB              | Use `enable_model_cpu_offload()` + `vae.enable_tiling()`                |
| Blurry images            | Ensure `num_inference_steps=4`, not less. Check guidance\_scale 3.0–4.0 |
| Slow first generation    | Normal — model loads on first call (\~30s). Subsequent: sub-second      |
| ComfyUI checkpoint error | Ensure you have the `.safetensors` file, not the diffusers format       |

## Further Reading

* [FLUX.1 Guide](https://docs.clore.ai/guides/image-generation/flux) — original FLUX guide with LoRA and ControlNet details
* [ComfyUI Guide](https://docs.clore.ai/guides/image-generation/comfyui) — ComfyUI setup and workflows
* [Black Forest Labs Blog](https://blackforestlabs.ai/)
* [HuggingFace Model](https://huggingface.co/black-forest-labs/FLUX.2-klein)
