# Kohya Training

Train LoRA, Dreambooth, and full fine-tunes for Stable Diffusion using Kohya's trainer.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is Kohya?

Kohya\_ss is a training toolkit for:

* **LoRA** - Lightweight adapters (most popular)
* **Dreambooth** - Subject/style training
* **Full fine-tune** - Complete model training
* **LyCORIS** - Advanced LoRA variants

## Requirements

| Training Type     | Min VRAM | Recommended |
| ----------------- | -------- | ----------- |
| LoRA SD 1.5       | 6GB      | RTX 3060    |
| LoRA SDXL         | 12GB     | RTX 3090    |
| Dreambooth SD 1.5 | 12GB     | RTX 3090    |
| Dreambooth SDXL   | 24GB     | RTX 4090    |

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
apt-get update && apt-get install -y git libgl1 libglib2.0-0 && \
cd /workspace && \
git clone https://github.com/bmaltais/kohya_ss.git && \
cd kohya_ss && \
pip install -r requirements.txt && \
pip install xformers && \
python kohya_gui.py --listen 0.0.0.0 --server_port 7860
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Open `https://YOUR_HTTP_PUB_URL` in your browser to reach the web UI.

## Using the Web UI

1. Access the UI at your `http_pub` URL (or `http://<proxy>:<port>`)
2. Select training type (LoRA, Dreambooth, etc.)
3. Configure settings
4. Start training

## Dataset Preparation

### Folder Structure

```
/workspace/dataset/
├── 10_mysubject/           # Repeats_conceptname
│   ├── image1.png
│   ├── image1.txt          # Caption file
│   ├── image2.png
│   └── image2.txt
└── 10_regularization/      # Optional reg images
    ├── reg1.png
    └── reg1.txt
```
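The numeric folder prefix sets how many times each image is repeated per epoch, which in turn determines total optimizer steps. A quick sketch of the arithmetic (all numbers are hypothetical examples, not recommendations):

```python
# Rough estimate of total training steps from the Kohya folder convention.
num_images = 20    # images in 10_mysubject/
repeats = 10       # the "10_" prefix on the folder name
epochs = 10        # --max_train_epochs
batch_size = 1     # --train_batch_size

steps_per_epoch = (num_images * repeats) // batch_size
total_steps = steps_per_epoch * epochs
print(total_steps)  # 2000
```

Keeping total steps in the low thousands is a common starting point for LoRA; very high repeat counts are a frequent cause of overfitting.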

### Image Requirements

* **Resolution:** 512x512 (SD 1.5) or 1024x1024 (SDXL)
* **Format:** PNG or JPG
* **Quantity:** 10-50 images for LoRA
* **Quality:** Clear, well-lit, varied angles

### Caption Files

Create a `.txt` file with the same name as the image:

**myimage.txt:**

```
a photo of sks person, professional portrait, studio lighting, high quality
```

### Auto-Captioning

Use BLIP for automatic captions:

```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import os

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to("cuda")

for img_file in os.listdir("./images"):
    if img_file.lower().endswith(('.png', '.jpg', '.jpeg')):
        image = Image.open(f"./images/{img_file}").convert("RGB")
        inputs = processor(image, return_tensors="pt").to("cuda")
        output = model.generate(**inputs, max_new_tokens=50)
        caption = processor.decode(output[0], skip_special_tokens=True)

        txt_file = img_file.rsplit('.', 1)[0] + '.txt'
        with open(f"./images/{txt_file}", 'w') as f:
            f.write(caption)
```

## LoRA Training (SD 1.5)

### Configuration

**In Kohya UI:**

| Setting       | Value                          |
| ------------- | ------------------------------ |
| Model         | runwayml/stable-diffusion-v1-5 |
| Network Rank  | 32-128                         |
| Network Alpha | 16-64                          |
| Learning Rate | 1e-4                           |
| Batch Size    | 1-4                            |
| Epochs        | 10-20                          |
| Optimizer     | AdamW8bit                      |

### Command Line Training

```bash
accelerate launch --num_cpu_threads_per_process=2 train_network.py \
    --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
    --train_data_dir="/workspace/dataset" \
    --output_dir="/workspace/output" \
    --output_name="my_lora" \
    --resolution=512 \
    --train_batch_size=1 \
    --max_train_epochs=10 \
    --learning_rate=1e-4 \
    --network_module=networks.lora \
    --network_dim=32 \
    --network_alpha=16 \
    --mixed_precision=fp16 \
    --save_precision=fp16 \
    --optimizer_type=AdamW8bit \
    --lr_scheduler=cosine \
    --cache_latents \
    --xformers \
    --save_every_n_epochs=2
```

## LoRA Training (SDXL)

```bash
accelerate launch train_network.py \
    --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
    --train_data_dir="/workspace/dataset" \
    --output_dir="/workspace/output" \
    --output_name="my_sdxl_lora" \
    --resolution=1024 \
    --train_batch_size=1 \
    --max_train_epochs=10 \
    --learning_rate=1e-4 \
    --network_module=networks.lora \
    --network_dim=32 \
    --network_alpha=16 \
    --mixed_precision=bf16 \
    --save_precision=fp16 \
    --optimizer_type=Adafactor \
    --cache_latents \
    --xformers \
    --save_every_n_epochs=2
```

## Dreambooth Training

### Subject Training

```bash
accelerate launch train_dreambooth.py \
    --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
    --instance_data_dir="/workspace/dataset/instance" \
    --class_data_dir="/workspace/dataset/class" \
    --output_dir="/workspace/output" \
    --instance_prompt="a photo of sks person" \
    --class_prompt="a photo of person" \
    --with_prior_preservation \
    --prior_loss_weight=1.0 \
    --num_class_images=200 \
    --resolution=512 \
    --train_batch_size=1 \
    --learning_rate=2e-6 \
    --max_train_steps=1000 \
    --mixed_precision=fp16 \
    --gradient_checkpointing
```

### Style Training

```bash
accelerate launch train_dreambooth.py \
    --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
    --instance_data_dir="/workspace/dataset/style" \
    --output_dir="/workspace/output" \
    --instance_prompt="painting in the style of xyz" \
    --resolution=512 \
    --train_batch_size=1 \
    --learning_rate=5e-6 \
    --max_train_steps=2000 \
    --mixed_precision=fp16
```

## Training Tips

### Optimal Settings

| Parameter     | Person/Character | Style | Object |
| ------------- | ---------------- | ----- | ------ |
| Network Rank  | 64-128           | 32-64 | 32     |
| Network Alpha | 32-64            | 16-32 | 16     |
| Learning Rate | 1e-4             | 5e-5  | 1e-4   |
| Epochs        | 15-25            | 10-15 | 10-15  |

### Avoiding Overfitting

* Use regularization images
* Lower learning rate
* Fewer epochs
* Increase network alpha

### Avoiding Underfitting

* More training images
* Higher learning rate
* More epochs
* Lower network alpha

## Monitoring Training

### TensorBoard

```bash
tensorboard --logdir /workspace/output/logs --port 6006 --bind_all
```

### Key Metrics

* **loss** - Should decrease then stabilize
* **lr** - Learning rate schedule
* **epoch** - Training progress
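Raw per-step loss is noisy, so judge the trend, not individual values. TensorBoard's smoothing slider applies an exponential moving average; a standalone sketch of the same idea, with made-up loss numbers, for eyeballing whether loss is decreasing then stabilizing:

```python
def ema_smooth(values, weight=0.9):
    """Exponential moving average, as used by TensorBoard's smoothing slider."""
    smoothed, last = [], values[0]
    for v in values:
        last = last * weight + (1 - weight) * v
        smoothed.append(last)
    return smoothed

# Hypothetical per-step losses: noisy, trending down, then stabilizing
losses = [0.30, 0.24, 0.28, 0.19, 0.21, 0.15, 0.16, 0.14, 0.15, 0.14]
print([round(x, 3) for x in ema_smooth(losses)])
```

If the smoothed curve rises again after a long plateau, the run is likely overfitting; an earlier checkpoint is usually the better pick.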

## Testing Your LoRA

### With Automatic1111

Copy LoRA to:

```
stable-diffusion-webui/models/Lora/my_lora.safetensors
```

Use in prompt:

```
<lora:my_lora:0.8> a photo of sks person
```

### With ComfyUI

Load LoRA node and connect to model.

### With Diffusers

```python
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

pipe.load_lora_weights("/workspace/output/my_lora.safetensors")

image = pipe("a photo of sks person, professional portrait").images[0]
image.save("lora_test.png")
```

## Advanced Training

### LyCORIS (LoHa, LoKR)

```bash
accelerate launch train_network.py \
    --network_module=lycoris.kohya \
    --network_args "algo=loha" "conv_dim=4" "conv_alpha=2" \
    ...
```

### Textual Inversion

```bash
accelerate launch train_textual_inversion.py \
    --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
    --train_data_dir="/workspace/dataset" \
    --learnable_property="style" \
    --placeholder_token="<my-style>" \
    --initializer_token="art" \
    --resolution=512 \
    --train_batch_size=1 \
    --max_train_steps=3000 \
    --learning_rate=5e-4
```

## Saving & Exporting

### Download Trained Model

```bash
scp -P <port> root@<proxy>:/workspace/output/my_lora.safetensors ./
```

### Convert Formats

```python
# SafeTensors to PyTorch
from safetensors.torch import load_file, save_file
import torch

state_dict = load_file("model.safetensors")
torch.save(state_dict, "model.pt")
```

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## FLUX LoRA Training

Train LoRA adapters for FLUX.1-dev and FLUX.1-schnell — the latest generation of diffusion transformer models with superior quality.

### VRAM Requirements

| Model             | Min VRAM | Recommended GPU |
| ----------------- | -------- | --------------- |
| FLUX.1-schnell    | 16GB     | RTX 4080 / 3090 |
| FLUX.1-dev        | 24GB     | RTX 4090        |
| FLUX.1-dev (bf16) | 40GB+    | A100 40GB       |

> **Note:** FLUX uses DiT (Diffusion Transformer) architecture — training dynamics differ significantly from SD 1.5 / SDXL.

### Installation for FLUX

Install PyTorch with CUDA 12.4 support:

```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
pip install xformers --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
pip install accelerate sentencepiece protobuf
```

### FLUX LoRA Configuration (flux\_lora.toml)

```toml
[general]
shuffle_caption = false
caption_extension = ".txt"
keep_tokens = 1

[datasets]
[[datasets.subsets]]
image_dir = "/workspace/dataset/train"
caption_extension = ".txt"
num_repeats = 5
resolution = [512, 512]

[training]
pretrained_model_name_or_path = "black-forest-labs/FLUX.1-dev"
output_dir = "/workspace/output"
output_name = "my_flux_lora"

# FLUX-specific: use bf16 (NOT fp16 — FLUX requires bf16)
mixed_precision = "bf16"
save_precision = "bf16"
full_bf16 = true

train_batch_size = 1
max_train_epochs = 20
gradient_checkpointing = true
gradient_accumulation_steps = 4

# FLUX LoRA parameters — lower LR than SDXL!
learning_rate = 1e-4
lr_scheduler = "cosine_with_restarts"
lr_warmup_steps = 100

# Network configuration
network_module = "networks.lora_flux"
network_dim = 16           # FLUX: smaller dim works well (16-64)
network_alpha = 16         # Set equal to network_dim

# FLUX-specific options
t5xxl_max_token_length = 512
apply_t5_attn_mask = true

# Optimizer — Adafactor works well for FLUX
optimizer_type = "adafactor"
optimizer_args = ["scale_parameter=False", "relative_step=False", "warmup_init=False"]

# Memory saving
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true

# Sampling during training (optional preview)
sample_every_n_epochs = 5
sample_prompts = "/workspace/sample_prompts.txt"
```

### FLUX LoRA Training Command

```bash
# Single GPU
accelerate launch train_network.py \
    --config_file flux_lora.toml \
    --network_module networks.lora_flux \
    --network_dim 16 \
    --network_alpha 16 \
    --mixed_precision bf16 \
    --full_bf16

# With explicit parameters (no toml)
accelerate launch train_network.py \
    --pretrained_model_name_or_path "black-forest-labs/FLUX.1-dev" \
    --train_data_dir "/workspace/dataset" \
    --output_dir "/workspace/output" \
    --output_name "my_flux_lora" \
    --network_module networks.lora_flux \
    --network_dim 16 \
    --network_alpha 16 \
    --learning_rate 1e-4 \
    --max_train_epochs 20 \
    --train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --mixed_precision bf16 \
    --full_bf16 \
    --optimizer_type adafactor \
    --cache_latents \
    --cache_text_encoder_outputs \
    --t5xxl_max_token_length 512 \
    --apply_t5_attn_mask \
    --save_every_n_epochs 5
```

### FLUX vs SDXL: Key Differences

| Parameter      | SDXL          | FLUX.1              |
| -------------- | ------------- | ------------------- |
| Learning Rate  | 1e-3 to 1e-4  | **1e-4 to 5e-5**    |
| Precision      | fp16 or bf16  | **bf16 REQUIRED**   |
| Network Module | networks.lora | networks.lora\_flux |
| Network Dim    | 32–128        | 8–64 (smaller)      |
| Optimizer      | AdamW8bit     | Adafactor           |
| Min VRAM       | 12GB          | 16–24GB             |
| Architecture   | U-Net         | DiT (Transformer)   |

### Learning Rate Guide for FLUX

```toml
# Conservative (safer, less chance of overfitting)
learning_rate = 5e-5

# Standard (good starting point)
learning_rate = 1e-4

# Aggressive (more expressive, risk of artifacts)
learning_rate = 2e-4
```

> **Tip:** FLUX is more sensitive to learning rate than SDXL. Start at `1e-4` and reduce to `5e-5` if you see quality issues. For SDXL, `1e-3` is common — avoid this for FLUX.

### Testing FLUX LoRA

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Load your trained LoRA
pipe.load_lora_weights("/workspace/output/my_flux_lora.safetensors")

image = pipe(
    prompt="a photo of sks person, professional portrait, studio lighting",
    num_inference_steps=28,
    guidance_scale=3.5,
    width=1024,
    height=1024,
).images[0]

image.save("flux_lora_test.png")
```

***

## Troubleshooting

### OOM Error

* Reduce batch size to 1
* Enable gradient checkpointing
* Use 8bit optimizer
* Lower resolution
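Reducing batch size need not change training dynamics: gradient accumulation keeps the effective batch size constant while per-step activation memory scales with the micro-batch. A sketch of the arithmetic (numbers are illustrative):

```python
# Effective batch size under gradient accumulation (illustrative numbers).
# Per-step activation memory scales with the micro-batch, not the effective batch.
target_effective_batch = 4

for micro_batch in (4, 2, 1):
    accum_steps = target_effective_batch // micro_batch
    print(f"batch_size={micro_batch}, accumulation={accum_steps} "
          f"-> effective batch {micro_batch * accum_steps}")
```

In Kohya terms: halving `--train_batch_size` while doubling `--gradient_accumulation_steps` trades wall-clock time for VRAM headroom.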

### Poor Results

* More/better training images
* Adjust learning rate
* Check captions match images
* Try different network rank

### Training Crashes

* Check CUDA version
* Update xformers
* Reduce batch size
* Check disk space

### FLUX-Specific Issues

* **"bf16 not supported"** — use an Ampere-or-newer GPU (RTX 30/40 series, A100)
* **OOM on FLUX.1-dev** — Switch to FLUX.1-schnell (needs 16GB) or enable `cache_text_encoder_outputs`
* **Blurry results** — Increase `network_dim` to 32–64, lower learning rate to `5e-5`
* **NaN loss** — Disable `full_bf16`, check your dataset for corrupted images
