# Kohya Training

Train LoRA, Dreambooth, and full fine-tunes for Stable Diffusion using Kohya's trainer.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is Kohya?

Kohya\_ss is a training toolkit for:

* **LoRA** - Lightweight adapters (most popular)
* **Dreambooth** - Subject/style training
* **Full fine-tune** - Complete model training
* **LyCORIS** - Advanced LoRA variants

## Requirements

| Training Type     | Min VRAM | Recommended |
| ----------------- | -------- | ----------- |
| LoRA SD 1.5       | 6GB      | RTX 3060    |
| LoRA SDXL         | 12GB     | RTX 3090    |
| Dreambooth SD 1.5 | 12GB     | RTX 3090    |
| Dreambooth SDXL   | 24GB     | RTX 4090    |

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
apt-get update && apt-get install -y git libgl1 libglib2.0-0 && \
cd /workspace && \
git clone --recursive https://github.com/bmaltais/kohya_ss.git && \
cd kohya_ss && \
pip install -r requirements.txt && \
pip install xformers && \
python kohya_gui.py --listen 0.0.0.0 --server_port 7860
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

## Using the Web UI

1. Open the Kohya UI at your `http_pub` URL (see above)
2. Select training type (LoRA, Dreambooth, etc.)
3. Configure settings
4. Start training

## Dataset Preparation

### Folder Structure

```
/workspace/dataset/
└── 10_mysubject/           # Repeats_conceptname
    ├── image1.png
    ├── image1.txt          # Caption file
    ├── image2.png
    └── image2.txt

/workspace/reg/             # Optional regularization images, passed via --reg_data_dir
└── 1_person/
    ├── reg1.png
    └── reg1.txt
```
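
If your images start out in one flat folder, a small script can lay out this structure. A minimal sketch, assuming source images in `/workspace/raw` and 10 repeats for a concept named `mysubject` (adjust both to your project):

```python
import shutil
from pathlib import Path

src = Path("/workspace/raw")                   # assumed: flat folder of source images
dst = Path("/workspace/dataset/10_mysubject")  # 10 repeats, concept "mysubject"
dst.mkdir(parents=True, exist_ok=True)

for i, img in enumerate(sorted(src.glob("*"))):
    if img.suffix.lower() in (".png", ".jpg", ".jpeg"):
        # Copy the image and create an empty caption file to fill in later
        shutil.copy(img, dst / f"image{i:03d}{img.suffix.lower()}")
        (dst / f"image{i:03d}.txt").touch()
```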

### Image Requirements

* **Resolution:** 512x512 (SD 1.5) or 1024x1024 (SDXL); see the resize sketch below
* **Format:** PNG or JPG
* **Quantity:** 10-50 images for LoRA
* **Quality:** Clear, well-lit, varied angles
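
Kohya can handle mixed aspect ratios via bucketing (`--enable_bucket`), but if you prefer uniform inputs you can center-crop and resize up front. A minimal Pillow sketch, assuming the dataset folder from above and a 512px target (use 1024 for SDXL):

```python
from pathlib import Path
from PIL import Image

SIZE = 512  # use 1024 for SDXL

for path in Path("/workspace/dataset/10_mysubject").glob("*"):
    if path.suffix.lower() in (".png", ".jpg", ".jpeg"):
        img = Image.open(path).convert("RGB")
        # Center-crop to a square, then resize to the target resolution
        side = min(img.size)
        left = (img.width - side) // 2
        top = (img.height - side) // 2
        img = img.crop((left, top, left + side, top + side))
        img.resize((SIZE, SIZE), Image.LANCZOS).save(path)
```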

### Caption Files

Create a `.txt` file with the same name as the image:

**myimage.txt:**

```
a photo of sks person, professional portrait, studio lighting, high quality
```

### Auto-Captioning

Use BLIP for automatic captions:

```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import os

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to("cuda")

for img_file in os.listdir("./images"):
    if img_file.lower().endswith(('.png', '.jpg', '.jpeg')):
        # Convert to RGB so RGBA/palette images don't break the processor
        image = Image.open(f"./images/{img_file}").convert("RGB")
        inputs = processor(image, return_tensors="pt").to("cuda")
        output = model.generate(**inputs, max_new_tokens=50)
        caption = processor.decode(output[0], skip_special_tokens=True)

        # Write the caption next to the image, with the same base name
        txt_file = img_file.rsplit('.', 1)[0] + '.txt'
        with open(f"./images/{txt_file}", 'w') as f:
            f.write(caption)
```
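
BLIP captions are generic by design, so skim and edit them afterwards, and prepend your trigger word (e.g. `sks`) to each file, since BLIP cannot know it.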

## LoRA Training (SD 1.5)

### Configuration

**In Kohya UI:**

| Setting       | Value                          |
| ------------- | ------------------------------ |
| Model         | runwayml/stable-diffusion-v1-5 |
| Network Rank  | 32-128                         |
| Network Alpha | 16-64                          |
| Learning Rate | 1e-4                           |
| Batch Size    | 1-4                            |
| Epochs        | 10-20                          |
| Optimizer     | AdamW8bit                      |
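
Epochs interact with the folder repeats from the dataset structure: total optimizer steps are images × repeats × epochs / batch size. A quick sanity check with illustrative numbers:

```python
images, repeats, epochs, batch_size = 20, 10, 10, 1
steps = images * repeats * epochs // batch_size
print(steps)  # 2000, inside the 1000-3000 range commonly cited for SD 1.5 LoRAs
```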

### Command Line Training

```bash
accelerate launch --num_cpu_threads_per_process=2 train_network.py \
    --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
    --train_data_dir="/workspace/dataset" \
    --output_dir="/workspace/output" \
    --output_name="my_lora" \
    --resolution=512 \
    --train_batch_size=1 \
    --max_train_epochs=10 \
    --learning_rate=1e-4 \
    --network_module=networks.lora \
    --network_dim=32 \
    --network_alpha=16 \
    --mixed_precision=fp16 \
    --save_precision=fp16 \
    --optimizer_type=AdamW8bit \
    --lr_scheduler=cosine \
    --cache_latents \
    --xformers \
    --save_every_n_epochs=2
```

## LoRA Training (SDXL)

```bash
accelerate launch sdxl_train_network.py \
    --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
    --train_data_dir="/workspace/dataset" \
    --output_dir="/workspace/output" \
    --output_name="my_sdxl_lora" \
    --resolution=1024 \
    --train_batch_size=1 \
    --max_train_epochs=10 \
    --learning_rate=1e-4 \
    --network_module=networks.lora \
    --network_dim=32 \
    --network_alpha=16 \
    --mixed_precision=bf16 \
    --save_precision=fp16 \
    --optimizer_type=Adafactor \
    --optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
    --cache_latents \
    --xformers \
    --save_every_n_epochs=2
```

## Dreambooth Training

The examples below follow the diffusers `train_dreambooth.py` argument style. Kohya's own sd-scripts equivalent is `train_db.py`, which instead reads instance and regularization images from `--train_data_dir`/`--reg_data_dir` using the repeats-in-folder-name convention shown earlier.

### Subject Training

```bash
accelerate launch train_dreambooth.py \
    --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
    --instance_data_dir="/workspace/dataset/instance" \
    --class_data_dir="/workspace/dataset/class" \
    --output_dir="/workspace/output" \
    --instance_prompt="a photo of sks person" \
    --class_prompt="a photo of person" \
    --with_prior_preservation \
    --prior_loss_weight=1.0 \
    --num_class_images=200 \
    --resolution=512 \
    --train_batch_size=1 \
    --learning_rate=2e-6 \
    --max_train_steps=1000 \
    --mixed_precision=fp16 \
    --gradient_checkpointing
```
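
If `class_data_dir` contains fewer than `num_class_images`, the diffusers script generates the missing class images from `class_prompt` before training begins, so the first run takes noticeably longer.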

### Style Training

```bash
accelerate launch train_dreambooth.py \
    --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
    --instance_data_dir="/workspace/dataset/style" \
    --output_dir="/workspace/output" \
    --instance_prompt="painting in the style of xyz" \
    --resolution=512 \
    --train_batch_size=1 \
    --learning_rate=5e-6 \
    --max_train_steps=2000 \
    --mixed_precision=fp16
```

## Training Tips

### Optimal Settings

| Parameter     | Person/Character | Style | Object |
| ------------- | ---------------- | ----- | ------ |
| Network Rank  | 64-128           | 32-64 | 32     |
| Network Alpha | 32-64            | 16-32 | 16     |
| Learning Rate | 1e-4             | 5e-5  | 1e-4   |
| Epochs        | 15-25            | 10-15 | 10-15  |

### Avoiding Overfitting

* Use regularization images
* Lower learning rate
* Fewer epochs
* Lower network alpha relative to rank (see the sketch below)

### Avoiding Underfitting

* More training images
* Higher learning rate
* More epochs
* Raise network alpha toward the rank value
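
The alpha advice works because LoRA scales its update by `alpha / rank`: lowering alpha shrinks every weight update (a regularizing effect), while raising it strengthens them. A quick illustration:

```python
def lora_scale(alpha: int, rank: int) -> float:
    # LoRA applies W + (alpha / rank) * (B @ A); this factor scales
    # how strongly the learned adapter perturbs the base weights
    return alpha / rank

print(lora_scale(16, 32))  # 0.5: gentler updates, helps against overfitting
print(lora_scale(32, 32))  # 1.0: full-strength updates, learns faster
```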

## Monitoring Training

### TensorBoard

Logging is opt-in: pass `--logging_dir=/workspace/output/logs` in your training command, then point TensorBoard at it:

```bash
tensorboard --logdir /workspace/output/logs --port 6006 --bind_all
```

### Key Metrics

* **loss** - Should decrease then stabilize
* **lr** - Learning rate schedule
* **epoch** - Training progress

## Testing Your LoRA

### With Automatic1111

Copy LoRA to:

```
stable-diffusion-webui/models/Lora/my_lora.safetensors
```

Use in prompt:

```
<lora:my_lora:0.8> a photo of sks person
```

### With ComfyUI

Add a **Load LoRA** node between the checkpoint loader and the KSampler's model input (it patches both the MODEL and CLIP connections).

### With Diffusers

```python
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

pipe.load_lora_weights("/workspace/output/my_lora.safetensors")

image = pipe("a photo of sks person, professional portrait").images[0]
image.save("lora_test.png")
```

## Advanced Training

### LyCORIS (LoHa, LoKr)

LyCORIS algorithms require the LyCORIS package (`pip install lycoris_lora`):

```bash
accelerate launch train_network.py \
    --network_module=lycoris.kohya \
    --network_args "algo=loha" "conv_dim=4" "conv_alpha=2" \
    ...
```

### Textual Inversion

As with Dreambooth, the flags here follow the diffusers textual inversion example; sd-scripts' own `train_textual_inversion.py` uses `--token_string`/`--init_word`/`--num_vectors_per_token` instead.

```bash
accelerate launch train_textual_inversion.py \
    --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
    --train_data_dir="/workspace/dataset" \
    --learnable_property="style" \
    --placeholder_token="<my-style>" \
    --initializer_token="art" \
    --resolution=512 \
    --train_batch_size=1 \
    --max_train_steps=3000 \
    --learning_rate=5e-4
```

## Saving & Exporting

### Download Trained Model

```bash
scp -P <port> root@<proxy>:/workspace/output/my_lora.safetensors ./
```

### Convert Formats

```python
# SafeTensors to PyTorch
from safetensors.torch import load_file, save_file
import torch

state_dict = load_file("model.safetensors")
torch.save(state_dict, "model.pt")
```

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## FLUX LoRA Training

Train LoRA adapters for FLUX.1-dev and FLUX.1-schnell — the latest generation of diffusion transformer models with superior quality.

### VRAM Requirements

| Model             | Min VRAM | Recommended GPU |
| ----------------- | -------- | --------------- |
| FLUX.1-schnell    | 16GB     | RTX 4080 / 3090 |
| FLUX.1-dev        | 24GB     | RTX 4090        |
| FLUX.1-dev (bf16) | 40GB+    | A100 40GB       |

> **Note:** FLUX uses DiT (Diffusion Transformer) architecture — training dynamics differ significantly from SD 1.5 / SDXL.

### Installation for FLUX

FLUX training lives on the `sd3` branch of kohya-ss/sd-scripts, so make sure your checkout includes it. Install PyTorch with CUDA 12.4 support:

```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
pip install xformers --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
pip install accelerate sentencepiece protobuf
```

### FLUX LoRA Configuration (flux\_lora.toml)

```toml
[general]
shuffle_caption = false
caption_extension = ".txt"
keep_tokens = 1

[[datasets]]
resolution = [512, 512]

[[datasets.subsets]]
image_dir = "/workspace/dataset/train"
caption_extension = ".txt"
num_repeats = 5

[training]
pretrained_model_name_or_path = "black-forest-labs/FLUX.1-dev"
output_dir = "/workspace/output"
output_name = "my_flux_lora"

# FLUX-specific: use bf16 (NOT fp16 — FLUX requires bf16)
mixed_precision = "bf16"
save_precision = "bf16"
full_bf16 = true

train_batch_size = 1
max_train_epochs = 20
gradient_checkpointing = true
gradient_accumulation_steps = 4

# FLUX LoRA parameters — lower LR than SDXL!
learning_rate = 1e-4
lr_scheduler = "cosine_with_restarts"
lr_warmup_steps = 100

# Network configuration
network_module = "networks.lora_flux"
network_dim = 16           # FLUX: smaller dim works well (16-64)
network_alpha = 16         # Set equal to network_dim

# FLUX-specific options
t5xxl_max_token_length = 512
apply_t5_attn_mask = true

# Optimizer — Adafactor works well for FLUX
optimizer_type = "adafactor"
optimizer_args = ["scale_parameter=False", "relative_step=False", "warmup_init=False"]

# Memory saving
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true

# Sampling during training (optional preview)
sample_every_n_epochs = 5
sample_prompts = "/workspace/sample_prompts.txt"
```
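
The `sample_prompts.txt` file referenced above holds one prompt per line, optionally with inline generation flags (`--w`/`--h` for size, `--s` for steps, `--d` for seed). A minimal sketch; the prompt itself is just an example:

```
a photo of sks person, professional portrait --w 1024 --h 1024 --s 28 --d 42
```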

### FLUX LoRA Training Command

```bash
# Single GPU (note the FLUX-specific training script)
accelerate launch flux_train_network.py \
    --config_file flux_lora.toml \
    --network_module networks.lora_flux \
    --network_dim 16 \
    --network_alpha 16 \
    --mixed_precision bf16 \
    --full_bf16

# With explicit parameters (no toml). When training from local checkpoints,
# flux_train_network.py also expects --clip_l, --t5xxl, and --ae paths.
accelerate launch flux_train_network.py \
    --pretrained_model_name_or_path "black-forest-labs/FLUX.1-dev" \
    --train_data_dir "/workspace/dataset" \
    --output_dir "/workspace/output" \
    --output_name "my_flux_lora" \
    --network_module networks.lora_flux \
    --network_dim 16 \
    --network_alpha 16 \
    --learning_rate 1e-4 \
    --max_train_epochs 20 \
    --train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --mixed_precision bf16 \
    --full_bf16 \
    --optimizer_type adafactor \
    --cache_latents \
    --cache_text_encoder_outputs \
    --t5xxl_max_token_length 512 \
    --apply_t5_attn_mask \
    --save_every_n_epochs 5
```

### FLUX vs SDXL: Key Differences

| Parameter      | SDXL          | FLUX.1              |
| -------------- | ------------- | ------------------- |
| Learning Rate  | 1e-3 to 1e-4  | **1e-4 to 5e-5**    |
| Precision      | fp16 or bf16  | **bf16 REQUIRED**   |
| Network Module | networks.lora | networks.lora\_flux |
| Network Dim    | 32–128        | 8–64 (smaller)      |
| Optimizer      | AdamW8bit     | Adafactor           |
| Min VRAM       | 12GB          | 16–24GB             |
| Architecture   | U-Net         | DiT (Transformer)   |

### Learning Rate Guide for FLUX

```toml
# Conservative (safer, less chance of overfitting)
learning_rate = 5e-5

# Standard (good starting point)
learning_rate = 1e-4

# Aggressive (more expressive, risk of artifacts)
learning_rate = 2e-4
```

> **Tip:** FLUX is more sensitive to learning rate than SDXL. Start at `1e-4` and reduce to `5e-5` if you see quality issues. For SDXL, `1e-3` is common — avoid this for FLUX.

### Testing FLUX LoRA

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Load your trained LoRA
pipe.load_lora_weights("/workspace/output/my_flux_lora.safetensors")

image = pipe(
    prompt="a photo of sks person, professional portrait, studio lighting",
    num_inference_steps=28,
    guidance_scale=3.5,
    width=1024,
    height=1024,
).images[0]

image.save("flux_lora_test.png")
```

***

## Troubleshooting

### OOM Error

* Reduce batch size to 1
* Enable gradient checkpointing
* Use 8bit optimizer
* Lower resolution

### Poor Results

* More/better training images
* Adjust learning rate
* Check captions match images
* Try different network rank

### Training Crashes

* Check CUDA version
* Update xformers
* Reduce batch size
* Check disk space

### FLUX-Specific Issues

* **"bf16 not supported"** — Use A-series (Ampere+) or RTX 30/40 series GPUs
* **OOM on FLUX.1-dev** — Switch to FLUX.1-schnell (needs 16GB) or enable `cache_text_encoder_outputs`
* **Blurry results** — Increase `network_dim` to 32–64, lower learning rate to `5e-5`
* **NaN loss** — Disable `full_bf16`, check your dataset for corrupted images (see the scan sketch below)
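
To rule out corrupted images quickly, run a Pillow integrity scan. A minimal sketch, assuming the dataset lives under `/workspace/dataset`:

```python
from pathlib import Path
from PIL import Image

bad = []
for path in Path("/workspace/dataset").rglob("*"):
    if path.suffix.lower() in (".png", ".jpg", ".jpeg"):
        try:
            with Image.open(path) as img:
                img.verify()  # checks file integrity without a full decode
        except Exception as err:
            bad.append((path, err))

for path, err in bad:
    print(f"corrupt: {path} ({err})")
```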

