# Kohya Training

Train LoRA, Dreambooth, and full fine-tunes for Stable Diffusion using Kohya's trainer.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is Kohya?

Kohya\_ss is a training toolkit for:

* **LoRA** - Lightweight adapters (most popular)
* **Dreambooth** - Subject/style training
* **Full fine-tune** - Complete model training
* **LyCORIS** - Advanced LoRA variants

## Requirements

| Training Type     | Min VRAM | Recommended |
| ----------------- | -------- | ----------- |
| LoRA SD 1.5       | 6GB      | RTX 3060    |
| LoRA SDXL         | 12GB     | RTX 3090    |
| Dreambooth SD 1.5 | 12GB     | RTX 3090    |
| Dreambooth SDXL   | 24GB     | RTX 4090    |

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
apt-get update && apt-get install -y git libgl1 libglib2.0-0 && \
cd /workspace && \
git clone https://github.com/bmaltais/kohya_ss.git && \
cd kohya_ss && \
pip install -r requirements.txt && \
pip install xformers && \
python kohya_gui.py --listen 0.0.0.0 --server_port 7860
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Open `https://YOUR_HTTP_PUB_URL` in your browser to reach the web UI.

## Using the Web UI

1. Access the UI at your `http_pub` URL (or `http://<proxy>:<port>`)
2. Select training type (LoRA, Dreambooth, etc.)
3. Configure settings
4. Start training

## Dataset Preparation

### Folder Structure

```
/workspace/dataset/
├── 10_mysubject/           # Repeats_conceptname
│   ├── image1.png
│   ├── image1.txt          # Caption file
│   ├── image2.png
│   └── image2.txt
└── 10_regularization/      # Optional reg images
    ├── reg1.png
    └── reg1.txt
```
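The numeric folder prefix sets how many times each image is repeated per epoch, which in turn determines total optimizer steps. A quick sketch of the arithmetic (all numbers are hypothetical examples, not recommendations):

```python
# Rough estimate of total training steps from the Kohya folder convention.
num_images = 20    # images in 10_mysubject/
repeats = 10       # the "10_" prefix on the folder name
epochs = 10        # --max_train_epochs
batch_size = 1     # --train_batch_size

steps_per_epoch = (num_images * repeats) // batch_size
total_steps = steps_per_epoch * epochs
print(total_steps)  # 2000
```

Keeping total steps in the low thousands is a common starting point for LoRA; very high repeat counts are a frequent cause of overfitting.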

### Image Requirements

* **Resolution:** 512x512 (SD 1.5) or 1024x1024 (SDXL)
* **Format:** PNG or JPG
* **Quantity:** 10-50 images for LoRA
* **Quality:** Clear, well-lit, varied angles

### Caption Files

Create a `.txt` file with the same name as the image:

**myimage.txt:**

```
a photo of sks person, professional portrait, studio lighting, high quality
```

### Auto-Captioning

Use BLIP for automatic captions:

```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import os

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to("cuda")

for img_file in os.listdir("./images"):
    if img_file.lower().endswith(('.png', '.jpg', '.jpeg')):
        image = Image.open(f"./images/{img_file}").convert("RGB")
        inputs = processor(image, return_tensors="pt").to("cuda")
        output = model.generate(**inputs, max_new_tokens=50)
        caption = processor.decode(output[0], skip_special_tokens=True)

        txt_file = img_file.rsplit('.', 1)[0] + '.txt'
        with open(f"./images/{txt_file}", 'w') as f:
            f.write(caption)
```

## LoRA Training (SD 1.5)

### Configuration

**In Kohya UI:**

| Setting       | Value                          |
| ------------- | ------------------------------ |
| Model         | runwayml/stable-diffusion-v1-5 |
| Network Rank  | 32-128                         |
| Network Alpha | 16-64                          |
| Learning Rate | 1e-4                           |
| Batch Size    | 1-4                            |
| Epochs        | 10-20                          |
| Optimizer     | AdamW8bit                      |

### Command Line Training

```bash
accelerate launch --num_cpu_threads_per_process=2 train_network.py \
    --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
    --train_data_dir="/workspace/dataset" \
    --output_dir="/workspace/output" \
    --output_name="my_lora" \
    --resolution=512 \
    --train_batch_size=1 \
    --max_train_epochs=10 \
    --learning_rate=1e-4 \
    --network_module=networks.lora \
    --network_dim=32 \
    --network_alpha=16 \
    --mixed_precision=fp16 \
    --save_precision=fp16 \
    --optimizer_type=AdamW8bit \
    --lr_scheduler=cosine \
    --cache_latents \
    --xformers \
    --save_every_n_epochs=2
```

## LoRA Training (SDXL)

```bash
accelerate launch train_network.py \
    --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
    --train_data_dir="/workspace/dataset" \
    --output_dir="/workspace/output" \
    --output_name="my_sdxl_lora" \
    --resolution=1024 \
    --train_batch_size=1 \
    --max_train_epochs=10 \
    --learning_rate=1e-4 \
    --network_module=networks.lora \
    --network_dim=32 \
    --network_alpha=16 \
    --mixed_precision=bf16 \
    --save_precision=fp16 \
    --optimizer_type=Adafactor \
    --cache_latents \
    --xformers \
    --save_every_n_epochs=2
```

## Dreambooth Training

### Subject Training

```bash
accelerate launch train_dreambooth.py \
    --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
    --instance_data_dir="/workspace/dataset/instance" \
    --class_data_dir="/workspace/dataset/class" \
    --output_dir="/workspace/output" \
    --instance_prompt="a photo of sks person" \
    --class_prompt="a photo of person" \
    --with_prior_preservation \
    --prior_loss_weight=1.0 \
    --num_class_images=200 \
    --resolution=512 \
    --train_batch_size=1 \
    --learning_rate=2e-6 \
    --max_train_steps=1000 \
    --mixed_precision=fp16 \
    --gradient_checkpointing
```

### Style Training

```bash
accelerate launch train_dreambooth.py \
    --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
    --instance_data_dir="/workspace/dataset/style" \
    --output_dir="/workspace/output" \
    --instance_prompt="painting in the style of xyz" \
    --resolution=512 \
    --train_batch_size=1 \
    --learning_rate=5e-6 \
    --max_train_steps=2000 \
    --mixed_precision=fp16
```

## Training Tips

### Optimal Settings

| Parameter     | Person/Character | Style | Object |
| ------------- | ---------------- | ----- | ------ |
| Network Rank  | 64-128           | 32-64 | 32     |
| Network Alpha | 32-64            | 16-32 | 16     |
| Learning Rate | 1e-4             | 5e-5  | 1e-4   |
| Epochs        | 15-25            | 10-15 | 10-15  |

### Avoiding Overfitting

* Use regularization images
* Lower learning rate
* Fewer epochs
* Increase network alpha

### Avoiding Underfitting

* More training images
* Higher learning rate
* More epochs
* Lower network alpha

## Monitoring Training

### TensorBoard

```bash
tensorboard --logdir /workspace/output/logs --port 6006 --bind_all
```

### Key Metrics

* **loss** - Should decrease then stabilize
* **lr** - Learning rate schedule
* **epoch** - Training progress
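Raw per-step loss is noisy, so judge the trend, not individual values. TensorBoard's smoothing slider applies an exponential moving average; a standalone sketch of the same idea, with made-up loss numbers, for eyeballing whether loss is decreasing then stabilizing:

```python
def ema_smooth(values, weight=0.9):
    """Exponential moving average, as used by TensorBoard's smoothing slider."""
    smoothed, last = [], values[0]
    for v in values:
        last = last * weight + (1 - weight) * v
        smoothed.append(last)
    return smoothed

# Hypothetical per-step losses: noisy, trending down, then stabilizing
losses = [0.30, 0.24, 0.28, 0.19, 0.21, 0.15, 0.16, 0.14, 0.15, 0.14]
print([round(x, 3) for x in ema_smooth(losses)])
```

If the smoothed curve rises again after a long plateau, the run is likely overfitting; an earlier checkpoint is usually the better pick.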

## Testing Your LoRA

### With Automatic1111

Copy LoRA to:

```
stable-diffusion-webui/models/Lora/my_lora.safetensors
```

Use in prompt:

```
<lora:my_lora:0.8> a photo of sks person
```

### With ComfyUI

Load LoRA node and connect to model.

### With Diffusers

```python
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

pipe.load_lora_weights("/workspace/output/my_lora.safetensors")

image = pipe("a photo of sks person, professional portrait").images[0]
image.save("lora_test.png")
```

## Advanced Training

### LyCORIS (LoHa, LoKR)

```bash
accelerate launch train_network.py \
    --network_module=lycoris.kohya \
    --network_args "algo=loha" "conv_dim=4" "conv_alpha=2" \
    ...
```

### Textual Inversion

```bash
accelerate launch train_textual_inversion.py \
    --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
    --train_data_dir="/workspace/dataset" \
    --learnable_property="style" \
    --placeholder_token="<my-style>" \
    --initializer_token="art" \
    --resolution=512 \
    --train_batch_size=1 \
    --max_train_steps=3000 \
    --learning_rate=5e-4
```

## Saving & Exporting

### Download Trained Model

```bash
scp -P <port> root@<proxy>:/workspace/output/my_lora.safetensors ./
```

### Convert Formats

```python
# SafeTensors to PyTorch
from safetensors.torch import load_file, save_file
import torch

state_dict = load_file("model.safetensors")
torch.save(state_dict, "model.pt")
```

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## FLUX LoRA Training

Train LoRA adapters for FLUX.1-dev and FLUX.1-schnell — the latest generation of diffusion transformer models with superior quality.

### VRAM Requirements

| Model             | Min VRAM | Recommended GPU |
| ----------------- | -------- | --------------- |
| FLUX.1-schnell    | 16GB     | RTX 4080 / 3090 |
| FLUX.1-dev        | 24GB     | RTX 4090        |
| FLUX.1-dev (bf16) | 40GB+    | A100 40GB       |

> **Note:** FLUX uses DiT (Diffusion Transformer) architecture — training dynamics differ significantly from SD 1.5 / SDXL.

### Installation for FLUX

Install PyTorch with CUDA 12.4 support:

```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
pip install xformers --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
pip install accelerate sentencepiece protobuf
```

### FLUX LoRA Configuration (flux\_lora.toml)

```toml
[general]
shuffle_caption = false
caption_extension = ".txt"
keep_tokens = 1

[datasets]
[[datasets.subsets]]
image_dir = "/workspace/dataset/train"
caption_extension = ".txt"
num_repeats = 5
resolution = [512, 512]

[training]
pretrained_model_name_or_path = "black-forest-labs/FLUX.1-dev"
output_dir = "/workspace/output"
output_name = "my_flux_lora"

# FLUX-specific: use bf16 (NOT fp16 — FLUX requires bf16)
mixed_precision = "bf16"
save_precision = "bf16"
full_bf16 = true

train_batch_size = 1
max_train_epochs = 20
gradient_checkpointing = true
gradient_accumulation_steps = 4

# FLUX LoRA parameters — lower LR than SDXL!
learning_rate = 1e-4
lr_scheduler = "cosine_with_restarts"
lr_warmup_steps = 100

# Network configuration
network_module = "networks.lora_flux"
network_dim = 16           # FLUX: smaller dim works well (16-64)
network_alpha = 16         # Set equal to network_dim

# FLUX-specific options
t5xxl_max_token_length = 512
apply_t5_attn_mask = true

# Optimizer — Adafactor works well for FLUX
optimizer_type = "adafactor"
optimizer_args = ["scale_parameter=False", "relative_step=False", "warmup_init=False"]

# Memory saving
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true

# Sampling during training (optional preview)
sample_every_n_epochs = 5
sample_prompts = "/workspace/sample_prompts.txt"
```

### FLUX LoRA Training Command

```bash
# Single GPU
accelerate launch train_network.py \
    --config_file flux_lora.toml \
    --network_module networks.lora_flux \
    --network_dim 16 \
    --network_alpha 16 \
    --mixed_precision bf16 \
    --full_bf16

# With explicit parameters (no toml)
accelerate launch train_network.py \
    --pretrained_model_name_or_path "black-forest-labs/FLUX.1-dev" \
    --train_data_dir "/workspace/dataset" \
    --output_dir "/workspace/output" \
    --output_name "my_flux_lora" \
    --network_module networks.lora_flux \
    --network_dim 16 \
    --network_alpha 16 \
    --learning_rate 1e-4 \
    --max_train_epochs 20 \
    --train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --mixed_precision bf16 \
    --full_bf16 \
    --optimizer_type adafactor \
    --cache_latents \
    --cache_text_encoder_outputs \
    --t5xxl_max_token_length 512 \
    --apply_t5_attn_mask \
    --save_every_n_epochs 5
```

### FLUX vs SDXL: Key Differences

| Parameter      | SDXL          | FLUX.1              |
| -------------- | ------------- | ------------------- |
| Learning Rate  | 1e-3 to 1e-4  | **1e-4 to 5e-5**    |
| Precision      | fp16 or bf16  | **bf16 REQUIRED**   |
| Network Module | networks.lora | networks.lora\_flux |
| Network Dim    | 32–128        | 8–64 (smaller)      |
| Optimizer      | AdamW8bit     | Adafactor           |
| Min VRAM       | 12GB          | 16–24GB             |
| Architecture   | U-Net         | DiT (Transformer)   |

### Learning Rate Guide for FLUX

```toml
# Conservative (safer, less chance of overfitting)
learning_rate = 5e-5

# Standard (good starting point)
learning_rate = 1e-4

# Aggressive (more expressive, risk of artifacts)
learning_rate = 2e-4
```

> **Tip:** FLUX is more sensitive to learning rate than SDXL. Start at `1e-4` and reduce to `5e-5` if you see quality issues. For SDXL, `1e-3` is common — avoid this for FLUX.

### Testing FLUX LoRA

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Load your trained LoRA
pipe.load_lora_weights("/workspace/output/my_flux_lora.safetensors")

image = pipe(
    prompt="a photo of sks person, professional portrait, studio lighting",
    num_inference_steps=28,
    guidance_scale=3.5,
    width=1024,
    height=1024,
).images[0]

image.save("flux_lora_test.png")
```

***

## Troubleshooting

### OOM Error

* Reduce batch size to 1
* Enable gradient checkpointing
* Use 8bit optimizer
* Lower resolution
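Reducing batch size need not change training dynamics: gradient accumulation keeps the effective batch size constant while per-step activation memory scales with the micro-batch. A sketch of the arithmetic (numbers are illustrative):

```python
# Effective batch size under gradient accumulation (illustrative numbers).
# Per-step activation memory scales with the micro-batch, not the effective batch.
target_effective_batch = 4

for micro_batch in (4, 2, 1):
    accum_steps = target_effective_batch // micro_batch
    print(f"batch_size={micro_batch}, accumulation={accum_steps} "
          f"-> effective batch {micro_batch * accum_steps}")
```

In Kohya terms: halving `--train_batch_size` while doubling `--gradient_accumulation_steps` trades wall-clock time for VRAM headroom.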

### Poor Results

* More/better training images
* Adjust learning rate
* Check captions match images
* Try different network rank

### Training Crashes

* Check CUDA version
* Update xformers
* Reduce batch size
* Check disk space

### FLUX-Specific Issues

* **"bf16 not supported"** — use an Ampere-or-newer GPU (RTX 30/40 series, A100)
* **OOM on FLUX.1-dev** — Switch to FLUX.1-schnell (needs 16GB) or enable `cache_text_encoder_outputs`
* **Blurry results** — Increase `network_dim` to 32–64, lower learning rate to `5e-5`
* **NaN loss** — Disable `full_bf16`, check your dataset for corrupted images
