# Text Generation WebUI

Run one of the most popular LLM web interfaces, with support for all major model formats.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## Why Text Generation WebUI?

* Supports GGUF, GPTQ, AWQ, EXL2, HF formats
* Built-in chat, notebook, and API modes
* Extensions: voice, characters, multimodal
* Fine-tuning support
* Model switching on the fly

## Requirements

| Model Size | Min VRAM | Recommended |
| ---------- | -------- | ----------- |
| 7B (Q4)    | 6GB      | RTX 3060    |
| 13B (Q4)   | 10GB     | RTX 3080    |
| 30B (Q4)   | 20GB     | RTX 4090    |
| 70B (Q4)   | 40GB     | A100        |
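The table above follows a simple rule of thumb: a Q4 quantization needs roughly 0.6 GB of weights per billion parameters, plus a couple of GB of overhead for the KV cache and CUDA buffers. A rough sketch (the constants are ballpark assumptions, not measured values):

```python
def estimate_q4_vram_gb(params_billion: float, overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate for a Q4-quantized model.

    Assumes ~0.6 GB of weights per billion parameters (Q4_K_M averages
    roughly 4.8 bits per weight) plus fixed overhead for KV cache and
    buffers. Ballpark figures only - measure with nvidia-smi.
    """
    return params_billion * 0.6 + overhead_gb

for size in (7, 13, 30, 70):
    print(f"{size}B (Q4): ~{estimate_q4_vram_gb(size):.0f} GB")
```

Longer contexts push the overhead up, so treat the result as a floor rather than a guarantee.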

## Quick Deploy

**Docker Image:**

```
atinoda/text-generation-webui:default-nvidia
```

**Ports:**

```
22/tcp
7860/http
5000/http
5005/http
```

**Environment:**

```
EXTRA_LAUNCH_ARGS=--listen --api
```

## Manual Installation

**Image:**

```
nvidia/cuda:12.1.0-devel-ubuntu22.04
```

**Ports:**

```
22/tcp
7860/http
5000/http
```

**Command:**

```bash
apt-get update && apt-get install -y git python3 python3-pip && \
cd /workspace && \
git clone https://github.com/oobabooga/text-generation-webui.git && \
cd text-generation-webui && \
pip install -r requirements.txt && \
python server.py --listen --api
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

## Access WebUI

1. Wait for deployment
2. Find port 7860 mapping in **My Orders**
3. Open: `http://<proxy>:<port>`

## Download Models

### From HuggingFace (in WebUI)

1. Go to **Model** tab
2. Enter model name: `bartowski/Meta-Llama-3.1-8B-Instruct-GGUF`
3. Click **Download**

### Via Command Line

```bash
cd /workspace/text-generation-webui

# Download GGUF model
python download-model.py bartowski/Meta-Llama-3.1-8B-Instruct-GGUF

# Download specific file
python download-model.py bartowski/Meta-Llama-3.1-8B-Instruct-GGUF --specific-file Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
```
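`download-model.py` pulls files from HuggingFace's standard `resolve` endpoint, so if you prefer `wget` or `curl` you can build the URL yourself. A small sketch using the repo and file from the example above:

```python
def hf_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Direct download URL for a file in a HuggingFace model repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"

url = hf_file_url(
    "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
)
print(url)  # feed this to wget/curl, saving into the models/ folder
```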

### Recommended Models

**For Chat:**

```bash
# Llama 3.1 Instruct (8B, fast)
python download-model.py bartowski/Meta-Llama-3.1-8B-Instruct-GGUF

# Mistral Instruct (excellent)
python download-model.py bartowski/Mistral-7B-Instruct-v0.3-GGUF

# OpenHermes (great all-rounder)
python download-model.py bartowski/OpenHermes-2.5-Mistral-7B-GGUF
```

**For Coding:**

```bash
# CodeLlama
python download-model.py bartowski/CodeLlama-13B-Instruct-GGUF

# DeepSeek Coder
python download-model.py bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF
```

**For Roleplay:**

```bash
# MythoMax
python download-model.py bartowski/MythoMax-L2-13B-GGUF
```

## Loading Models

### GGUF (Recommended for most users)

1. **Model** tab → Select model folder
2. **Model loader:** llama.cpp
3. Set **n-gpu-layers:**
   * RTX 3090: 35-40
   * RTX 4090: 45-50
   * A100: 80+
4. Click **Load**
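If you are unsure how many layers fit, divide free VRAM (minus some headroom) by the per-layer cost. The constants below are assumptions for a 7B Q4 model, not measured values:

```python
def suggest_gpu_layers(free_vram_gb: float, layer_gb: float = 0.15,
                       total_layers: int = 32, headroom_gb: float = 2.0) -> int:
    """Suggest an n-gpu-layers value from free VRAM.

    layer_gb and headroom_gb are rough assumptions for a 7B Q4 model;
    larger models have more and bigger layers, so adjust accordingly.
    """
    fits = int(max(0.0, free_vram_gb - headroom_gb) / layer_gb)
    return min(fits, total_layers)

print(suggest_gpu_layers(6))   # small card: partial offload
print(suggest_gpu_layers(24))  # RTX 3090/4090: everything on GPU
```

If the model still fails to load, lower the value and reload.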

### GPTQ (Fast, quantized)

1. Download GPTQ model
2. **Model loader:** ExLlama\_HF or AutoGPTQ
3. Load model

### EXL2 (Best speed)

```bash
# Install exllamav2
pip install exllamav2
```

1. Download EXL2 model
2. **Model loader:** ExLlamav2\_HF
3. Load

## Chat Configuration

### Character Setup

1. Go to **Parameters** → **Character**
2. Create or load character card
3. Set:
   * Name
   * Context/persona
   * Example dialogue

### Instruct Mode

For instruction-tuned models:

1. **Parameters** → **Instruction template**
2. Select template matching your model:
   * Llama-2-chat
   * Mistral
   * ChatML
   * Alpaca
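The template only controls how chat turns are wrapped before they reach the model. ChatML, for example (used by OpenHermes among others), frames each turn with `<|im_start|>`/`<|im_end|>` markers; a hand-rolled sketch of what the WebUI produces:

```python
def to_chatml(messages: list[dict]) -> str:
    """Render chat messages as a ChatML prompt ending in an open assistant turn."""
    out = ""
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    return out + "<|im_start|>assistant\n"

prompt = to_chatml([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi!"},
])
print(prompt)
```

Picking the wrong template usually shows up as the model answering for both sides of the conversation or ignoring the system prompt.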

## API Usage

### Enable API

Start the server with the `--api` flag (the API listens on port 5000 by default).

### OpenAI-compatible API

```python
from openai import OpenAI

# Point the client at the WebUI's OpenAI-compatible endpoint
# (replace localhost with your http_pub URL when connecting remotely)
client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="any",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```

### Completions API

On recent releases the legacy native endpoint (`/api/v1/generate`) has been replaced by the OpenAI-compatible API, so plain text completion also goes through port 5000:

```python
import requests

response = requests.post(
    "http://localhost:5000/v1/completions",
    json={
        "prompt": "Write a story about",
        "max_tokens": 200,
        "temperature": 0.7
    }
)
print(response.json()["choices"][0]["text"])
```

## Extensions

### Installing Extensions

```bash
cd /workspace/text-generation-webui/extensions

# Clone the community extensions index (links to third-party extensions)
git clone https://github.com/oobabooga/text-generation-webui-extensions

# silero_tts (voice) and superboogav2 (RAG/long-term memory) ship
# with the WebUI itself - just enable them in the Session tab
```

### Enable Extensions

1. **Session** tab → **Extensions**
2. Check boxes for desired extensions
3. Click **Apply and restart**

### Popular Extensions

| Extension         | Purpose             |
| ----------------- | ------------------- |
| silero\_tts       | Voice output        |
| whisper\_stt      | Voice input         |
| superbooga        | Document Q\&A       |
| sd\_api\_pictures | Image generation    |
| multimodal        | Image understanding |

## Performance Tuning

### GGUF Settings

```
n_gpu_layers: 35    # GPU layers (more = faster)
n_ctx: 4096         # Context length
n_batch: 512        # Batch size
threads: 8          # CPU threads
```

### Memory Optimization

For limited VRAM:

```bash
python server.py --listen --n-gpu-layers 20 --no-mmap
```

### Speed Optimization

```bash
# Use llama.cpp with cuBLAS
python server.py --listen --loader llama.cpp --n-gpu-layers 50 --threads 8
```

## Fine-tuning (LoRA)

### Training Tab

1. Go to **Training** tab
2. Load base model
3. Upload dataset (JSON format)
4. Configure:
   * LoRA rank: 8-32
   * Learning rate: 1e-4
   * Epochs: 3-5
5. Start training

### Dataset Format

```json
[
  {"instruction": "Summarize this:", "input": "Long text...", "output": "Summary..."},
  {"instruction": "Translate to French:", "input": "Hello", "output": "Bonjour"}
]
```
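Malformed records are a common cause of silent training failures, so it's worth validating the file before uploading. A minimal check (the key names match the format above):

```python
import json

REQUIRED = {"instruction", "input", "output"}

def invalid_records(records: list[dict]) -> list[int]:
    """Return the indices of records missing any required key."""
    return [i for i, r in enumerate(records) if not REQUIRED.issubset(r)]

data = json.loads("""[
  {"instruction": "Summarize this:", "input": "Long text...", "output": "Summary..."},
  {"instruction": "Translate to French:", "input": "Hello"}
]""")
print(invalid_records(data))  # the second record is missing "output"
```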

## Saving Your Work

```bash
# Save models
rsync -avz /workspace/text-generation-webui/models/ backup-server:/models/

# Save characters
rsync -avz /workspace/text-generation-webui/characters/ backup-server:/characters/

# Save LoRAs
rsync -avz /workspace/text-generation-webui/loras/ backup-server:/loras/
```

## Troubleshooting

### Model won't load

* Check VRAM usage: `nvidia-smi`
* Reduce `n_gpu_layers`
* Use smaller quantization (Q4\_K\_M → Q4\_K\_S)

### Slow generation

* Increase `n_gpu_layers`
* Use EXL2 instead of GGUF
* Enable `--no-mmap`

### Out of memory during generation

* Reduce `n_ctx` (context length)
* Use `--n-gpu-layers 0` for CPU-only inference
* Try a smaller model

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
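Budgeting a job is just hours times rate; a quick sketch using the approximate RTX 4090 figure from the table (rates change, so plug in the live price):

```python
def session_cost(hourly_rate_usd: float, hours: float) -> float:
    """Cost of renting at a fixed hourly rate."""
    return hourly_rate_usd * hours

rtx4090 = 0.10  # approximate rate from the table above, USD/hour
print(f"4-hour session: ${session_cost(rtx4090, 4):.2f}")
print(f"one week:       ${session_cost(rtx4090, 24 * 7):.2f}")
```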

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers
