# LitGPT

**LitGPT** is a high-performance library, built on PyTorch Lightning, for pretraining, finetuning, and deploying 20+ large language models. With 12K+ GitHub stars, it's a go-to toolkit for engineers who want clean, hackable LLM training code without the abstraction overhead of HuggingFace Transformers.

Each model in LitGPT is \~1,000 lines of clean PyTorch — no inheritance chains 10 levels deep, no magic. You can read the Llama 3 implementation end-to-end in an afternoon and modify it confidently.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

***

## What is LitGPT?

LitGPT provides production-ready implementations of state-of-the-art LLMs with a unified training interface:

* **20+ supported models** — Llama 3, Gemma 2, Mistral, Phi-3, Falcon, StableLM, and more
* **Pretrain from scratch** — full pretraining with Flash Attention, FSDP, and gradient checkpointing
* **Finetune efficiently** — full finetuning, LoRA, QLoRA, and Adapter methods
* **Serve with confidence** — built-in inference server with quantization
* **Multi-GPU support** — DDP, FSDP, tensor parallelism out of the box
* **Memory efficient** — 4-bit quantization, gradient checkpointing, activation checkpointing
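
Beyond the CLI, LitGPT also exposes a small Python API for loading and prompting models directly. A minimal sketch (sampling keywords mirror the CLI flags; `microsoft/phi-2` is just an example of an open, ungated model):

```python
from litgpt import LLM

# Downloads weights on first use, then loads the model
llm = LLM.load("microsoft/phi-2")

# Single-prompt generation
text = llm.generate(
    "What is GPU cloud computing?",
    max_new_tokens=100,
    temperature=0.7,
)
print(text)
```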

***

## Server Requirements

| Component | Minimum          | Recommended       |
| --------- | ---------------- | ----------------- |
| GPU       | RTX 3090 (24 GB) | A100 80 GB / H100 |
| VRAM      | 16 GB (7B LoRA)  | 80 GB+ (70B full) |
| RAM       | 32 GB            | 64 GB+            |
| CPU       | 8 cores          | 16+ cores         |
| Storage   | 100 GB           | 500 GB+           |
| OS        | Ubuntu 20.04+    | Ubuntu 22.04      |
| Python    | 3.10+            | 3.11              |
| CUDA      | 11.8+            | 12.1+             |

### VRAM Requirements by Task

| Task              | Model       | VRAM              |
| ----------------- | ----------- | ----------------- |
| Inference (4-bit) | Llama-3 8B  | \~6 GB            |
| LoRA finetune     | Llama-3 8B  | \~16 GB           |
| Full finetune     | Llama-3 8B  | \~80 GB           |
| LoRA finetune     | Llama-3 70B | \~48 GB (2×A100)  |
| Full finetune     | Llama-3 70B | \~640 GB (8×A100) |
| QLoRA finetune    | Llama-3 8B  | \~8 GB            |
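
These figures follow from simple bytes-per-parameter arithmetic. A rough sanity check for an 8B model (approximate; real usage adds activations, KV cache, and CUDA overhead on top):

```python
# Back-of-envelope VRAM math for an 8B-parameter model (GB = bytes / 1e9)
params = 8e9

print(params * 0.5 / 1e9)  # 4-bit weights: ~4 GB, matching ~6 GB with overhead
print(params * 2 / 1e9)    # bf16 weights: ~16 GB, what LoRA trains on top of

# Full finetuning with AdamW in mixed precision holds bf16 weights, bf16
# gradients, fp32 master weights, and two fp32 optimizer states per parameter:
print(params * (2 + 2 + 4 + 4 + 4) / 1e9)  # ~128 GB unsharded upper bound
```

Memory-saving techniques such as sharding, activation checkpointing, and 8-bit optimizer states are what bring full finetuning down toward the table's ~80 GB figure.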

***

## Ports

| Port | Service                 | Notes                           |
| ---- | ----------------------- | ------------------------------- |
| 22   | SSH                     | Terminal access & file transfer |
| 8000 | LitGPT Inference Server | REST API for model serving      |

***

## Quick Start with Docker

```bash
# Pull the official LitGPT image
docker pull pytorchlightning/litgpt:latest

# Run interactive container with GPU
docker run -it --gpus all \
  -p 8000:8000 \
  -v $(pwd)/checkpoints:/checkpoints \
  -v $(pwd)/data:/data \
  pytorchlightning/litgpt:latest \
  bash

# Or run a specific command directly
docker run --gpus all \
  -v $(pwd)/checkpoints:/checkpoints \
  pytorchlightning/litgpt:latest \
  litgpt download --repo_id meta-llama/Llama-3.2-3B-Instruct
```

***

## Installation on Clore.ai

### Step 1 — Rent a Server

1. Go to [Clore.ai Marketplace](https://clore.ai/marketplace)
2. Filter for **VRAM ≥ 24 GB** (RTX 3090 or better)
3. Choose a **PyTorch** or **CUDA 12.1** base image
4. Open ports **22** and **8000** in your order settings
5. Select **storage ≥ 200 GB** for model weights

### Step 2 — Connect via SSH

```bash
ssh root@<server-ip> -p <ssh-port>
```

### Step 3 — Install LitGPT

```bash
# Install via pip (recommended)
pip install litgpt

# With all extras (quantization, server, etc.)
pip install 'litgpt[all]'

# Or install from source for latest features
git clone https://github.com/Lightning-AI/litgpt.git
cd litgpt
pip install -e '.[all]'
```

### Step 4 — Verify Installation

```bash
litgpt --help
```

Expected output:

```
Usage: litgpt [OPTIONS] COMMAND [ARGS]...
  
Commands:
  chat       Chat with a model
  convert    Convert model weights
  download   Download model weights
  evaluate   Evaluate a model
  finetune   Finetune a model
  generate   Generate text
  pretrain   Pretrain a model
  serve      Serve a model for inference
```

***

## Downloading Models

LitGPT downloads models from Hugging Face:

```bash
# List available models
litgpt download --list

# Download Llama 3.2 3B (requires HF token for gated models)
litgpt download \
  --repo_id meta-llama/Llama-3.2-3B-Instruct \
  --checkpoint_dir checkpoints/

# Download Mistral 7B (open access)
litgpt download \
  --repo_id mistralai/Mistral-7B-Instruct-v0.3

# Download Gemma 2 2B
litgpt download \
  --repo_id google/gemma-2-2b-it \
  --access_token your-hf-token

# Download Phi-3 (small but powerful)
litgpt download \
  --repo_id microsoft/Phi-3-mini-4k-instruct
```

### Set HuggingFace Token

```bash
# For gated models (Llama, Gemma)
export HF_TOKEN=hf_your-token-here

# Or authenticate via CLI
pip install huggingface_hub
huggingface-cli login
```
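
The token can also be set programmatically, which is convenient in notebooks and setup scripts:

```python
from huggingface_hub import login

# Equivalent to `huggingface-cli login`; caches the token locally
login(token="hf_your-token-here")
```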

***

## Inference (Chat & Generate)

```bash
# Interactive chat
litgpt chat \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct

# Single generation
litgpt generate \
  --prompt "Explain GPU computing in simple terms" \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --max_new_tokens 200

# With temperature and sampling
litgpt generate \
  --prompt "Write a Python function to sort a list" \
  --checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.3 \
  --temperature 0.7 \
  --top_p 0.9 \
  --max_new_tokens 500
```

***

## Finetuning

### LoRA Finetuning (Recommended)

LoRA trains a small set of adapter parameters (typically 0.1–1% of total weights) while the base model stays frozen. Llama 3 8B LoRA on 10K examples takes \~2 hours on an RTX 3090 with `r=16`.
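
The parameter savings are easy to verify: a LoRA adapter replaces the update to a full weight matrix with the product of two thin matrices whose size grows with the rank `r`, not the layer dimensions. A quick count for an illustrative 4096×4096 projection:

```python
# LoRA learns delta_W = B @ A for a frozen d_out x d_in weight W,
# where A is (r x d_in) and B is (d_out x r); only A and B are trained.
d_in, d_out, r = 4096, 4096, 8

base_params = d_out * d_in          # 16,777,216 frozen weights
lora_params = r * d_in + d_out * r  # 65,536 trainable weights

print(f"{lora_params / base_params:.2%}")  # ~0.39% of the layer's weights
```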

```bash
# Prepare your dataset
# Format: JSON lines with {"instruction": "...", "input": "...", "output": "..."}
mkdir -p data
cat > data/train.json << 'EOF'
{"instruction": "What is GPU cloud computing?", "input": "", "output": "GPU cloud computing provides on-demand access to GPU hardware through the internet, enabling AI training and inference without owning physical hardware."}
{"instruction": "How do I rent a GPU on Clore.ai?", "input": "", "output": "Visit clore.ai/marketplace, filter by GPU specs, select a server, configure ports, and click rent. SSH access is provided immediately."}
EOF

# Finetune with LoRA
litgpt finetune lora \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --data JSON \
  --data.json_path data/train.json \
  --train.epochs 3 \
  --train.micro_batch_size 4 \
  --lora_r 8 \
  --lora_alpha 16 \
  --out_dir out/llama-lora-finetuned

# Monitor training
# LitGPT outputs logs with loss, learning rate, and ETA
```

### QLoRA (4-bit + LoRA)

Use QLoRA to finetune large models on limited VRAM. Llama 3.1 8B fits on a single RTX 3090 (24 GB):

```bash
litgpt finetune lora \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.1-8B-Instruct \
  --quantize bnb.nf4 \
  --train.epochs 3 \
  --train.micro_batch_size 2 \
  --lora_r 16 \
  --lora_alpha 32 \
  --out_dir out/llama-qlora
```

### Full Finetuning

```bash
litgpt finetune full \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --data JSON \
  --data.json_path data/train.json \
  --train.epochs 2 \
  --train.micro_batch_size 2 \
  --train.global_batch_size 16 \
  --out_dir out/llama-full-finetuned
```
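
Here `--train.global_batch_size 16` with `--train.micro_batch_size 2` means LitGPT accumulates gradients over several forward/backward passes before each optimizer step. The arithmetic, sketched for a single GPU:

```python
# Gradient accumulation: a small per-pass batch emulates a larger one
global_batch_size = 16  # samples per optimizer step
micro_batch_size = 2    # samples per forward/backward pass (fits in VRAM)
devices = 1

accumulation_steps = global_batch_size // (micro_batch_size * devices)
print(accumulation_steps)  # 8 passes per optimizer step
```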

### Multi-GPU Training

```bash
# Use FSDP across multiple GPUs
litgpt finetune full \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.1-8B-Instruct \
  --devices 4 \
  --strategy fsdp \
  --train.epochs 3 \
  --out_dir out/llama-multigpu
```

***

## Serving Models (REST API)

```bash
# Start inference server
litgpt serve \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --host 0.0.0.0 \
  --port 8000

# Test the API
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is the capital of France?",
    "max_new_tokens": 100,
    "temperature": 0.7
  }'
```

### Python Client

```python
import requests

response = requests.post(
    "http://<server-ip>:8000/predict",
    json={
        "prompt": "Explain reinforcement learning",
        "max_new_tokens": 500,
        "temperature": 0.8,
        "top_p": 0.9,
    },
    timeout=120,  # long generations can take a while
)
print(response.json()["output"])
```

***

## Pretraining from Scratch

For training a custom LLM from scratch on your own data:

```bash
# Prepare pretraining data (tokenized and chunked)
python scripts/prepare_redpajama.py \
  --source_path /data/raw_text \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --destination_path /data/tokenized

# Start pretraining
litgpt pretrain \
  --model_name Llama-3.2-3B \
  --data /data/tokenized \
  --train.micro_batch_size 4 \
  --train.max_tokens 10_000_000_000 \
  --devices 8 \
  --strategy fsdp \
  --out_dir out/my-pretrained-llm
```
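
The preparation step tokenizes raw text and packs the token ids into fixed-length blocks that the training loop can stream. A hypothetical sketch of the core idea (the `tokenizer` object and block size are placeholders, not LitGPT internals):

```python
import numpy as np

def pack_tokens(documents, tokenizer, block_size=2048):
    """Tokenize documents and pack the ids into contiguous fixed-length
    blocks, the layout pretraining data loaders typically expect."""
    ids = []
    for doc in documents:
        ids.extend(tokenizer.encode(doc))
        ids.append(tokenizer.eos_token_id)  # document separator
    n_blocks = len(ids) // block_size       # drop the ragged tail
    return np.array(
        ids[: n_blocks * block_size],
        dtype=np.int32,  # int32 to fit large vocabularies (e.g. 128K)
    ).reshape(n_blocks, block_size)
```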

***

## Converting and Exporting Models

```bash
# Merge LoRA weights into base model
litgpt merge_lora \
  --checkpoint_dir out/llama-lora-finetuned

# Convert to HuggingFace format for distribution
litgpt convert from_litgpt \
  --checkpoint_dir out/llama-lora-finetuned/final \
  --output_dir hf_model/

# Export to GGUF format (for Ollama/LlamaCpp)
# Use llama.cpp conversion script after HF export
python llama.cpp/convert.py hf_model/ --outfile model.gguf
```
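
After exporting, it's worth smoke-testing that the converted checkpoint loads in Transformers before distributing it (assumes `transformers` is installed and `hf_model/` also contains tokenizer files; copy them from the original checkpoint directory if the converter doesn't):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the exported checkpoint and run a tiny generation
model = AutoModelForCausalLM.from_pretrained("hf_model/")
tokenizer = AutoTokenizer.from_pretrained("hf_model/")

inputs = tokenizer("Hello, world", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```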

***

## Evaluating Models

```bash
# Run MMLU benchmark
litgpt evaluate \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --tasks mmlu \
  --num_fewshot 5

# Run multiple benchmarks
litgpt evaluate \
  --checkpoint_dir out/llama-lora-finetuned/final \
  --tasks "mmlu,hellaswag,truthfulqa_mc"
```

***

## Clore.ai GPU Recommendations

LitGPT covers three distinct workloads — inference, LoRA finetuning, and full pretraining — each with different GPU requirements.

| Workload                              | GPU            | VRAM  | Notes                                              |
| ------------------------------------- | -------------- | ----- | -------------------------------------------------- |
| Inference / chat (7–8B models)        | **RTX 3090**   | 24 GB | Fits Llama 3 8B in bf16; \~95 tok/s generation     |
| LoRA finetune (7–8B models)           | **RTX 3090**   | 24 GB | Budget pick; QLoRA keeps VRAM under 10 GB          |
| LoRA finetune (7–8B), fast iteration  | **RTX 4090**   | 24 GB | \~35% faster than 3090; reduces 2hr job to \~1.4hr |
| Full finetune (7B) or QLoRA (70B)     | **A100 40 GB** | 40 GB | 40 GB fits 7B full-precision or 70B 4-bit          |
| Full finetune (13B+) or pretrain runs | **A100 80 GB** | 80 GB | Highest throughput; \~2,800 tok/sec training on 8B |

**Recommended for most users:** RTX 3090 pair (2×24 GB = 48 GB effective with FSDP). Handles QLoRA on 70B models, or full finetune on 7B models with tensor parallelism. Cost on Clore.ai: \~$0.25/hr for two 3090s.

**For pretraining or >70B finetuning:** Use 4×A100 80GB with FSDP. LitGPT's FSDP integration handles sharding transparently — just pass `--devices 4 --strategy fsdp`.

***

## Troubleshooting

### CUDA Out of Memory

```bash
# Reduce batch size
--train.micro_batch_size 1

# Enable gradient checkpointing
--train.gradient_checkpointing true

# Use QLoRA instead of LoRA
--quantize bnb.nf4

# Check GPU memory
nvidia-smi
```

### Download fails / HuggingFace 401

```bash
# Set HF token
export HF_TOKEN=hf_your-token-here
huggingface-cli login

# Or pass directly
litgpt download \
  --repo_id meta-llama/Llama-3.2-3B-Instruct \
  --access_token hf_your-token
```

### Training loss doesn't decrease

```bash
# Check your data format — must be valid JSON Lines
python -c "
import json
with open('data/train.json') as f:
    for i, line in enumerate(f):
        json.loads(line)
        if i < 3: print(f'Line {i}: OK')
print('All lines valid')
"

# Reduce learning rate (flag name can vary between LitGPT versions)
--train.learning_rate 1e-5  # Default is often too high for small datasets

# Check data size — LoRA typically needs at least a few hundred examples
wc -l data/train.json
```

### Server port 8000 not accessible

```bash
# Verify server is listening
ss -tlnp | grep 8000

# Open firewall
ufw allow 8000/tcp

# Restart server with explicit host
litgpt serve \
  --checkpoint_dir checkpoints/... \
  --host 0.0.0.0 \
  --port 8000
```

### Multi-GPU training hangs

```bash
# Check NCCL connectivity
python -c "import torch; print(torch.cuda.device_count())"

# Try DDP instead of FSDP for smaller models
--strategy ddp

# Set NCCL environment variables
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1  # If InfiniBand is not available
```

***

## Useful Links

* **GitHub**: <https://github.com/Lightning-AI/litgpt> ⭐ 12K+
* **Documentation**: <https://lightning.ai/docs/litgpt>
* **PyTorch Lightning**: <https://lightning.ai>
* **HuggingFace Models**: <https://huggingface.co/models>
* **Discord**: <https://discord.gg/lightning-ai>
* **Clore.ai Marketplace**: <https://clore.ai/marketplace>
