# LitGPT

**LitGPT** is a high-performance library, built on PyTorch Lightning, for pretraining, finetuning, and deploying 20+ large language models. With 12K+ GitHub stars, it's a go-to toolkit for engineers who want clean, hackable LLM training code without the abstraction overhead of HuggingFace Transformers.

Each model in LitGPT is \~1,000 lines of clean PyTorch — no inheritance chains 10 levels deep, no magic. You can read the Llama 3 implementation end-to-end in an afternoon and modify it confidently.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

***

## What is LitGPT?

LitGPT provides production-ready implementations of state-of-the-art LLMs with a unified training interface:

* **20+ supported models** — Llama 3, Gemma 2, Mistral, Phi-3, Falcon, StableLM, and more
* **Pretrain from scratch** — full pretraining with Flash Attention, FSDP, and gradient checkpointing
* **Finetune efficiently** — full finetuning, LoRA, QLoRA, and Adapter methods
* **Serve with confidence** — built-in inference server with quantization
* **Multi-GPU support** — DDP, FSDP, tensor parallelism out of the box
* **Memory efficient** — 4-bit quantization, gradient checkpointing, activation checkpointing

***

## Server Requirements

| Component | Minimum          | Recommended       |
| --------- | ---------------- | ----------------- |
| GPU       | RTX 3090 (24 GB) | A100 80 GB / H100 |
| VRAM      | 16 GB (7B LoRA)  | 80 GB+ (70B full) |
| RAM       | 32 GB            | 64 GB+            |
| CPU       | 8 cores          | 16+ cores         |
| Storage   | 100 GB           | 500 GB+           |
| OS        | Ubuntu 20.04+    | Ubuntu 22.04      |
| Python    | 3.10+            | 3.11              |
| CUDA      | 11.8+            | 12.1+             |

### VRAM Requirements by Task

| Task              | Model       | VRAM              |
| ----------------- | ----------- | ----------------- |
| Inference (4-bit) | Llama-3 8B  | \~6 GB            |
| QLoRA finetune    | Llama-3 8B  | \~8 GB            |
| LoRA finetune     | Llama-3 8B  | \~16 GB           |
| Full finetune     | Llama-3 8B  | \~80 GB           |
| LoRA finetune     | Llama-3 70B | \~48 GB (2×A100)  |
| Full finetune     | Llama-3 70B | \~640 GB (8×A100) |
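
These figures follow a simple rule of thumb: weights cost `params × bytes per param` (2 bytes in bf16, \~0.5 bytes for 4-bit NF4), and full finetuning with AdamW adds gradients plus two optimizer states on top. A minimal sketch of the arithmetic (the helper and its fixed overhead constant are illustrative assumptions, not part of LitGPT); it gives a crude upper bound, and real runs come in lower with activation checkpointing and sharded or lower-precision optimizer states:

```python
def estimate_vram_gb(params_b: float, bytes_per_param: float = 2.0,
                     full_finetune: bool = False) -> float:
    """Back-of-envelope VRAM estimate in GB (illustrative, not a LitGPT API)."""
    weights = params_b * bytes_per_param       # model weights
    extra = 0.0
    if full_finetune:
        extra += params_b * 2.0                # gradients in bf16
        extra += params_b * 8.0                # AdamW m and v states in fp32
    overhead = 2.0                             # activations, KV cache, CUDA context (rough)
    return weights + extra + overhead

print(f"8B inference, bf16:  ~{estimate_vram_gb(8):.0f} GB")        # ~18 GB
print(f"8B inference, 4-bit: ~{estimate_vram_gb(8, 0.5):.0f} GB")   # ~6 GB
print(f"8B full finetune:    ~{estimate_vram_gb(8, 2.0, True):.0f} GB upper bound")
```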

***

## Ports

| Port | Service                 | Notes                           |
| ---- | ----------------------- | ------------------------------- |
| 22   | SSH                     | Terminal access & file transfer |
| 8000 | LitGPT Inference Server | REST API for model serving      |

***

## Quick Start with Docker

```bash
# Pull the official LitGPT image
docker pull pytorchlightning/litgpt:latest

# Run interactive container with GPU
docker run -it --gpus all \
  -p 8000:8000 \
  -v $(pwd)/checkpoints:/checkpoints \
  -v $(pwd)/data:/data \
  pytorchlightning/litgpt:latest \
  bash

# Or run a specific command directly
docker run --gpus all \
  -v $(pwd)/checkpoints:/checkpoints \
  pytorchlightning/litgpt:latest \
  litgpt download --repo_id meta-llama/Llama-3.2-3B-Instruct
```

***

## Installation on Clore.ai

### Step 1 — Rent a Server

1. Go to [Clore.ai Marketplace](https://clore.ai/marketplace)
2. Filter for **VRAM ≥ 24 GB** (RTX 3090 or better)
3. Choose a **PyTorch** or **CUDA 12.1** base image
4. Open ports **22** and **8000** in your order settings
5. Select **storage ≥ 200 GB** for model weights

### Step 2 — Connect via SSH

```bash
ssh root@<server-ip> -p <ssh-port>
```

### Step 3 — Install LitGPT

```bash
# Install via pip (recommended)
pip install litgpt

# With all extras (quantization, server, etc.)
pip install 'litgpt[all]'

# Or install from source for latest features
git clone https://github.com/Lightning-AI/litgpt.git
cd litgpt
pip install -e '.[all]'
```

### Step 4 — Verify Installation

```bash
litgpt --help
```

Expected output:

```
Usage: litgpt [OPTIONS] COMMAND [ARGS]...
  
Commands:
  chat       Chat with a model
  convert    Convert model weights
  download   Download model weights
  evaluate   Evaluate a model
  finetune   Finetune a model
  generate   Generate text
  pretrain   Pretrain a model
  serve      Serve a model for inference
```

***

## Downloading Models

LitGPT downloads models from Hugging Face:

```bash
# List available models
litgpt download --list

# Download Llama 3.2 3B (requires HF token for gated models)
litgpt download \
  --repo_id meta-llama/Llama-3.2-3B-Instruct \
  --checkpoint_dir checkpoints/

# Download Mistral 7B (open access)
litgpt download \
  --repo_id mistralai/Mistral-7B-Instruct-v0.3

# Download Gemma 2 2B
litgpt download \
  --repo_id google/gemma-2-2b-it \
  --access_token your-hf-token

# Download Phi-3 (small but powerful)
litgpt download \
  --repo_id microsoft/Phi-3-mini-4k-instruct
```
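
After a download finishes, you can sanity-check that the checkpoint directory is complete before starting a long job. A small sketch (the file names match recent LitGPT releases; older versions may lay out checkpoints differently):

```python
# Sanity-check that a download completed before launching a long job
# (file names match recent LitGPT releases; adjust for your version)
from pathlib import Path

ckpt = Path("checkpoints/meta-llama/Llama-3.2-3B-Instruct")
for name in ["lit_model.pth", "model_config.yaml"]:
    path = ckpt / name
    status = f"{path.stat().st_size / 1e9:.2f} GB" if path.exists() else "MISSING"
    print(f"{name}: {status}")
```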

### Set HuggingFace Token

```bash
# For gated models (Llama, Gemma)
export HF_TOKEN=hf_your-token-here

# Or authenticate via CLI
pip install huggingface_hub
huggingface-cli login
```
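
To confirm the token works before kicking off a multi-gigabyte download, you can ask the Hub which account it belongs to, using the official `huggingface_hub` client:

```python
# Verify the HF token resolves to your account before a large download
import os
from huggingface_hub import whoami

info = whoami(token=os.environ["HF_TOKEN"])
print(f"Authenticated as: {info['name']}")
```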

***

## Inference (Chat & Generate)

```bash
# Interactive chat
litgpt chat \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct

# Single generation
litgpt generate \
  --prompt "Explain GPU computing in simple terms" \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --max_new_tokens 200

# With temperature and sampling
litgpt generate \
  --prompt "Write a Python function to sort a list" \
  --checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.3 \
  --temperature 0.7 \
  --top_p 0.9 \
  --max_new_tokens 500
```
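
Recent LitGPT releases also expose a small Python API for the same thing; a minimal sketch (check the LitGPT documentation for the exact `generate` signature in your installed version):

```python
# Minimal sketch of LitGPT's Python API (available in recent releases)
from litgpt import LLM

llm = LLM.load("checkpoints/meta-llama/Llama-3.2-3B-Instruct")
text = llm.generate(
    "Explain GPU computing in simple terms",
    max_new_tokens=200,
    temperature=0.7,
)
print(text)
```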

***

## Finetuning

### LoRA Finetuning (Recommended)

LoRA trains a small set of adapter parameters (typically 0.1–1% of total weights) while the base model stays frozen. Llama 3 8B LoRA on 10K examples takes \~2 hours on an RTX 3090 with `r=16`.
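
The 0.1–1% figure follows from the shapes involved: for a linear layer with weight shape `d_out × d_in`, a rank-`r` LoRA adds matrices `A (r × d_in)` and `B (d_out × r)`, i.e. `r·(d_in + d_out)` trainable parameters. A quick back-of-envelope (layer counts and dimensions are illustrative round numbers):

```python
# Rough LoRA trainable-parameter count for the attention projections of a
# Llama-style model (dimensions are illustrative round numbers, ignoring GQA)
d_model, n_layers, r = 4096, 32, 16

# Each adapted (d_out x d_in) linear gains r * (d_in + d_out) parameters
per_layer = r * (d_model + d_model)        # e.g. a square q_proj
lora_params = per_layer * 4 * n_layers     # q, k, v, o projections per block

print(f"LoRA params: {lora_params / 1e6:.1f}M "
      f"(~{lora_params / 8e9 * 100:.2f}% of an 8B model)")
```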

```bash
# Prepare your dataset
# Format: JSON Lines with {"instruction": "...", "input": "...", "output": "..."}
mkdir -p data
cat > data/train.json << 'EOF'
{"instruction": "What is GPU cloud computing?", "input": "", "output": "GPU cloud computing provides on-demand access to GPU hardware through the internet, enabling AI training and inference without owning physical hardware."}
{"instruction": "How do I rent a GPU on Clore.ai?", "input": "", "output": "Visit clore.ai/marketplace, filter by GPU specs, select a server, configure ports, and click rent. SSH access is provided immediately."}
EOF

# Finetune with LoRA
litgpt finetune lora \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --data JSON \
  --data.json_path data/train.json \
  --train.epochs 3 \
  --train.micro_batch_size 4 \
  --lora_r 8 \
  --lora_alpha 16 \
  --out_dir out/llama-lora-finetuned

# Monitor training
# LitGPT outputs logs with loss, learning rate, and ETA
```
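
For more than a handful of examples, building the file programmatically is less error-prone than pasting a heredoc. A small sketch that writes the same JSON Lines format shown above (the record list is illustrative):

```python
# Write instruction-tuning records in the JSON Lines format LitGPT expects
import json

records = [
    {
        "instruction": "What is GPU cloud computing?",
        "input": "",
        "output": "GPU cloud computing provides on-demand access to GPU "
                  "hardware through the internet.",
    },
    # ... extend with your own examples
]

with open("data/train.json", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")  # one JSON object per line

print(f"Wrote {len(records)} examples")
```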

### QLoRA (4-bit + LoRA)

Use QLoRA to finetune larger models on limited VRAM. Llama 3.1 8B fits on a single 24 GB RTX 3090:

```bash
litgpt finetune lora \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.1-8B-Instruct \
  --quantize bnb.nf4 \
  --train.epochs 3 \
  --train.micro_batch_size 2 \
  --lora_r 16 \
  --lora_alpha 32 \
  --out_dir out/llama-qlora
```

### Full Finetuning

```bash
litgpt finetune full \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --data JSON \
  --data.json_path data/train.json \
  --train.epochs 2 \
  --train.micro_batch_size 2 \
  --train.global_batch_size 16 \
  --out_dir out/llama-full-finetuned
```

### Multi-GPU Training

```bash
# Use FSDP across multiple GPUs
litgpt finetune full \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.1-8B-Instruct \
  --devices 4 \
  --strategy fsdp \
  --train.epochs 3 \
  --out_dir out/llama-multigpu
```

***

## Serving Models (REST API)

```bash
# Start inference server
litgpt serve \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --host 0.0.0.0 \
  --port 8000

# Test the API
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is the capital of France?",
    "max_new_tokens": 100,
    "temperature": 0.7
  }'
```

### Python Client

```python
import requests

response = requests.post(
    "http://<server-ip>:8000/predict",
    json={
        "prompt": "Explain reinforcement learning",
        "max_new_tokens": 500,
        "temperature": 0.8,
        "top_p": 0.9,
    },
    timeout=120,  # long generations can take a while
)
response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
print(response.json()["output"])
```

***

## Pretraining from Scratch

For training a custom LLM from scratch on your own data:

```bash
# Prepare pretraining data (tokenized and chunked)
python scripts/prepare_redpajama.py \
  --source_path /data/raw_text \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --destination_path /data/tokenized

# Start pretraining
litgpt pretrain \
  --model_name Llama-3.2-3B \
  --data /data/tokenized \
  --train.micro_batch_size 4 \
  --train.max_tokens 10_000_000_000 \
  --devices 8 \
  --strategy fsdp \
  --out_dir out/my-pretrained-llm
```
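
Before committing to a long run, sanity-check the compute budget with the standard \~6·N·D FLOPs rule of thumb (N = parameters, D = training tokens). The sustained-throughput figure below is an illustrative assumption; measure your own cluster:

```python
# Back-of-envelope pretraining budget using the ~6 * N * D FLOPs rule of thumb
n_params = 3e9          # 3B-parameter model
n_tokens = 10e9         # matches --train.max_tokens 10_000_000_000
flops = 6 * n_params * n_tokens

# Assume ~150 TFLOP/s sustained per A100 in bf16 (illustrative; measure yours)
sustained_per_gpu = 0.15e15
gpus = 8
hours = flops / (sustained_per_gpu * gpus) / 3600
print(f"~{flops:.1e} FLOPs, roughly {hours:.0f} hours on {gpus} GPUs")
```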

***

## Converting and Exporting Models

```bash
# Merge LoRA weights into base model
litgpt merge_lora \
  --checkpoint_dir out/llama-lora-finetuned

# Convert to HuggingFace format for distribution
litgpt convert to_hf \
  --checkpoint_dir out/llama-lora-finetuned/final \
  --output_dir hf_model/

# Export to GGUF format (for Ollama / llama.cpp)
# Use the llama.cpp conversion script after HF export
python llama.cpp/convert_hf_to_gguf.py hf_model/ --outfile model.gguf
```

***

## Evaluating Models

```bash
# Run MMLU benchmark
litgpt evaluate \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --tasks mmlu \
  --num_fewshot 5

# Run multiple benchmarks
litgpt evaluate \
  --checkpoint_dir out/llama-lora-finetuned/final \
  --tasks "mmlu,hellaswag,truthfulqa_mc"
```

***

## Clore.ai GPU Recommendations

LitGPT covers three distinct workloads — inference, LoRA finetuning, and full pretraining — each with different GPU requirements.

| Workload                              | GPU            | VRAM  | Notes                                              |
| ------------------------------------- | -------------- | ----- | -------------------------------------------------- |
| Inference / chat (7–8B models)        | **RTX 3090**   | 24 GB | Fits Llama 3 8B in bf16; \~95 tok/s generation     |
| LoRA finetune (7–8B models)           | **RTX 3090**   | 24 GB | Budget pick; QLoRA keeps VRAM under 10 GB          |
| LoRA finetune (7–8B), fast iteration  | **RTX 4090**   | 24 GB | \~35% faster than 3090; reduces 2hr job to \~1.4hr |
| Full finetune (7B) or QLoRA (70B)     | **A100 40 GB** | 40 GB | 40 GB fits 7B full-precision or 70B 4-bit          |
| Full finetune (13B+) or pretrain runs | **A100 80 GB** | 80 GB | Highest throughput; \~2,800 tok/sec training on 8B |

**Recommended for most users:** RTX 3090 pair (2×24 GB = 48 GB effective with FSDP). Handles QLoRA on 70B models, or full finetune on 7B models with tensor parallelism. Cost on Clore.ai: \~$0.25/hr for two 3090s.

**For pretraining or >70B finetuning:** Use 4×A100 80GB with FSDP. LitGPT's FSDP integration handles sharding transparently — just pass `--devices 4 --strategy fsdp`.

***

## Troubleshooting

### CUDA Out of Memory

```bash
# Reduce batch size
--train.micro_batch_size 1

# Enable gradient checkpointing
--train.gradient_checkpointing true

# Use QLoRA instead of LoRA
--quantize bnb.nf4

# Check GPU memory
nvidia-smi
```

### Download fails / HuggingFace 401

```bash
# Set HF token
export HF_TOKEN=hf_your-token-here
huggingface-cli login

# Or pass directly
litgpt download \
  --repo_id meta-llama/Llama-3.2-3B-Instruct \
  --access_token hf_your-token
```

### Training loss doesn't decrease

```bash
# Check your data format — must be valid JSON Lines
python -c "
import json
with open('data/train.json') as f:
    for i, line in enumerate(f):
        json.loads(line)
        if i < 3: print(f'Line {i}: OK')
print('All lines valid')
"

# Reduce learning rate
--train.lr 1e-5  # Default is often too high for small datasets

# Check data size — LoRA typically needs at least a few hundred examples
wc -l data/train.json
```

### Server port 8000 not accessible

```bash
# Verify server is listening
ss -tlnp | grep 8000

# Open firewall
ufw allow 8000/tcp

# Restart server with explicit host
litgpt serve \
  --checkpoint_dir checkpoints/... \
  --host 0.0.0.0 \
  --port 8000
```
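
If the server is listening but curl from outside still fails, a plain TCP connect from your local machine separates a network problem from an HTTP one (replace the placeholders with your Clore.ai server's public IP and mapped port):

```python
# Test raw TCP reachability of the inference port from your local machine
import socket

host, port = "<server-ip>", 8000  # your server's public IP and mapped port
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(5)
try:
    sock.connect((host, port))
    print("Port reachable; debug the HTTP layer next")
except OSError as exc:
    print(f"Cannot connect: {exc}; check firewall rules and port mapping")
finally:
    sock.close()
```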

### Multi-GPU training hangs

```bash
# Check that all GPUs are visible (NCCL hangs often show up as missing devices)
python -c "import torch; print(torch.cuda.device_count())"

# Try DDP instead of FSDP for smaller models
--strategy ddp

# Set NCCL environment variables
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1  # If InfiniBand is not available
```

***

## Useful Links

* **GitHub**: <https://github.com/Lightning-AI/litgpt> ⭐ 12K+
* **Documentation**: <https://lightning.ai/docs/litgpt>
* **PyTorch Lightning**: <https://lightning.ai>
* **HuggingFace Models**: <https://huggingface.co/models>
* **Discord**: <https://discord.gg/lightning-ai>
* **Clore.ai Marketplace**: <https://clore.ai/marketplace>

