# SGLang

SGLang (Structured Generation Language) is a high-performance LLM serving framework developed by the LMSYS team, known for their work on Vicuna and Chatbot Arena. It features RadixAttention for KV cache sharing, efficient MoE (Mixture of Experts) support, and an OpenAI-compatible API — making it one of the fastest open-source inference engines available on CLORE.AI GPU servers.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter | Minimum                    | Recommended          |
| --------- | -------------------------- | -------------------- |
| RAM       | 16 GB                      | 32 GB+               |
| VRAM      | 8 GB                       | 24 GB+               |
| Disk      | 50 GB                      | 200 GB+              |
| GPU       | NVIDIA Turing+ (RTX 2000+) | A100, H100, RTX 4090 |

{% hint style="info" %}
SGLang achieves best performance on Ampere+ GPUs with FlashInfer enabled. For MoE models like Mixtral or DeepSeek, multi-GPU setups are recommended.
{% endhint %}

## Quick Deploy on CLORE.AI

**Docker Image:** `lmsysorg/sglang:latest`

**Ports:** `22/tcp`, `30000/http`

**Environment Variables:**

| Variable               | Example     | Description                        |
| ---------------------- | ----------- | ---------------------------------- |
| `HF_TOKEN`             | `hf_xxx...` | HuggingFace token for gated models |
| `CUDA_VISIBLE_DEVICES` | `0,1`       | GPUs to use                        |

## Step-by-Step Setup

### 1. Rent a GPU Server on CLORE.AI

Visit [CLORE.AI Marketplace](https://clore.ai/marketplace) and select a server:

* **7B models**: 16 GB VRAM minimum (RTX 4080, A10)
* **13B models**: 24 GB VRAM (RTX 3090, RTX 4090, A5000)
* **70B models**: 80 GB+ VRAM (A100 80GB) or multi-GPU
* **MoE models (Mixtral 8x7B)**: 48 GB VRAM or 2× 24 GB

### 2. SSH into Your Server

```bash
ssh -p <PORT> root@<SERVER_IP>
```

### 3. Pull SGLang Docker Image

```bash
docker pull lmsysorg/sglang:latest
```

### 4. Launch SGLang Server

**Basic launch (Llama 3.1 8B):**

```bash
docker run -d \
  --name sglang \
  --gpus all \
  --shm-size 16g \
  --ipc host \
  -p 30000:30000 \
  -v /root/models:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```

**With HuggingFace token:**

```bash
docker run -d \
  --name sglang \
  --gpus all \
  --shm-size 16g \
  --ipc host \
  -p 30000:30000 \
  -v /root/models:/root/.cache/huggingface \
  -e HF_TOKEN=hf_your_token_here \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --dtype bfloat16
```

**Qwen2.5 72B on multi-GPU:**

```bash
docker run -d \
  --name sglang \
  --gpus all \
  --shm-size 32g \
  --ipc host \
  -p 30000:30000 \
  -v /root/models:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-72B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --tp 2 \
    --dtype bfloat16
```

**DeepSeek-V2 (MoE model):**

```bash
docker run -d \
  --name sglang \
  --gpus all \
  --shm-size 32g \
  --ipc host \
  -p 30000:30000 \
  -v /root/models:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V2-Lite-Chat \
    --host 0.0.0.0 \
    --port 30000 \
    --trust-remote-code \
    --tp 1
```

### 5. Check Server Health

```bash
# View logs
docker logs -f sglang

# Health check (wait ~2-3 minutes for model to load)
curl http://localhost:30000/health

# Get model info
curl http://localhost:30000/get_model_info
```
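In a deployment script, the health check above can be automated. A stdlib-only sketch that polls `/health` until the model finishes loading (the endpoint and typical 2–3 minute load time are as shown above; the function name is ours):

```python
import time
import urllib.request


def wait_for_server(base_url: str, timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Poll SGLang's /health endpoint until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # server not listening yet, or model still loading
        time.sleep(interval)
    return False


if __name__ == "__main__":
    if wait_for_server("http://localhost:30000"):
        print("server ready")
    else:
        print("timed out waiting for server")
```

The same function works against the CLORE.AI `http_pub` URL from step 6 by swapping the base URL.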

### 6. Access from Outside via CLORE.AI Proxy

Your CLORE.AI dashboard provides an `http_pub` URL for port 30000:

```
https://<order-id>-30000.clore.ai/
```

Use this URL as your base URL in any OpenAI-compatible client.

***

## Usage Examples

### Example 1: OpenAI-Compatible Chat Completions

```bash
curl http://localhost:30000/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant."},
      {"role": "user", "content": "Write a quicksort implementation in Python."}
    ],
    "max_tokens": 512,
    "temperature": 0.2
  }'
```

### Example 2: Streaming Response

```bash
curl http://localhost:30000/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain how transformer attention works."}
    ],
    "max_tokens": 800,
    "stream": true
  }' \
  --no-buffer
```

### Example 3: Python OpenAI Client

```python
from openai import OpenAI

# Point to your CLORE.AI SGLang server
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="none",  # SGLang doesn't require auth by default
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a data science expert."},
        {"role": "user", "content": "What is gradient boosting?"},
    ],
    max_tokens=400,
    temperature=0.7,
)

print(response.choices[0].message.content)
```
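Streaming (as in Example 2) can also be consumed from Python without extra dependencies by parsing the `data:` SSE lines by hand. A stdlib sketch, assuming the OpenAI-style SSE framing SGLang's `/v1/chat/completions` emits (helper names are ours):

```python
import json
import urllib.request


def parse_sse_chunk(line: str) -> str:
    """Extract the text delta from one 'data: {...}' SSE line, or '' if none."""
    if not line.startswith("data: "):
        return ""
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return ""
    delta = json.loads(payload)["choices"][0].get("delta", {})
    return delta.get("content") or ""


def stream_chat(base_url: str, model: str, prompt: str) -> None:
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 800,
        "stream": True,  # server sends incremental SSE chunks
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # one SSE line per iteration
            print(parse_sse_chunk(raw.decode()), end="", flush=True)
    print()


if __name__ == "__main__":
    stream_chat("http://localhost:30000",
                "meta-llama/Meta-Llama-3.1-8B-Instruct",
                "Explain how transformer attention works.")
```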

### Example 4: Batch Inference with SGLang Native API

SGLang's native API provides additional control:

```python
import requests

# Generate completions
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The future of AI is",
        "sampling_params": {
            "max_new_tokens": 200,
            "temperature": 0.8,
            "top_p": 0.95,
        },
    },
)
print(response.json()["text"])
```
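For true batch inference, the `text` field of `/generate` can also be a list of prompts, which the scheduler processes together (a sketch assuming the list form of the native endpoint; the helper names are ours, and with list input the response is a list of result objects):

```python
import json
import urllib.request


def build_batch_request(prompts: list[str], max_new_tokens: int = 200) -> dict:
    """Payload for SGLang's native /generate endpoint. 'text' may be a single
    string or, for batched inference, a list of prompts (assumption: list form)."""
    return {
        "text": prompts,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": 0.8,
            "top_p": 0.95,
        },
    }


def generate_batch(base_url: str, prompts: list[str]) -> list[str]:
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(build_batch_request(prompts)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # one result object per input prompt
        return [item["text"] for item in json.loads(resp.read())]


if __name__ == "__main__":
    outputs = generate_batch(
        "http://localhost:30000",
        ["The future of AI is", "The best GPU for inference is"],
    )
    for out in outputs:
        print(out)
```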

### Example 5: Constrained JSON Output

SGLang supports structured output generation:

```python
import requests

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "city": {"type": "string"},
    },
    "required": ["name", "age", "city"],
}

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Extract information: John Smith, 35 years old, lives in New York.",
        "sampling_params": {
            "max_new_tokens": 100,
            "temperature": 0.0,
        },
        "json_schema": schema,
    },
)
print(response.json()["text"])
# Output: {"name": "John Smith", "age": 35, "city": "New York"}
```
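Besides JSON schemas, SGLang's sampling params also accept a `regex` constraint for free-form structured decoding (assumption: the regex-constrained decoding feature of the native API; the helper name is ours). A sketch that forces the answer to be a four-digit year:

```python
import json
import urllib.request


def build_regex_request(prompt: str, pattern: str) -> dict:
    """Payload whose decoded output must match the given regex
    (assumption: 'regex' key in SGLang sampling_params)."""
    return {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": 32,
            "temperature": 0.0,
            "regex": pattern,  # constrains generation to this pattern
        },
    }


if __name__ == "__main__":
    payload = build_regex_request(
        "What year did the transformer paper come out? Answer: ",
        r"(19|20)\d{2}",
    )
    req = urllib.request.Request(
        "http://localhost:30000/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["text"])
```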

***

## Configuration

### Key Launch Parameters

| Parameter               | Default       | Description                            |
| ----------------------- | ------------- | -------------------------------------- |
| `--model-path`          | required      | HuggingFace model ID or local path     |
| `--host`                | `127.0.0.1`   | Bind host (use `0.0.0.0` for external) |
| `--port`                | `30000`       | Server port                            |
| `--tp`                  | `1`           | Tensor parallelism degree (num GPUs)   |
| `--dp`                  | `1`           | Data parallelism degree                |
| `--dtype`               | `auto`        | `float16`, `bfloat16`, `float32`       |
| `--mem-fraction-static` | `0.88`        | Fraction of VRAM for model weights + KV cache |
| `--max-prefill-tokens`  | auto          | Max tokens in one prefill step         |
| `--context-length`      | model max     | Override maximum context length        |
| `--trust-remote-code`   | false         | Allow custom model code                |
| `--quantization`        | none          | `awq`, `gptq`, `fp8`                   |
| `--load-format`         | `auto`        | `auto`, `pt`, `safetensors`            |
| `--tokenizer-path`      | same as model | Custom tokenizer path                  |

### Quantization Options

**AWQ (recommended for speed):**

```bash
python3 -m sglang.launch_server \
  --model-path casperhansen/mistral-7b-instruct-v0.2-awq \
  --quantization awq \
  --host 0.0.0.0 \
  --port 30000
```

**FP8 (for H100/A100):**

```bash
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --host 0.0.0.0 \
  --port 30000
```

***

## Performance Tips

### 1. RadixAttention — The Key Advantage

SGLang's RadixAttention automatically reuses KV cache for shared prompt prefixes. This is especially powerful for:

* Chatbots with long system prompts
* RAG applications with repeated context
* Batch API calls sharing the same prefix

No extra configuration needed — it's always enabled.
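A quick way to observe the effect: send several requests that share the same long system prompt and time them. After the first call, the shared prefix's KV cache is already in the radix tree, so later prefills are cheaper (the speedup varies with prompt length and hardware; the model name matches the launch example above, and the helper names are ours):

```python
import json
import time
import urllib.request

# Long system prompt shared by every request — the prefix RadixAttention reuses
SYSTEM = "You are a support bot. " + "Policy line. " * 200


def make_payload(user_msg: str) -> dict:
    """Chat request with an identical system prompt across calls."""
    return {
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user_msg},
        ],
        "max_tokens": 64,
    }


def timed_request(base_url: str, payload: dict) -> float:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    urllib.request.urlopen(req).read()
    return time.monotonic() - start


if __name__ == "__main__":
    # The first request pays the full prefill; later ones reuse the prefix cache
    for q in ["How do I reset my password?", "What is the refund policy?"]:
        print(f"{timed_request('http://localhost:30000', make_payload(q)):.2f}s")
```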

### 2. Increase KV Cache Size

```bash
--mem-fraction-static 0.90  # Use 90% of VRAM for model weights + KV cache
```

Be careful not to go too high: this fraction covers model weights as well as the KV cache pool, so leave headroom for activations and CUDA graphs.

### 3. Chunked Prefill for Long Contexts

```bash
--chunked-prefill-size 4096  # Process long prompts in chunks
```

### 4. Enable FlashInfer Backend

SGLang automatically uses FlashInfer when available (Ampere+ GPUs):

```bash
--attention-backend flashinfer
```

### 5. Multi-GPU Tensor Parallelism

For models that don't fit on a single GPU:

```bash
--tp 4  # Use 4 GPUs
```

Each GPU must have enough VRAM for a shard of the model.
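For a rough sizing check before picking `--tp`, the per-GPU weight footprint can be estimated from parameter count and precision (the 1.2× overhead factor is our assumption for activations and fragmentation; KV cache comes on top of this):

```python
def vram_per_gpu_gb(n_params_b: float, bytes_per_param: float = 2.0,
                    tp: int = 1, overhead: float = 1.2) -> float:
    """Rough per-GPU VRAM (GB) for the weight shard alone.
    bytes_per_param: 2 for fp16/bf16, 1 for 8-bit, 0.5 for 4-bit quantization.
    overhead: fudge factor for activations/fragmentation (our assumption)."""
    return n_params_b * bytes_per_param * overhead / tp


# e.g. a 70B model in bf16 across 4 GPUs:
# 70 * 2 * 1.2 / 4 = 42 GB of weights per GPU
print(vram_per_gpu_gb(70, 2.0, tp=4))
```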

### 6. Tune for Throughput vs Latency

**Low latency (single user):**

```bash
--max-running-requests 4
```

**High throughput (many users):**

```bash
--max-running-requests 64 \
--schedule-policy lpm  # Longest Prefix Match scheduling
```

***

## Troubleshooting

### Problem: "torch.cuda.OutOfMemoryError"

```
torch.cuda.OutOfMemoryError: CUDA out of memory
```

**Solution:** Reduce memory fraction or use quantization:

```bash
--mem-fraction-static 0.80
# or
--quantization awq
```

### Problem: Server won't start (hangs on loading)

```bash
# Check CUDA availability
docker exec -it sglang nvidia-smi

# Check model download progress
docker logs -f sglang 2>&1 | tail -50
```

### Problem: "trust\_remote\_code required"

Add `--trust-remote-code` to the launch command for models with custom architectures (DeepSeek, Falcon, etc.).

### Problem: Slow generation on MoE models

MoE models (Mixtral, DeepSeek) are memory-bandwidth bound. Ensure you're using:

```bash
--dtype bfloat16  # Better than float16 for MoE
--tp 2            # Split across GPUs if available
```

### Problem: Context length errors

```bash
# Override context length
--context-length 32768
```

### Problem: Port 30000 not accessible

Verify the port is exposed in your CLORE.AI order configuration. Check the http\_pub URL in your order dashboard, not localhost.

***

## Links

* [GitHub](https://github.com/sgl-project/sglang)
* [Documentation](https://sgl-project.github.io/start/install.html)
* [Docker Hub](https://hub.docker.com/r/lmsysorg/sglang)
* [Supported Models](https://github.com/sgl-project/sglang?tab=readme-ov-file#supported-models)
* [CLORE.AI Marketplace](https://clore.ai/marketplace)

***

## Clore.ai GPU Recommendations

| Use Case            | Recommended GPU  | Est. Cost on Clore.ai |
| ------------------- | ---------------- | --------------------- |
| Development/Testing | RTX 3090 (24GB)  | \~$0.12/gpu/hr        |
| Production (7B–13B) | RTX 4090 (24GB)  | \~$0.70/gpu/hr        |
| Large Models (70B+) | A100 80GB / H100 | \~$1.20/gpu/hr        |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/sglang.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
