SGLang

Deploy SGLang for high-performance LLM serving with RadixAttention on Clore.ai GPUs

SGLang (Structured Generation Language) is a high-performance LLM serving framework developed by the LMSYS team, known for their work on Vicuna and Chatbot Arena. It features RadixAttention for KV cache sharing, efficient MoE (Mixture of Experts) support, and an OpenAI-compatible API — making it one of the fastest open-source inference engines available on CLORE.AI GPU servers.


Server Requirements

| Parameter | Minimum | Recommended |
| --- | --- | --- |
| RAM | 16 GB | 32 GB+ |
| VRAM | 8 GB | 24 GB+ |
| Disk | 50 GB | 200 GB+ |
| GPU | NVIDIA Turing+ (RTX 2000 series or newer) | A100, H100, RTX 4090 |


SGLang achieves best performance on Ampere+ GPUs with FlashInfer enabled. For MoE models like Mixtral or DeepSeek, multi-GPU setups are recommended.

Quick Deploy on CLORE.AI

Docker Image: lmsysorg/sglang:latest

Ports: 22/tcp, 30000/http

Environment Variables:

| Variable | Example | Description |
| --- | --- | --- |
| HF_TOKEN | hf_xxx... | HuggingFace token for gated models |
| CUDA_VISIBLE_DEVICES | 0,1 | GPUs to use |

Step-by-Step Setup

1. Rent a GPU Server on CLORE.AI

Visit the CLORE.AI Marketplace and select a server:

  • 7B models: 16 GB VRAM minimum (RTX 4080, A10)

  • 13B models: 24 GB VRAM (RTX 3090, RTX 4090, A5000)

  • 70B models: 80 GB+ VRAM (A100 80GB) or multi-GPU

  • MoE models (Mixtral 8x7B): 48 GB VRAM or 2× 24 GB

2. SSH into Your Server
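
Your CLORE.AI order dashboard shows the exact SSH command for your server; the host and port below are placeholders:

```shell
# Use the IP address and SSH port shown on your Clore.ai order page
ssh -p <ssh-port> root@<server-ip>
```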

3. Pull SGLang Docker Image
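
```shell
# The image is several GB, so the first pull takes a while
docker pull lmsysorg/sglang:latest
```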

4. Launch SGLang Server

Basic launch (Llama 3.1 8B):
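
A typical launch with the official image looks like the following; the model ID is an example, so substitute any checkpoint you have access to:

```shell
docker run --gpus all --shm-size 16g --ipc=host \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 --port 30000
```

`--host 0.0.0.0` is required so the server is reachable through the CLORE.AI proxy, and mounting the HuggingFace cache avoids re-downloading weights on restart.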

With HuggingFace token:
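
For gated models (e.g. Llama), pass your token as an environment variable:

```shell
# hf_xxx... is a placeholder for your real HuggingFace token
docker run --gpus all --ipc=host -p 30000:30000 \
  -e HF_TOKEN=hf_xxx... \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 --port 30000
```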

Qwen2.5 72B on multi-GPU:
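
A 72B model needs to be sharded across GPUs with tensor parallelism (here 4 GPUs; adjust `--tp` to your server):

```shell
docker run --gpus all --ipc=host -p 30000:30000 \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-72B-Instruct \
    --tp 4 \
    --host 0.0.0.0 --port 30000
```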

DeepSeek-V2 (MoE model):
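
DeepSeek-V2 ships a custom model architecture, so `--trust-remote-code` is required; the `--tp` value below is illustrative and depends on your GPU count:

```shell
docker run --gpus all --ipc=host -p 30000:30000 \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V2-Chat \
    --tp 8 --trust-remote-code \
    --host 0.0.0.0 --port 30000
```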

5. Check Server Health
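
```shell
# Liveness probe; returns HTTP 200 once the server is ready
curl http://localhost:30000/health

# Model details (model path, context length, etc.)
curl http://localhost:30000/get_model_info
```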

6. Access from Outside via CLORE.AI Proxy

Your CLORE.AI dashboard provides an http_pub URL for port 30000:
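
The hostname below is a placeholder; copy the real http_pub URL from your order dashboard:

```shell
# List the served model through the public proxy URL
curl https://<your-http-pub-host>/v1/models
```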

Use this URL as your base URL in any OpenAI-compatible client.


Usage Examples

Example 1: OpenAI-Compatible Chat Completions
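
The server exposes the standard `/v1/chat/completions` route, so a plain curl request works (model ID as in the launch examples above):

```shell
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain RadixAttention in one sentence."}],
    "max_tokens": 128
  }'
```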

Example 2: Streaming Response
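
Setting `"stream": true` returns server-sent events; `-N` disables curl's buffering so tokens appear as they arrive:

```shell
curl -N http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "stream": true
  }'
```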

Example 3: Python OpenAI Client
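
Any OpenAI SDK works by pointing `base_url` at the server; SGLang does not check the API key by default, so any placeholder string is accepted:

```python
# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is tensor parallelism?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```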

Example 4: Batch Inference with SGLang Native API

SGLang's native API provides additional control:
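
The native `/generate` endpoint accepts either a single prompt string or a list of prompts, which the scheduler batches together; a minimal sketch:

```python
import requests

payload = {
    # A list of prompts is processed as one batch
    "text": [
        "The capital of France is",
        "The capital of Japan is",
    ],
    "sampling_params": {"temperature": 0.0, "max_new_tokens": 16},
}
resp = requests.post("http://localhost:30000/generate", json=payload)

# For a batched request the response is a list, one entry per prompt
for item in resp.json():
    print(item["text"])
```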

Example 5: Constrained JSON Output

SGLang supports structured output generation:
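
One way to do this is the `json_schema` sampling parameter of the native API, which constrains decoding so the output always parses against the schema (sketch based on SGLang's structured-output support; the prompt and schema are illustrative):

```python
import json
import requests

schema = json.dumps({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
})

resp = requests.post("http://localhost:30000/generate", json={
    "text": "Extract the person as JSON. Input: Alice is 30 years old.\nJSON:",
    "sampling_params": {"max_new_tokens": 64, "json_schema": schema},
})
# The constrained output is guaranteed to be valid JSON
print(json.loads(resp.json()["text"]))
```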


Configuration

Key Launch Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| --model-path | (required) | HuggingFace model ID or local path |
| --host | 127.0.0.1 | Bind host (use 0.0.0.0 for external access) |
| --port | 30000 | Server port |
| --tp | 1 | Tensor parallelism degree (number of GPUs) |
| --dp | 1 | Data parallelism degree |
| --dtype | auto | float16, bfloat16, or float32 |
| --mem-fraction-static | 0.88 | Fraction of VRAM reserved for model weights and the KV cache pool |
| --max-prefill-tokens | auto | Max tokens processed in one prefill step |
| --context-length | model max | Override the maximum context length |
| --trust-remote-code | false | Allow custom model code from the model repo |
| --quantization | none | awq, gptq, or fp8 |
| --load-format | auto | auto, pt, or safetensors |
| --tokenizer-path | same as model | Custom tokenizer path |

Quantization Options

AWQ (recommended for speed):
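
AWQ requires a pre-quantized checkpoint; the repo name below is one example of such a checkpoint:

```shell
python3 -m sglang.launch_server \
  --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq \
  --host 0.0.0.0 --port 30000
```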

FP8 (for H100/A100):
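
FP8 can be applied at load time to a regular FP16/BF16 checkpoint, roughly halving weight memory:

```shell
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --host 0.0.0.0 --port 30000
```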


Performance Tips

1. RadixAttention — The Key Advantage

SGLang's RadixAttention automatically reuses KV cache for shared prompt prefixes. This is especially powerful for:

  • Chatbots with long system prompts

  • RAG applications with repeated context

  • Batch API calls sharing the same prefix

No extra configuration needed — it's always enabled.

2. Increase KV Cache Size
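
`--mem-fraction-static` controls how much VRAM SGLang reserves for weights plus KV cache; raising it above the 0.88 default (the 0.92 below is illustrative) leaves more room for cached tokens:

```shell
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.92 \
  --host 0.0.0.0 --port 30000
```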

Be careful not to go too high — leave room for model weights.

3. Chunked Prefill for Long Contexts
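
Chunked prefill splits very long prompts into smaller pieces so ongoing decode requests are not starved; `--chunked-prefill-size` sets the chunk size in tokens (the values below are illustrative):

```shell
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --chunked-prefill-size 4096 \
  --context-length 32768 \
  --host 0.0.0.0 --port 30000
```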

4. Enable FlashInfer Backend

SGLang automatically uses FlashInfer when available (Ampere+ GPUs):
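
To pin the backend explicitly (useful for confirming it is actually in use), pass `--attention-backend`:

```shell
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --attention-backend flashinfer \
  --host 0.0.0.0 --port 30000
```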

5. Multi-GPU Tensor Parallelism

For models that don't fit on a single GPU:
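
Set `--tp` to the number of GPUs the model should be sharded across, for example two 80 GB cards for a 70B model:

```shell
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 2 \
  --host 0.0.0.0 --port 30000
```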

Each GPU must have enough VRAM for a shard of the model.

6. Tune for Throughput vs Latency

Low latency (single user):
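
One way to bias toward latency is to cap the concurrent batch size so a single request is never queued behind a large batch (flag value illustrative):

```shell
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --max-running-requests 4 \
  --host 0.0.0.0 --port 30000
```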

High throughput (many users):
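
For throughput, allow large batches and give the KV cache pool more VRAM (values illustrative; tune for your GPU):

```shell
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --max-running-requests 256 \
  --mem-fraction-static 0.9 \
  --host 0.0.0.0 --port 30000
```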


Troubleshooting

Problem: "torch.cuda.OutOfMemoryError"

Solution: Reduce memory fraction or use quantization:
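
Any of the following reduces VRAM pressure (values illustrative):

```shell
# Option 1: shrink the static memory pool
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.75

# Option 2: use a quantized checkpoint
python3 -m sglang.launch_server \
  --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq

# Option 3: cap the context length to shrink the KV cache
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --context-length 8192
```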

Problem: Server won't start (hangs on loading)

On first launch the model weights are downloaded from HuggingFace, which can take many minutes for large checkpoints. Follow progress with `docker logs -f <container>` and confirm the disk has enough free space for the full checkpoint before assuming the server is stuck.

Problem: "trust_remote_code required"

Add --trust-remote-code to the launch command for models with custom architectures (DeepSeek, Falcon, etc.).

Problem: Slow generation on MoE models

MoE models (Mixtral, DeepSeek) are memory-bandwidth bound. Ensure you're using tensor parallelism across multiple GPUs (`--tp`) and, where the hardware supports it, quantization (FP8 or AWQ) to reduce weight traffic.

Problem: Context length errors

Requests whose prompt plus requested output exceed the model's context window are rejected. Shorten the input, or override the limit with `--context-length` if the model actually supports a longer context.

Problem: Port 30000 not accessible

Verify the port is exposed in your CLORE.AI order configuration. Check the http_pub URL in your order dashboard, not localhost.



Clore.ai GPU Recommendations

| Use Case | Recommended GPU | Est. Cost on Clore.ai |
| --- | --- | --- |
| Development/Testing | RTX 3090 (24 GB) | ~$0.12/gpu/hr |
| Production (7B–13B) | RTX 4090 (24 GB) | ~$0.70/gpu/hr |
| Large Models (70B+) | A100 80GB / H100 | ~$1.20/gpu/hr |

💡 All examples in this guide can be deployed on Clore.ai GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
