# LFM2-24B-A2B

> LFM2-24B-A2B represents a breakthrough in efficient language modeling through Liquid AI's hybrid **State Space Model + Attention** architecture. With 24B total parameters but only 2B active per token, it delivers impressive performance while requiring just \~6GB VRAM for FP16 inference. The model achieves \~350 tok/s on RTX 4090, making it one of the fastest large language models available.

## At a Glance

* **Model Size**: 24B total / 2B active parameters (hybrid SSM+Attention)
* **License**: Liquid AI Open License (non-commercial free, commercial license available)
* **Context**: 32K tokens
* **Performance**: Competitive with 7B-13B dense models
* **VRAM**: \~6GB FP16, \~3GB INT8
* **Speed**: \~350 tok/s on RTX 4090, \~200 tok/s on RTX 3090

## Why LFM2-24B-A2B?

**Revolutionary Architecture**: LFM2-24B-A2B combines State Space Models (SSMs) with selective attention mechanisms. SSMs handle sequential processing efficiently while attention layers focus on complex reasoning. This hybrid approach achieves large model quality with small model efficiency.

**Exceptional Speed**: The 2B active parameter design enables lightning-fast inference. Unlike traditional models where all parameters activate, LFM2 selectively engages only the necessary components, resulting in 350+ tokens/second on consumer hardware.

**Memory Efficient**: At only 6GB VRAM for FP16, LFM2-24B-A2B runs comfortably on mid-range GPUs. This makes it ideal for edge deployment, development environments, and cost-conscious production setups.

**Liquid AI Innovation**: Developed by Liquid AI (founded by MIT researchers), LFM2 represents cutting-edge research in neural architecture. The hybrid SSM+Attention design may be the future of efficient language modeling.

**Licensing Note**: The Liquid AI Open License permits free non-commercial use. Commercial deployment requires a separate license from Liquid AI. This is **not** MIT — verify licensing terms before production use.

## GPU Recommendations

| GPU             | VRAM | Performance     | Daily Cost\* |
| --------------- | ---- | --------------- | ------------ |
| RTX 3060 12GB   | 12GB | \~180 tok/s     | \~$0.80      |
| RTX 3070        | 8GB  | \~220 tok/s     | \~$0.90      |
| **RTX 4060 Ti** | 16GB | \~300 tok/s     | \~$1.20      |
| **RTX 4090**    | 24GB | **\~350 tok/s** | \~$2.10      |
| RTX 3090        | 24GB | \~200 tok/s     | \~$1.10      |
| A100 40GB       | 40GB | \~400 tok/s     | \~$3.50      |

**Best Value**: RTX 4060 Ti 16GB offers excellent performance per dollar. **Maximum Speed**: the A100 40GB tops the table, while the RTX 4090 is the fastest consumer option.

\*Estimated Clore.ai marketplace prices

## Deploy with vLLM

### Install vLLM

```bash
pip install "vllm>=0.6.0"
# or install the latest development build
pip install git+https://github.com/vllm-project/vllm.git
```
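To confirm the installation, print the installed version (any vLLM build satisfying the constraint above should work):

```bash
# Verify that vLLM imports cleanly and meets the version requirement
python -c "import vllm; print(vllm.__version__)"
```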

### Single GPU Setup

```bash
vllm serve liquid-ai/LFM2-24B-A2B \
  --tensor-parallel-size 1 \
  --dtype float16 \
  --max-model-len 32768 \
  --served-model-name lfm2-24b \
  --trust-remote-code \
  --disable-log-stats
```
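Once the server logs show it is ready, confirm it is serving the model via the OpenAI-compatible `/v1/models` endpoint:

```bash
# Expect "lfm2-24b" in the list, matching --served-model-name
curl http://localhost:8000/v1/models
```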

### Query the Server

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1", 
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="lfm2-24b",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant specializing in technical explanations."},
        {"role": "user", "content": "Explain the differences between State Space Models and traditional Transformers"}
    ],
    max_tokens=1024,
    temperature=0.7
)

print(response.choices[0].message.content)
```
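For interactive applications, the same endpoint supports token streaming. A minimal sketch using the `openai` client against the server and model name configured above:

```python
# Stream tokens as they are generated instead of waiting for the full response
stream = client.chat.completions.create(
    model="lfm2-24b",
    messages=[{"role": "user", "content": "Summarize the benefits of hybrid SSM models."}],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```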

## Deploy with Ollama

Ollama provides the simplest deployment path:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull LFM2 model
ollama pull liquid-ai/lfm2:24b

# Run interactively
ollama run liquid-ai/lfm2:24b

# Start the API server (often already running as a background service after install)
ollama serve
```

### Ollama API Usage

```python
import requests

# Simple completion
response = requests.post('http://localhost:11434/api/generate',
    json={
        'model': 'liquid-ai/lfm2:24b',
        'prompt': 'Write a Python function to calculate Fibonacci numbers using memoization',
        'stream': False
    }
)

print(response.json()['response'])

# Chat format
chat_response = requests.post('http://localhost:11434/api/chat',
    json={
        'model': 'liquid-ai/lfm2:24b',
        'messages': [
            {'role': 'user', 'content': 'Explain quantum entanglement in simple terms'}
        ],
        'stream': False
    }
)

print(chat_response.json()['message']['content'])
```
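Ollama also streams by default: with `'stream': True` the endpoint returns one JSON object per line, which you can consume incrementally. A minimal sketch:

```python
import json
import requests

# Stream a completion; each response line is a standalone JSON object
with requests.post(
    'http://localhost:11434/api/generate',
    json={'model': 'liquid-ai/lfm2:24b', 'prompt': 'Explain memoization briefly', 'stream': True},
    stream=True,
) as response:
    for line in response.iter_lines():
        if line:
            chunk = json.loads(line)
            print(chunk.get('response', ''), end='', flush=True)
```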

## Docker Template

```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install Python 3.10
RUN apt-get update && apt-get install -y \
    python3.10 python3-pip curl && \
    rm -rf /var/lib/apt/lists/*

# Install vLLM (quote the constraint so the shell doesn't treat >= as a redirect)
RUN pip install "vllm>=0.6.0" transformers

# Set environment
ENV PYTHONUNBUFFERED=1

# Pre-download model (optional)
# RUN python3 -c "from transformers import AutoModel; AutoModel.from_pretrained('liquid-ai/LFM2-24B-A2B', trust_remote_code=True)"

EXPOSE 8000

CMD ["vllm", "serve", "liquid-ai/LFM2-24B-A2B", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--dtype", "float16", \
     "--max-model-len", "16384", \
     "--trust-remote-code"]
```

Build and run:

```bash
docker build -t lfm2-24b .
docker run --gpus all -p 8000:8000 lfm2-24b
```
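To avoid re-downloading the weights on every container start, mount a Hugging Face cache volume (the path assumes the default cache location inside the image):

```bash
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lfm2-24b
```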

## Speed Benchmark

Test LFM2's exceptional inference speed:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def speed_test():
    prompts = [
        "Explain machine learning in one paragraph",
        "Write a quick Python sorting algorithm",
        "Describe the benefits of renewable energy",
        "What is the capital of France and why is it important?",
        "Create a simple HTML page structure"
    ]
    
    total_tokens = 0
    total_time = 0
    
    for prompt in prompts:
        start_time = time.time()
        
        response = client.chat.completions.create(
            model="lfm2-24b",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=200,
            temperature=0.1
        )
        
        end_time = time.time()
        
        # Prefer the server-reported token count over a rough word-based estimate
        usage = getattr(response, "usage", None)
        tokens = usage.completion_tokens if usage else len(response.choices[0].message.content.split())
        duration = end_time - start_time
        
        total_tokens += tokens
        total_time += duration
        
        print(f"Prompt: {prompt[:30]}...")
        print(f"Tokens: {tokens}, Time: {duration:.2f}s, Speed: {tokens/duration:.1f} tok/s\n")
    
    avg_speed = total_tokens / total_time
    print(f"Average speed: {avg_speed:.1f} tokens/second")
    return avg_speed

# Run speed test
speed_test()
```

## Quantization for Lower VRAM

For GPUs with limited VRAM, use quantized versions:

### GPTQ Quantization

```bash
# auto-gptq is only needed to quantize models yourself; vLLM ships its own GPTQ kernels
pip install auto-gptq

# Serve the GPTQ-quantized model (reduces VRAM to ~3GB)
vllm serve liquid-ai/LFM2-24B-A2B-GPTQ \
  --quantization gptq \
  --dtype float16 \
  --max-model-len 16384
```

### AWQ Quantization

```bash
# autoawq is only needed to quantize models yourself; vLLM ships its own AWQ kernels
pip install autoawq

# Serve the AWQ-quantized model
vllm serve liquid-ai/LFM2-24B-A2B-AWQ \
  --quantization awq \
  --dtype float16
```
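The `-GPTQ` and `-AWQ` repository names above follow a common community naming convention but are assumptions; verify they exist before pointing vLLM at them. A quick check with `huggingface_hub`:

```python
from huggingface_hub import repo_exists

# The exact quantized repo names are assumptions; confirm before serving
for repo in ("liquid-ai/LFM2-24B-A2B-GPTQ", "liquid-ai/LFM2-24B-A2B-AWQ"):
    print(repo, "exists" if repo_exists(repo) else "NOT FOUND")
```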

## Advanced Configuration

### Memory-Optimized Setup

For 8GB GPUs:

```bash
vllm serve liquid-ai/LFM2-24B-A2B \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --swap-space 4 \
  --trust-remote-code
```
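While tuning these flags, watch actual VRAM usage to find headroom for a longer context or a higher `--gpu-memory-utilization`:

```bash
# Refresh GPU memory usage every second while the server handles requests
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```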

### High-Throughput Setup

For production workloads:

```bash
vllm serve liquid-ai/LFM2-24B-A2B \
  --tensor-parallel-size 1 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 8192 \
  --dtype float16 \
  --trust-remote-code
```
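To see the effect of batching, fire several requests concurrently; vLLM's continuous batching schedules them together on one GPU. A minimal sketch with the async OpenAI client, reusing the server and model name from above:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="lfm2-24b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return response.choices[0].message.content

async def main():
    # Eight concurrent requests exercise the batch scheduler
    prompts = [f"Give me one interesting fact about GPUs (#{i})" for i in range(8)]
    results = await asyncio.gather(*(ask(p) for p in prompts))
    for r in results:
        print(r[:80], "...")

asyncio.run(main())
```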

## SSM Architecture Benefits

LFM2's hybrid SSM+Attention provides unique advantages:

**Linear Scaling**: SSMs scale linearly with sequence length, while traditional transformers scale quadratically. This enables efficient long-context processing.

**Selective Attention**: Only critical tokens trigger full attention mechanisms, reducing computational overhead.

**Memory Efficiency**: The 2B active parameter design means most of the 24B parameters remain dormant during inference, drastically reducing memory bandwidth requirements.

**Fast Sequential Processing**: SSMs excel at sequential tasks like text generation, achieving higher throughput than pure attention mechanisms.
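To make the linear-scaling point concrete, here is a toy diagonal state-space recurrence in Python. It processes a sequence in a single O(L) scan, whereas self-attention over the same sequence computes O(L²) pairwise scores. This illustrates only the shape of the recurrence, not LFM2's actual layers:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy diagonal SSM: h_t = A*h_{t-1} + B*x_t, y_t = C.h_t.

    One pass over the sequence: O(L * d) work instead of attention's O(L^2).
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                # single linear scan over the sequence
        h = A * h + B * x_t      # state update (A, B diagonal, stored as vectors)
        ys.append(C @ h)         # readout
    return np.array(ys)

# 1,024-step sequence with a 16-dimensional hidden state
L, d = 1024, 16
y = ssm_scan(np.random.randn(L), np.full(d, 0.9), np.ones(d), np.ones(d))
print(y.shape)  # (1024,)
```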

## Tips for Clore.ai Users

* **Single GPU Focused**: LFM2-24B-A2B is optimized for single-GPU deployment. Multi-GPU setups don't provide significant benefits.
* **Context Length**: Use shorter contexts (8K-16K) for maximum speed; although the SSM layers scale linearly, the attention layers in the hybrid stack still slow down as context grows.
* **Temperature Settings**: Sampling temperature has little effect on raw inference speed; use lower values (0.1-0.3) when you want deterministic, focused outputs.
* **Batch Size**: Increase batch size for multiple concurrent requests rather than using multiple GPUs.
* **License Compliance**: Verify commercial licensing requirements with Liquid AI before production deployment.

## Troubleshooting

| Issue                              | Solution                                                                               |
| ---------------------------------- | -------------------------------------------------------------------------------------- |
| `ImportError: liquid_transformers` | Install: `pip install git+https://github.com/LiquidAI-project/liquid-transformers.git` |
| Slow startup                       | Pre-download: `huggingface-cli download liquid-ai/LFM2-24B-A2B`                        |
| `OutOfMemoryError`                 | Use quantized version or reduce `max-model-len`                                        |
| Poor quality responses             | Confirm the correct model revision and chat template; quantized variants trade some quality for VRAM   |
| SSM layer errors                   | Update transformers: `pip install "transformers>=4.45.0"`                               |

## Performance Comparison

| Model            | Active Params | VRAM (FP16) | Speed (RTX 4090) |
| ---------------- | ------------- | ----------- | ---------------- |
| Llama 3.2 3B     | 3B            | \~6GB       | \~280 tok/s      |
| Qwen2.5 7B       | 7B            | \~14GB      | \~180 tok/s      |
| **LFM2-24B-A2B** | **2B**        | **\~6GB**   | **\~350 tok/s**  |
| Mistral 7B       | 7B            | \~14GB      | \~200 tok/s      |
| Phi-3.5 3.8B     | 3.8B          | \~8GB       | \~250 tok/s      |

LFM2-24B-A2B achieves the best speed-per-VRAM ratio in its class.

## Resources

* [LFM2-24B-A2B on Hugging Face](https://huggingface.co/liquid-ai/LFM2-24B-A2B)
* [Liquid AI Company](https://liquid.ai/)
* [SSM Architecture Paper](https://arxiv.org/abs/2312.00752)
* [Liquid AI Licensing](https://liquid.ai/licensing)
* [vLLM SSM Support](https://docs.vllm.ai/en/latest/models/supported_models.html#liquid-ai)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/lfm2-24b.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
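For example, from the command line (the question must be URL-encoded):

```bash
curl "https://docs.clore.ai/guides/language-models/lfm2-24b.md?ask=What%20license%20does%20LFM2-24B-A2B%20use%3F"
```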
