LFM2-24B-A2B

Deploy LFM2-24B-A2B by Liquid AI on Clore.ai — hybrid SSM+Attention architecture with 24B total / 2B active parameters

LFM2-24B-A2B represents a breakthrough in efficient language modeling through Liquid AI's hybrid State Space Model + Attention architecture. With 24B total parameters but only 2B active per token, it delivers impressive performance while requiring just ~6GB VRAM for FP16 inference. The model achieves ~350 tok/s on RTX 4090, making it one of the fastest large language models available.

At a Glance

  • Model Size: 24B total / 2B active parameters (hybrid SSM+Attention)

  • License: Liquid AI Open License (non-commercial free, commercial license available)

  • Context: 32K tokens

  • Performance: Competitive with 7B-13B dense models

  • VRAM: ~6GB FP16, ~3GB INT8

  • Speed: ~350 tok/s on RTX 4090, ~200 tok/s on RTX 3090

Why LFM2-24B-A2B?

Revolutionary Architecture: LFM2-24B-A2B combines State Space Models (SSMs) with selective attention mechanisms. SSMs handle sequential processing efficiently while attention layers focus on complex reasoning. This hybrid approach achieves large model quality with small model efficiency.

Exceptional Speed: The 2B active parameter design enables lightning-fast inference. Unlike traditional models where all parameters activate, LFM2 selectively engages only the necessary components, resulting in 350+ tokens/second on consumer hardware.

Memory Efficient: At only 6GB VRAM for FP16, LFM2-24B-A2B runs comfortably on mid-range GPUs. This makes it ideal for edge deployment, development environments, and cost-conscious production setups.

Liquid AI Innovation: Developed by Liquid AI (founded by MIT researchers), LFM2 represents cutting-edge research in neural architecture. The hybrid SSM+Attention design may be the future of efficient language modeling.

Licensing Note: The Liquid AI Open License permits free non-commercial use. Commercial deployment requires a separate license from Liquid AI. This is not MIT — verify licensing terms before production use.

GPU Recommendations

| GPU | VRAM | Performance | Daily Cost* |
|---|---|---|---|
| RTX 3060 12GB | 12GB | ~180 tok/s | ~$0.80 |
| RTX 3070 | 8GB | ~220 tok/s | ~$0.90 |
| RTX 4060 Ti | 16GB | ~300 tok/s | ~$1.20 |
| RTX 4090 | 24GB | ~350 tok/s | ~$2.10 |
| RTX 3090 | 24GB | ~200 tok/s | ~$1.10 |
| A100 40GB | 40GB | ~400 tok/s | ~$3.50 |

Best Value: RTX 4060 Ti 16GB offers excellent performance per dollar. Maximum Speed: RTX 4090 unleashes LFM2's full potential.

*Estimated Clore.ai marketplace prices

Deploy with vLLM

Install vLLM
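A minimal install sketch, assuming a Linux host with a recent NVIDIA driver and Python 3.10+ (vLLM pulls in a CUDA-enabled PyTorch build):

```shell
# Create an isolated environment and install vLLM
python3 -m venv vllm-env
source vllm-env/bin/activate
pip install --upgrade pip
pip install vllm
```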

Single GPU Setup
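A single-GPU launch sketch. The Hugging Face repo id `liquid-ai/LFM2-24B-A2B` and the flag values are assumptions to adjust for your card:

```shell
# OpenAI-compatible server on port 8000 with the full 32K context
vllm serve liquid-ai/LFM2-24B-A2B \
  --dtype float16 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```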

Query the Server
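Once the server is up, query it through vLLM's OpenAI-compatible endpoint (the `model` field must match the repo id you served):

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "liquid-ai/LFM2-24B-A2B",
        "messages": [{"role": "user", "content": "Explain state space models in two sentences."}],
        "max_tokens": 128,
        "temperature": 0.3
      }'
```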

Deploy with Ollama

Ollama provides the simplest deployment path:
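A sketch assuming the model is published in the Ollama library; the tag `lfm2:24b-a2b` is illustrative, so check the library for the actual published name:

```shell
# Install Ollama (Linux), then pull and chat with the model
curl -fsSL https://ollama.com/install.sh | sh
ollama run lfm2:24b-a2b
```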

Ollama API Usage
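Ollama exposes a local REST API on port 11434 by default; the model tag below is the same illustrative one as above:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "lfm2:24b-a2b",
  "prompt": "Summarize the benefits of hybrid SSM+Attention models.",
  "stream": false
}'
```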

Docker Template
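One possible template built on vLLM's official image. Its entrypoint already launches the OpenAI-compatible server, so only model flags are passed; the repo id is an assumption:

```dockerfile
FROM vllm/vllm-openai:latest

# Model and serving flags go to the image's built-in entrypoint
CMD ["--model", "liquid-ai/LFM2-24B-A2B", \
     "--dtype", "float16", \
     "--max-model-len", "32768"]
```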

Build and run:
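With a Dockerfile in the current directory, build the image and run it with GPU access; mounting the Hugging Face cache avoids re-downloading weights on every restart:

```shell
docker build -t lfm2-server .
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lfm2-server
```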

Speed Benchmark

Test LFM2's exceptional inference speed:
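A rough throughput probe against a running vLLM endpoint, using only the standard library. The URL, model id, and token budget are assumptions; only the `tokens_per_second` helper runs without a live server:

```python
import json
import time
import urllib.error
import urllib.request

def tokens_per_second(n_tokens: int, elapsed: float) -> float:
    """Generation throughput; returns 0.0 for a degenerate timing."""
    return n_tokens / elapsed if elapsed > 0 else 0.0

def benchmark(url: str = "http://localhost:8000/v1/completions",
              model: str = "liquid-ai/LFM2-24B-A2B",
              max_tokens: int = 256) -> None:
    payload = json.dumps({
        "model": model,
        "prompt": "Write a short story about a robot.",
        "max_tokens": max_tokens,
        "temperature": 0.3,
    }).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    generated = body["usage"]["completion_tokens"]
    print(f"{generated} tokens in {elapsed:.2f}s "
          f"-> {tokens_per_second(generated, elapsed):.1f} tok/s")

if __name__ == "__main__":
    try:
        benchmark()
    except (urllib.error.URLError, OSError):
        print("No server at localhost:8000; start vLLM first.")
```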

Quantization for Lower VRAM

For GPUs with limited VRAM, use quantized versions:

GPTQ Quantization
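A sketch assuming a GPTQ checkpoint exists on the Hub; the `-GPTQ` repo suffix is illustrative, not a confirmed release:

```shell
# Lower VRAM footprint at some quality cost
vllm serve liquid-ai/LFM2-24B-A2B-GPTQ \
  --quantization gptq \
  --max-model-len 16384
```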

AWQ Quantization
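Likewise for AWQ, which generally preserves quality well at 4-bit; the `-AWQ` repo suffix is again an assumption:

```shell
vllm serve liquid-ai/LFM2-24B-A2B-AWQ \
  --quantization awq \
  --max-model-len 16384
```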

Advanced Configuration

Memory-Optimized Setup

For 8GB GPUs:
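One way to fit an 8GB card, trading context length and CUDA-graph memory for headroom; the values below are starting points, not tuned settings:

```shell
vllm serve liquid-ai/LFM2-24B-A2B \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --swap-space 4 \
  --enforce-eager
```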

High-Throughput Setup

For production workloads:
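A throughput-oriented sketch: `--max-num-seqs` bounds concurrent sequences, and prefix caching helps when requests share a system prompt. Values are illustrative:

```shell
vllm serve liquid-ai/LFM2-24B-A2B \
  --dtype float16 \
  --max-model-len 16384 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching
```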

SSM Architecture Benefits

LFM2's hybrid SSM+Attention provides unique advantages:

Linear Scaling: SSMs scale linearly with sequence length, while traditional transformers scale quadratically. This enables efficient long-context processing.

Selective Attention: Only critical tokens trigger full attention mechanisms, reducing computational overhead.

Memory Efficiency: The 2B active parameter design means most of the 24B parameters remain dormant during inference, drastically reducing memory bandwidth requirements.

Fast Sequential Processing: SSMs excel at sequential tasks like text generation, achieving higher throughput than pure attention mechanisms.
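The linear-vs-quadratic point above can be made concrete with back-of-the-envelope cost counts; the hidden size and cost models are illustrative, not LFM2's real ones:

```python
# Attention compares every token pair: O(n^2 * d).
# An SSM scan updates a fixed-size state per token: O(n * d).
def attention_cost(n: int, d: int = 2048) -> int:
    return n * n * d

def ssm_cost(n: int, d: int = 2048) -> int:
    return n * d

for n in (1_024, 8_192, 32_768):
    ratio = attention_cost(n) // ssm_cost(n)   # simplifies to exactly n
    print(f"n={n:>6}: attention costs {ratio}x the SSM scan")
```

At the full 32K context the gap is four orders of magnitude, which is why the hybrid design reserves full attention for a subset of layers.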

Tips for Clore.ai Users

  • Single GPU Focused: LFM2-24B-A2B is optimized for single-GPU deployment. Multi-GPU setups don't provide significant benefits.

  • Context Length: Use shorter contexts (8K-16K) when maximum speed matters; throughput still drops as the prompt grows, even though the hybrid design degrades more gracefully than pure attention.

  • Temperature Settings: Lower temperatures (0.1-0.3) give more deterministic, focused outputs; note that sampling parameters have little effect on raw token throughput.

  • Batch Size: Increase batch size for multiple concurrent requests rather than using multiple GPUs.

  • License Compliance: Verify commercial licensing requirements with Liquid AI before production deployment.

Troubleshooting

| Issue | Solution |
|---|---|
| `ImportError: liquid_transformers` | Install: `pip install git+https://github.com/LiquidAI-project/liquid-transformers.git` |
| Slow startup | Pre-download: `huggingface-cli download liquid-ai/LFM2-24B-A2B` |
| `OutOfMemoryError` | Use a quantized version or reduce `--max-model-len` |
| Poor quality responses | Check license restrictions; some model versions have limited capabilities |
| SSM layer errors | Update transformers: `pip install "transformers>=4.45.0"` |

Performance Comparison

| Model | Active Params | VRAM (FP16) | Speed (RTX 4090) |
|---|---|---|---|
| Llama 3.2 3B | 3B | ~6GB | ~280 tok/s |
| Qwen2.5 7B | 7B | ~14GB | ~180 tok/s |
| LFM2-24B-A2B | 2B | ~6GB | ~350 tok/s |
| Mistral 7B | 7B | ~14GB | ~200 tok/s |
| Phi-3.5 3.8B | 3.8B | ~8GB | ~250 tok/s |

LFM2-24B-A2B achieves the best speed-per-VRAM ratio in its class.
