For the complete documentation index, see llms.txt. This page is also available as Markdown.

Qwen3.5-Omni (Multimodal)

Alibaba's Qwen3.5-Omni is a unified end-to-end multimodal model released on March 30, 2026 under the Apache 2.0 license. It can understand and reason across text, audio, images, and video simultaneously — and generate both text and speech as output. Running it on a rented Clore.ai GPU gives you a production-grade multimodal assistant at a fraction of cloud API costs.


What Is Qwen3.5-Omni?

Qwen3.5-Omni is an end-to-end multimodal model built on a sparse Mixture-of-Experts architecture. The HuggingFace release (Qwen3.5-Omni-7B) uses Alibaba's naming convention where "7B" refers to the active-parameter configuration per inference step; the full checkpoint includes all expert weights. That sparsity is what makes it deployable on a single RTX 4090 (24 GB) using INT4 quantization — a model that would otherwise require far more VRAM at full precision.

Key Capabilities

Modality
Input
Output

Text

Audio

✅ (transcription, understanding)

✅ (speech synthesis)

Image

✅ (understanding, OCR, analysis)

Video

✅ (scene understanding, QA)

Unlike previous multimodal models that bolt together separate encoders, Qwen3.5-Omni processes all modalities in a single unified forward pass. It can simultaneously transcribe spoken audio, analyze a video frame, and respond with both text and a synthesized voice — in one inference call.

Architecture Highlights

  • Gated Delta Networks (GDN) for efficient sequence modeling with subquadratic complexity on long audio/video streams

  • Sparse Mixture-of-Experts — 30B total params, ~3B active per token; comparable quality to 7–14B dense models but faster at scale

  • Unified tokenizer covering text, audio frames, image patches, and video frame sequences

  • Built-in TTS decoder — generates speech waveforms natively rather than through a separate pipeline

Released March 30, 2026 · License: Apache 2.0 · HuggingFace


Model
Params
Modalities In
Speech Out
License
VRAM (INT4)

Qwen3.5-Omni

30B MoE (3B active)

Text, Audio, Image, Video

Apache 2.0

~15 GB

Qwen3.5 (text-only)

32B

Text only

Apache 2.0

~18 GB

Qwen2.5-VL

72B

Text, Image, Video

Apache 2.0

~40 GB

Gemini 2.0 Flash

Text, Audio, Image, Video

Proprietary

API only

Compared to Qwen3.5 (text-only), the Omni variant adds audio/video understanding and speech generation while actually requiring less VRAM at INT4 thanks to the MoE architecture. Compared to Qwen2.5-VL, it adds audio I/O but requires far less hardware.


Hardware Requirements

Precision
VRAM Required
Recommended GPU

BF16 (full)

64–80 GB

A100 80GB, H100

BF16 multi-GPU

2× 40 GB

2× A40 / 2× A6000

INT4 / GGUF

~15 GB

RTX 4090 (24 GB) ✅

INT8

~30 GB

A6000 48GB, RTX 6000 Ada

For most self-hosted use cases, INT4 on an RTX 4090 is the sweet spot: full multimodal capability at $0.50–0.80/day on Clore.ai.


Quick Start on Clore.ai

Step 1: Rent a GPU

Go to clore.ai/marketplace and rent:

  • INT4 / Single-GPU: RTX 4090 (24 GB) — from ~$0.50/day

  • BF16 / Full Precision: A100 80GB or H100 — from ~$2.50/day

Use the vllm/vllm-openai Docker image or the standard CUDA image.

vLLM v0.17.0+ is required for Qwen3.5-Omni support.

Note: The awq_marlin flag requires a pre-quantized AWQ model. Download Qwen/Qwen3.5-Omni-7B-AWQ instead of the base model, or omit --quantization for BF16 on A100/H100.

Once the server is running, it exposes an OpenAI-compatible API at http://localhost:8000/v1.

Step 3: Deploy with Ollama (Simpler Setup)

For quick experimentation without Docker complexity:

Ollama handles quantization automatically and provides a simple /api/generate endpoint.


Example API Calls

Multimodal Input: Image + Text

Audio Transcription + Understanding

Video Understanding


Multi-GPU Setup for BF16

If you rent a multi-GPU machine on Clore.ai (e.g., 2× A40 or 2× A6000), use tensor parallelism:

This splits the model across both GPUs for maximum throughput and quality.


Use Cases

1. Customer Service Automation

Qwen3.5-Omni can listen to customer voice calls, transcribe them in real-time, understand the issue, and generate both a text summary and a spoken response. All in one model, no stitching together separate ASR + LLM + TTS pipelines.

2. Video Content Understanding

Upload product demo videos, lecture recordings, or surveillance footage and get detailed text descriptions, timestamped summaries, or Q&A. The model handles up to 32K tokens of context, covering multi-minute videos.

3. Real-Time Voice Agents

Build conversational voice assistants that understand context across audio turns. Qwen3.5-Omni maintains conversational memory and can interleave its text reasoning with speech generation — ideal for phone-based customer support bots.

4. Document + Screenshot Analysis

OCR, layout understanding, chart interpretation — pass in screenshots of dashboards, PDFs, or handwritten notes and get structured text output or detailed analysis.

5. Multilingual Audio Processing

The model supports 29 languages for both text and speech, making it suitable for international customer support, multilingual transcription pipelines, and cross-lingual video analysis.


Cost Estimate on Clore.ai

GPU
Precision
VRAM
Price/Day
Best For

RTX 4090

INT4

24 GB

~$0.50

Dev, testing, small-scale production

RTX 6000 Ada

INT8

48 GB

~$1.20

Better quality, moderate throughput

A100 80GB

BF16

80 GB

~$2.50

Full quality, high throughput

2× A40

BF16 tensor parallel

2×48 GB

~$2.00

Full quality, cost-efficient

Running Qwen3.5-Omni at INT4 on an RTX 4090 costs less per day than a single OpenAI API call for a complex multimodal task at scale.


Tips & Troubleshooting

"CUDA out of memory" on RTX 4090

  • Add --gpu-memory-utilization 0.90 to the vLLM command

  • Reduce --max-model-len to 16384 if processing short inputs

Audio input not working

  • Ensure vLLM version is exactly v0.17.0 or newer — earlier versions lack Omni audio support

  • WAV files must be 16kHz mono for best results; use ffmpeg -ar 16000 -ac 1 to convert

Slow first inference

  • vLLM compiles CUDA kernels on first run; warmup takes 2–5 minutes. Subsequent calls are fast.

Ollama not recognizing video input

  • Ollama currently supports image+text and audio only; for video understanding use the vLLM deployment.


Summary

Qwen3.5-Omni brings true end-to-end multimodal AI — text, audio, image, and video in, text and speech out — to a single open-source model that runs on consumer hardware. At INT4, it fits in a 24 GB RTX 4090 and costs under a dollar a day on Clore.ai. With Apache 2.0 licensing and OpenAI-compatible API via vLLM, it drops directly into existing pipelines.

Rent an RTX 4090 on Clore.ai and deploy Qwen3.5-Omni today.

Last updated

Was this helpful?