# Qwen3.5-Omni (Multimodal)

Alibaba's **Qwen3.5-Omni** is a unified end-to-end multimodal model released on March 30, 2026 under the Apache 2.0 license. It can understand and reason across text, audio, images, and video simultaneously — and generate both text and speech as output. Running it on a rented Clore.ai GPU gives you a production-grade multimodal assistant at a fraction of cloud API costs.

***

## What Is Qwen3.5-Omni?

Qwen3.5-Omni is an **end-to-end multimodal model** built on a sparse Mixture-of-Experts architecture. Note that the "7B" in the HuggingFace model ID (`Qwen3.5-Omni-7B`) does not describe the full checkpoint: the complete MoE weight set holds \~30B parameters, of which only \~3B are active per token (see the architecture highlights below). The sparsity keeps per-token compute low, while INT4 quantization is what fits the full weight set onto a single RTX 4090 (24 GB); at full precision the same model requires far more VRAM.

### Key Capabilities

| Modality | Input                            | Output               |
| -------- | -------------------------------- | -------------------- |
| Text     | ✅                                | ✅                    |
| Audio    | ✅ (transcription, understanding) | ✅ (speech synthesis) |
| Image    | ✅ (understanding, OCR, analysis) | —                    |
| Video    | ✅ (scene understanding, QA)      | —                    |

Unlike previous multimodal models that bolt together separate encoders, Qwen3.5-Omni processes all modalities in a single unified forward pass. It can simultaneously transcribe spoken audio, analyze a video frame, and respond with both text and a synthesized voice — in one inference call.

### Architecture Highlights

* **Gated Delta Networks (GDN)** for efficient sequence modeling with subquadratic complexity on long audio/video streams
* **Sparse Mixture-of-Experts** — 30B total params, \~3B active per token; comparable quality to 7–14B dense models but faster at scale (see the routing sketch after this list)
* **Unified tokenizer** covering text, audio frames, image patches, and video frame sequences
* **Built-in TTS decoder** — generates speech waveforms natively rather than through a separate pipeline
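
To make the routing idea concrete, here is a minimal top-k MoE sketch. The dimensions, expert count, and `k` are made-up toy values, not Qwen's actual configuration:

```python
import torch
import torch.nn as nn

class ToyTopKRouter(nn.Module):
    """Toy top-k MoE layer: route each token to k of n experts."""
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.k = k

    def forward(self, x):                          # x: (tokens, d_model)
        weights, idx = self.gate(x).topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)          # mixing weights over the k picks
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # only k experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Only the selected experts participate in each token's forward pass, which is why the active-parameter count (and per-token compute) sits far below the checkpoint's total size. All expert weights must still be resident in VRAM, though; quantization, not sparsity, is what shrinks the memory footprint.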

Released March 30, 2026 · License: **Apache 2.0** · [HuggingFace](https://huggingface.co/Qwen/Qwen3.5-Omni-7B)

***

## Qwen3.5-Omni vs. Related Models

| Model               | Params              | Modalities In             | Speech Out | License     | VRAM (INT4) |
| ------------------- | ------------------- | ------------------------- | ---------- | ----------- | ----------- |
| **Qwen3.5-Omni**    | 30B MoE (3B active) | Text, Audio, Image, Video | ✅          | Apache 2.0  | \~15 GB     |
| Qwen3.5 (text-only) | 32B                 | Text only                 | ❌          | Apache 2.0  | \~18 GB     |
| Qwen2.5-VL          | 72B                 | Text, Image, Video        | ❌          | Apache 2.0  | \~40 GB     |
| Gemini 2.0 Flash    | —                   | Text, Audio, Image, Video | ✅          | Proprietary | API only    |

Compared to **Qwen3.5 (text-only)**, the Omni variant adds audio/video understanding and speech generation while requiring slightly *less* VRAM at INT4, since the MoE checkpoint's 30B total parameters quantize smaller than the dense model's 32B. Compared to **Qwen2.5-VL**, it adds audio I/O while needing far less hardware.

***

## Hardware Requirements

| Precision      | VRAM Required | Recommended GPU          |
| -------------- | ------------- | ------------------------ |
| BF16 (full)    | 64–80 GB      | A100 80GB, H100          |
| BF16 multi-GPU | 2× 48 GB      | 2× A40 / 2× A6000        |
| INT4 / GGUF    | \~15 GB       | RTX 4090 (24 GB) ✅       |
| INT8           | \~30 GB       | A6000 48GB, RTX 6000 Ada |

For most self-hosted use cases, **INT4 on an RTX 4090** is the sweet spot: full multimodal capability at $0.50–0.80/day on Clore.ai.
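
The table's figures line up with a weights-only back-of-envelope estimate (KV cache and activations add overhead on top, which is why BF16 is budgeted at 64–80 GB rather than 60 GB):

```python
def weight_vram_gb(total_params_billions: float, bits_per_param: int) -> float:
    """Approximate VRAM for the weights alone (no KV cache, no activations)."""
    return total_params_billions * 1e9 * bits_per_param / 8 / 1e9

# All 30B parameters must be resident, even though only ~3B are active per token.
for label, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: ~{weight_vram_gb(30, bits):.0f} GB")
# BF16: ~60 GB, INT8: ~30 GB, INT4: ~15 GB
```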

***

## Quick Start on Clore.ai

### Step 1: Rent a GPU

Go to [clore.ai/marketplace](https://clore.ai/marketplace) and rent:

* **INT4 / Single-GPU**: RTX 4090 (24 GB) — from **\~$0.50/day**
* **BF16 / Full Precision**: A100 80GB or H100 — from **\~$2.50/day**

Use the **vllm/vllm-openai** Docker image or the standard CUDA image.

### Step 2: Deploy with vLLM (Recommended)

vLLM v0.17.0+ is required for Qwen3.5-Omni support.

```bash
# Pull and run the vLLM OpenAI-compatible server with the pre-quantized AWQ checkpoint
docker run --gpus all --rm -it \
  -p 8000:8000 \
  -v /workspace/models:/root/.cache/huggingface \
  vllm/vllm-openai:v0.17.0 \
  --model Qwen/Qwen3.5-Omni-7B-AWQ \
  --served-model-name Qwen/Qwen3.5-Omni-7B \
  --quantization awq_marlin \
  --max-model-len 32768 \
  --trust-remote-code
```

> **Note:** `awq_marlin` requires a pre-quantized AWQ checkpoint, which is why the command loads `Qwen/Qwen3.5-Omni-7B-AWQ` (aliased back to the base name via `--served-model-name`, so the API examples below work unchanged). For BF16 on an A100/H100, load the base `Qwen/Qwen3.5-Omni-7B` and omit `--quantization`.

Once the server is running, it exposes an OpenAI-compatible API at `http://localhost:8000/v1`.
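
A quick smoke test is to list the models the server advertises; this assumes the `openai` Python package and the default port from the command above:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
for model in client.models.list():
    print(model.id)  # should print the served model name, e.g. Qwen/Qwen3.5-Omni-7B
```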

### Step 3: Deploy with Ollama (Simpler Setup)

For quick experimentation without Docker complexity:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Qwen3.5-Omni (quantized)
# Note: check https://ollama.com/library for availability — tag may vary
ollama pull qwen3.5-omni

# Start the server
ollama serve
```

Ollama handles quantization automatically and provides a simple `/api/generate` endpoint.
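
Assuming the tag above exists in the library, a minimal image+text call against that endpoint looks like the sketch below (the `images` field takes base64-encoded data; the model tag is hypothetical until you verify it):

```python
import base64
import requests

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",   # Ollama's default port
    json={
        "model": "qwen3.5-omni",             # verify the exact tag on ollama.com/library
        "prompt": "Describe this image.",
        "images": [image_b64],
        "stream": False,
    },
)
print(resp.json()["response"])
```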

***

## Example API Calls

### Multimodal Input: Image + Text

```python
import openai
import base64

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Load an image
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-Omni-7B",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"}
                },
                {
                    "type": "text",
                    "text": "Describe what you see in this image and identify any text."
                }
            ]
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
```

### Audio Transcription + Understanding

```python
import openai
import base64

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

with open("meeting_recording.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-Omni-7B",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}
                },
                {
                    "type": "text",
                    "text": "Transcribe this audio and summarize the key points."
                }
            ]
        }
    ]
)
print(response.choices[0].message.content)
```

### Video Understanding

```python
# Video frames can be passed as a sequence of image URLs,
# or as a video_url when using the Qwen3.5-Omni native API.
# Reuses the `client` from the previous examples.
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-Omni-7B",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {"url": "https://example.com/product-demo.mp4"}
                },
                {
                    "type": "text",
                    "text": "What is happening in this video? Describe each scene."
                }
            ]
        }
    ]
)
print(response.choices[0].message.content)
```
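
If your server build does not accept `video_url`, the fallback mentioned in the comment above is to sample frames yourself and send them as a sequence of images. A sketch, assuming `opencv-python` is installed and reusing `client` from earlier (the sampling interval is arbitrary):

```python
import base64
import cv2  # pip install opencv-python

def sample_frames(path: str, every_n: int = 30) -> list[str]:
    """Grab every Nth frame and return base64-encoded JPEGs."""
    cap = cv2.VideoCapture(path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buf.tobytes()).decode())
        i += 1
    cap.release()
    return frames

content = [
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
    for b64 in sample_frames("product-demo.mp4")
]
content.append({"type": "text", "text": "What is happening in this video? Describe each scene."})

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-Omni-7B",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```

Keep an eye on context length: each frame consumes vision tokens, so long videos may need a larger `--max-model-len` or a sparser sampling interval.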

***

## Multi-GPU Setup for BF16

If you rent a multi-GPU machine on Clore.ai (e.g., 2× A40 or 2× A6000), use tensor parallelism:

```bash
docker run --gpus all --rm -it \
  -p 8000:8000 \
  -v /workspace/models:/root/.cache/huggingface \
  vllm/vllm-openai:v0.17.0 \
  --model Qwen/Qwen3.5-Omni-7B \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --max-model-len 65536 \
  --trust-remote-code
```

This splits the model across both GPUs for maximum throughput and quality.

***

## Use Cases

### 1. Customer Service Automation

Qwen3.5-Omni can listen to customer voice calls, transcribe them in real time, understand the issue, and generate both a text summary and a spoken response, all in one model, with no need to stitch together separate ASR + LLM + TTS pipelines.
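
The examples earlier only show text output. The OpenAI Python SDK does expose `modalities` and `audio` parameters for speech-capable chat models; *if* your vLLM build passes them through to Qwen3.5-Omni's TTS decoder, a combined text+speech request could look like the sketch below. Treat both the passthrough and the voice name as unverified assumptions:

```python
import base64
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# ASSUMPTION: your server honors OpenAI-style audio-output parameters.
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-Omni-7B",
    modalities=["text", "audio"],
    audio={"voice": "default", "format": "wav"},  # hypothetical voice name
    messages=[{"role": "user", "content": "Summarize our return policy in two sentences."}],
)
print(response.choices[0].message.content)
if response.choices[0].message.audio:  # OpenAI-style base64 audio payload
    with open("reply.wav", "wb") as f:
        f.write(base64.b64decode(response.choices[0].message.audio.data))
```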

### 2. Video Content Understanding

Upload product demo videos, lecture recordings, or surveillance footage and get detailed text descriptions, timestamped summaries, or Q\&A. The model handles up to 32K tokens of context, covering multi-minute videos.

### 3. Real-Time Voice Agents

Build conversational voice assistants that understand context across audio turns. Qwen3.5-Omni maintains conversational memory and can interleave its text reasoning with speech generation — ideal for phone-based customer support bots.
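
Conversational memory here is nothing more than the standard chat-message history. A sketch of the multi-turn pattern, reusing `client` and an `audio_b64` prepared as in the transcription example:

```python
# Append each exchange to `messages` so the model keeps context across turns.
messages = [
    {"role": "user", "content": [
        {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
        {"type": "text", "text": "What is the caller asking for?"},
    ]},
]
first = client.chat.completions.create(model="Qwen/Qwen3.5-Omni-7B", messages=messages)
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Draft a one-sentence reply to them."})
second = client.chat.completions.create(model="Qwen/Qwen3.5-Omni-7B", messages=messages)
print(second.choices[0].message.content)
```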

### 4. Document + Screenshot Analysis

OCR, layout understanding, chart interpretation — pass in screenshots of dashboards, PDFs, or handwritten notes and get structured text output or detailed analysis.

### 5. Multilingual Audio Processing

The model supports 29 languages for both text and speech, making it suitable for international customer support, multilingual transcription pipelines, and cross-lingual video analysis.

***

## Cost Estimate on Clore.ai

| GPU          | Precision            | VRAM    | Price/Day | Best For                             |
| ------------ | -------------------- | ------- | --------- | ------------------------------------ |
| RTX 4090     | INT4                 | 24 GB   | \~$0.50   | Dev, testing, small-scale production |
| RTX 6000 Ada | INT8                 | 48 GB   | \~$1.20   | Better quality, moderate throughput  |
| A100 80GB    | BF16                 | 80 GB   | \~$2.50   | Full quality, high throughput        |
| 2× A40       | BF16 tensor parallel | 2×48 GB | \~$2.00   | Full quality, cost-efficient         |

At \~$0.50/day for an RTX 4090, self-hosting Qwen3.5-Omni at INT4 quickly undercuts per-request cloud API pricing once you process multimodal workloads at any meaningful volume.

***

## Tips & Troubleshooting

**"CUDA out of memory" on RTX 4090**

* Add `--gpu-memory-utilization 0.90` to the vLLM command
* Reduce `--max-model-len` to 16384 if processing short inputs

**Audio input not working**

* Ensure your vLLM version is `v0.17.0` or newer; earlier versions lack Omni audio support
* WAV files should be 16 kHz mono for best results; convert with `ffmpeg -i input.wav -ar 16000 -ac 1 output.wav`

**Slow first inference**

* vLLM compiles CUDA kernels on first run; warmup takes 2–5 minutes. Subsequent calls are fast.

**Ollama not recognizing video input**

* Ollama currently supports image+text and audio only; for video understanding use the vLLM deployment.

***

## Summary

Qwen3.5-Omni brings true end-to-end multimodal AI — text, audio, image, and video in, text and speech out — to a single open-source model that runs on consumer hardware. At INT4, it fits in a 24 GB RTX 4090 and costs under a dollar a day on Clore.ai. With Apache 2.0 licensing and OpenAI-compatible API via vLLM, it drops directly into existing pipelines.

**→** [**Rent an RTX 4090 on Clore.ai**](https://clore.ai/marketplace) and deploy Qwen3.5-Omni today.

