# MOSS-TTS (CPU-only, 100M)

MOSS-TTS is an open-source speech generation family from **OpenMOSS** (Shanghai Innovation Institute, in collaboration with **Fudan NLP** and **MOSI.AI**, led by Prof. Xipeng Qiu). The flagship **MOSS-TTS-Nano** is just **100M parameters**, runs in real time on a **4-core CPU with no GPU**, outputs **48 kHz stereo**, and supports **20 languages** with zero-shot voice cloning. The full family scales up to 8B for multi-speaker dialogue, voice design, and sound-effect generation.

{% hint style="info" %}
**Released:** April 10, 2026 (Nano) · ONNX CPU build April 17, 2026 · **License:** Apache 2.0
{% endhint %}

If Kokoro owns the 82M-param Western-English niche, MOSS-TTS-Nano owns the **CPU-first multilingual** niche: same tiny-model philosophy, but stereo 48 kHz, 20 languages, voice cloning, and a torch-free ONNX/GGUF path. For anyone who wants to ship TTS without paying for a GPU, this is the model.

### MOSS-TTS Family

| Model                          | Size            | VRAM                | Best For                                      |
| ------------------------------ | --------------- | ------------------- | --------------------------------------------- |
| **MOSS-TTS-Nano-100M**         | 100M            | 0 GB (CPU, 4 cores) | Real-time, edge, IVR, on-device               |
| **MOSS-TTS-Nano-100M-ONNX**    | 100M            | 0 GB (CPU)          | Torch-free production serving                 |
| **MOSS-TTS-GGUF**              | 100M (Q4\_K\_M) | 0 GB (CPU)          | llama.cpp-style deployments                   |
| **MOSS-TTS-Local-Transformer** | 1.7B            | 4 GB                | Lightweight GPU, strong objective quality     |
| **MOSS-TTS-Realtime**          | 1.7B            | 4 GB                | Multi-turn voice agents, 180 ms TTFB          |
| **MOSS-VoiceGenerator**        | 1.7B            | 4 GB                | Voice design from text prompts                |
| **MOSS-TTSD-v1.0**             | 8B              | 8 GB                | Multi-speaker dialogue, long podcasts         |
| **MOSS-SoundEffect**           | 8B              | 8 GB                | Sound effect generation with duration control |

### Key Specs

| Spec              | Value                                                                                                                           |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| **Developer**     | OpenMOSS Team · MOSI.AI · Fudan NLP Lab                                                                                         |
| **Architecture**  | Autoregressive (Audio Tokenizer + LLM)                                                                                          |
| **Sample rate**   | 48 kHz, stereo                                                                                                                  |
| **Languages**     | 20 (zh, en, de, es, fr, ja, it, hu, ko, ru, fa, ar, pl, pt, cs, da, sv, el, tr, +1)                                             |
| **Voice cloning** | Zero-shot from \~3s reference                                                                                                   |
| **Streaming**     | Yes — chunked decode on CPU                                                                                                     |
| **License**       | Apache 2.0                                                                                                                      |
| **HuggingFace**   | [OpenMOSS-Team](https://huggingface.co/OpenMOSS-Team)                                                                           |
| **GitHub**        | [OpenMOSS/MOSS-TTS-Nano](https://github.com/OpenMOSS/MOSS-TTS-Nano) · [OpenMOSS/MOSS-TTS](https://github.com/OpenMOSS/MOSS-TTS) |

### Why MOSS-TTS?

* **Zero-GPU deployment** — Nano runs on 4 CPU cores, no CUDA, no Triton
* **48 kHz stereo output** — broadcast-grade, rare in sub-100M models
* **20 languages** — more coverage than Kokoro (\~5) at similar size
* **Zero-shot voice cloning** from \~3s reference audio
* **Torch-free ONNX/GGUF paths** — ship with a 200 MB binary
* **Family scales up** — same tokenizer/API from Nano to 8B TTSD
* **Apache 2.0** — commercial use, no strings
* **From serious research** — Fudan NLP + MOSI.AI, not a hobby project

## Requirements

| Component | Minimum (Nano, CPU)       | Recommended (Nano, CPU) | Full Family (GPU) |
| --------- | ------------------------- | ----------------------- | ----------------- |
| CPU       | 4 cores (x86\_64 / ARM64) | 8 cores                 | 8 cores           |
| RAM       | 4 GB                      | 8 GB                    | 16 GB             |
| GPU       | — (not required)          | — (optional)            | RTX 3060 12 GB+   |
| VRAM      | 0 GB                      | 0 GB                    | 4–8 GB            |
| Disk      | 1 GB                      | 2 GB                    | 10 GB (8B + deps) |
| Python    | 3.12                      | 3.12                    | 3.12              |

{% hint style="success" %}
**Clore.ai tip:** Nano literally does not need a GPU. If you already have a Clore box for other work, TTS is free. If you *want* a GPU for batch throughput or to run the 1.7B/8B variants, an **RTX 3060 12GB (\~$0.10–0.30/day)** is more than enough.
{% endhint %}

## Option A — Python install + quick inference

```bash
conda create -n moss-tts-nano python=3.12 -y
conda activate moss-tts-nano

git clone https://github.com/OpenMOSS/MOSS-TTS-Nano.git
cd MOSS-TTS-Nano
pip install -r requirements.txt
pip install -e .

# If pynini breaks on pip, use conda-forge:
conda install -c conda-forge pynini=2.1.6.post1 -y
```

Inference from the reference audio + target text:

```bash
python infer.py \
  --prompt-audio-path assets/audio/en_1.wav \
  --text "Welcome to Clore.ai — the decentralized GPU marketplace."
# Output: generated_audio/infer_output.wav  (48 kHz stereo)
```

Or via the CLI entrypoint:

```bash
moss-tts-nano generate \
  --prompt-speech ref.wav \
  --text "Hello from MOSS-TTS Nano running on CPU."
```

Web demo (Gradio):

```bash
python app.py
# → http://127.0.0.1:18083
```

## Option B — Docker (CPU and GPU)

**CPU-only** (Nano, \~1 GB image):

```dockerfile
FROM python:3.12-slim
RUN apt-get update && apt-get install -y git build-essential \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
RUN git clone https://github.com/OpenMOSS/MOSS-TTS-Nano.git . \
    && pip install -r requirements.txt && pip install -e .
EXPOSE 18083
CMD ["python", "app.py"]
```

```bash
docker build -t moss-tts-nano-cpu .
docker run --rm -p 18083:18083 moss-tts-nano-cpu
```

**GPU variant** (for Realtime / TTSD / SoundEffect):

```bash
docker run --gpus all -p 18083:18083 \
  pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime \
  bash -c "git clone https://github.com/OpenMOSS/MOSS-TTS.git /app \
           && cd /app \
           && pip install --extra-index-url https://download.pytorch.org/whl/cu124 -e '.[torch-runtime]' \
           && python app.py"
```

## Option C — Zero-shot voice cloning (3s reference)

MOSS-TTS-Nano clones a voice from a short reference clip and handles long-form synthesis via automatic chunking.

```python
from moss_tts_nano import MossTTSNano
import soundfile as sf

model = MossTTSNano.from_pretrained("OpenMOSS-Team/MOSS-TTS-Nano-100M")

# Clone voice from any 3–10s clean clip
audio, sr = model.synthesize(
    text="This is my cloned voice narrating a Clore.ai audiobook chapter.",
    prompt_audio_path="speaker_ref_3s.wav",
    language="en",
)
sf.write("cloned.wav", audio, sr)  # 48 kHz stereo
```

**Quality tips (ported from the XTTS playbook — same principles apply):**

* Use 3–10s of **clean** reference (no background music, no room reverb)
* Match the language of reference and target text when possible
* Normalize and trim silence before passing in (`librosa.effects.trim`)
* For consistent long-form narration, reuse the same reference across calls
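The trim-and-normalize step can also be done without librosa. The helper below is a minimal numpy stand-in for `librosa.effects.trim` plus peak normalization (our own sketch, not part of the MOSS API):

```python
import numpy as np

def prep_reference(y: np.ndarray, top_db: float = 30.0) -> np.ndarray:
    """Peak-normalize a mono clip, then trim samples quieter than top_db
    below peak from both ends (rough stand-in for librosa.effects.trim)."""
    peak = np.max(np.abs(y))
    if peak == 0:
        return y
    y = y / peak                       # peak-normalize to [-1, 1]
    thresh = 10 ** (-top_db / 20.0)    # amplitude threshold relative to peak
    loud = np.where(np.abs(y) > thresh)[0]
    if loud.size == 0:
        return y
    return y[loud[0]:loud[-1] + 1]     # keep span between first/last loud sample
```

Unlike `librosa.effects.trim`, this works sample-by-sample rather than frame-by-frame, which is good enough for cleaning a short reference clip.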

## Option D — GGUF on llama.cpp-audio / torch-free ONNX

For edge boxes, mobile backends, or anywhere you do not want PyTorch:

```bash
# Clone the main repo with the torch-free extras
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install -e ".[llama-cpp-onnx]"

# Pull GGUF quantized weights (Q4_K_M)
huggingface-cli download OpenMOSS-Team/MOSS-TTS-GGUF --local-dir weights/

# Or pure ONNX build (no torch at all)
huggingface-cli download OpenMOSS-Team/MOSS-TTS-Nano-100M-ONNX --local-dir weights-onnx/
```

This path runs on llama.cpp-compatible tooling — great for Raspberry Pi, Android, or serverless functions where a 200 MB binary matters.

## Clore.ai GPU Recommendations

**You do not need a GPU for Nano.** That is the whole point. But if you want to batch-generate or run the bigger siblings:

| GPU                   | VRAM  | Fits                                | Clore price (approx) |
| --------------------- | ----- | ----------------------------------- | -------------------- |
| **CPU-only instance** | —     | Nano, Nano-ONNX, GGUF               | from **$0.01/hr**    |
| RTX 3060 12GB         | 12 GB | Nano + Local-Transformer + Realtime | from $0.10/day       |
| RTX 3090 24GB         | 24 GB | Full TTSD-v1.0 (8B), batch serving  | from $0.30/day       |
| RTX 4090 24GB         | 24 GB | TTSD + SoundEffect concurrent       | from $0.50/day       |

{% hint style="success" %}
For 90% of production TTS workloads — voice agents, IVR, narration — a **CPU-only Clore.ai box is the cheapest viable deployment**. Rent it, run MOSS-TTS-Nano, and forget about GPU bills.
{% endhint %}
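Note the units when comparing rows in the table above: the CPU row is quoted per hour and the GPU rows per day. A quick monthly comparison at the approximate floor prices:

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30

def monthly_from_hourly(usd_per_hour: float) -> float:
    """Monthly cost for a rate quoted per hour (the CPU row)."""
    return round(usd_per_hour * HOURS_PER_DAY * DAYS_PER_MONTH, 2)

def monthly_from_daily(usd_per_day: float) -> float:
    """Monthly cost for a rate quoted per day (the GPU rows)."""
    return round(usd_per_day * DAYS_PER_MONTH, 2)

cpu_box = monthly_from_hourly(0.01)  # CPU-only floor price -> 7.2 USD/month
rtx3090 = monthly_from_daily(0.30)   # RTX 3090 floor price -> 9.0 USD/month
```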

## Use Cases

* **Audiobooks** — long-form narration with consistent cloned voice, automatic chunking
* **Voice agents** — sub-second TTFB on Realtime variant for conversational AI
* **IVR / phone systems** — CPU-only deploy, 48 kHz stereo, 20 languages
* **Game NPCs** — lightweight enough to ship inside a game client, voice design per character
* **Dubbing** — multilingual cloning for localization pipelines
* **Podcast generation** — MOSS-TTSD-v1.0 handles multi-speaker dialogue natively
* **Sound effects** — MOSS-SoundEffect adds duration-controlled FX to the pipeline

## Benchmarks / Quality

* **MOSS-TTSD-v1.0** outperformed Doubao and Gemini 2.5-pro on subjective multi-speaker dialogue evals
* **Nano** delivers real-time factor **< 1.0 on 4 CPU cores** (i.e. faster than playback)
* **Realtime** variant reports **\~180 ms time-to-first-byte** for conversational use
* Stereo 48 kHz output is a clear step up from 24 kHz mono competitors at this param budget
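Real-time factor is easy to verify on your own hardware: wall-clock synthesis time divided by the duration of the audio produced. A small helper (ours, not part of the MOSS API):

```python
def real_time_factor(synthesis_seconds: float, audio_samples: int,
                     sample_rate: int = 48_000) -> float:
    """RTF = synthesis time / audio duration; below 1.0 beats playback."""
    audio_seconds = audio_samples / sample_rate
    return synthesis_seconds / audio_seconds

# 2.4 s of wall-clock time to produce 4.8 s of 48 kHz audio -> RTF 0.5
rtf = real_time_factor(2.4, audio_samples=230_400)
```

Wrap the `model.synthesize` call from Option C in `time.perf_counter()` to measure `synthesis_seconds`.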

## Troubleshooting

| Problem                           | Solution                                                                                 |
| --------------------------------- | ---------------------------------------------------------------------------------------- |
| `pynini` install fails via pip    | `conda install -c conda-forge pynini=2.1.6.post1 -y` then reinstall WeTextProcessing     |
| Choppy audio on CPU               | Ensure 4+ physical cores; disable SMT/HT oversubscription; use ONNX build                |
| Cloned voice sounds off           | Reference must be 3–10s, clean, single speaker, language-matched                         |
| OOM on TTSD-v1.0                  | Use FP16 (`model.half()`) or drop down to the 1.7B Local-Transformer                     |
| Model download stalls             | Set `HF_HUB_ENABLE_HF_TRANSFER=1` and retry                                              |
| Slow first run                    | First inference compiles kernels / downloads \~400 MB weights — subsequent runs are fast |
| Torch conflicts with other models | Use the `[llama-cpp-onnx]` extras for a torch-free environment                           |

## Next Steps

* [Kokoro TTS](https://docs.clore.ai/guides/audio-and-voice/kokoro-tts) — the 82M English-first alternative if you do not need multilingual
* [Voxtral TTS](https://docs.clore.ai/guides/audio-and-voice/voxtral-tts) — 4B Mistral model, 9 languages, GPU-required but higher ceiling
* [XTTS (Coqui)](https://docs.clore.ai/guides/audio-and-voice/xtts-coqui) — 17-language voice cloning, GPU-only, larger
* [Whisper Transcription](https://docs.clore.ai/guides/audio-and-voice/whisper-transcription) — pair MOSS-TTS with Whisper for full voice pipelines
* [Rent a GPU (or CPU) on Clore.ai Marketplace](https://clore.ai/marketplace)

***

*Last updated: April 20, 2026*

