
MOSS-TTS (CPU-only, 100M)

Run MOSS-TTS-Nano, an ultra-lightweight, 100M-parameter, CPU-first multilingual text-to-speech model from OpenMOSS (MOSI.AI + Fudan NLP), on Clore.ai.

MOSS-TTS is an open-source speech generation family from OpenMOSS (Shanghai Innovation Institute, in collaboration with Fudan NLP and MOSI.AI, led by Prof. Xipeng Qiu). The flagship MOSS-TTS-Nano has just 100M parameters, runs in real time on a 4-core CPU with no GPU, outputs 48 kHz stereo, and supports 20 languages with zero-shot voice cloning. The rest of the family scales up to 8B parameters for multi-speaker dialogue, voice design, and sound-effect generation.

Released: April 10, 2026 (Nano) · ONNX CPU build April 17, 2026 · License: Apache 2.0

If Kokoro owns the 82M-parameter Western-English niche, MOSS-TTS-Nano targets the CPU-first multilingual niche: the same tiny-model philosophy, but with 48 kHz stereo output, 20 languages, voice cloning, and a torch-free ONNX/GGUF path. If you want to ship TTS without paying for a GPU, this is the model to try first.

MOSS-TTS Family

| Model | Size | VRAM | Best For |
| --- | --- | --- | --- |
| MOSS-TTS-Nano-100M | 100M | 0 GB (CPU, 4 cores) | Real-time, edge, IVR, on-device |
| MOSS-TTS-Nano-100M-ONNX | 100M | 0 GB (CPU) | Torch-free production serving |
| MOSS-TTS-GGUF | 100M (Q4_K_M) | 0 GB (CPU) | llama.cpp-style deployments |
| MOSS-TTS-Local-Transformer | 1.7B | 4 GB | Lightweight GPU, strong objective quality |
| MOSS-TTS-Realtime | 1.7B | 4 GB | Multi-turn voice agents, 180 ms TTFB |
| MOSS-VoiceGenerator | 1.7B | 4 GB | Voice design from text prompts |
| MOSS-TTSD-v1.0 | 8B | 8 GB | Multi-speaker dialogue, long podcasts |
| MOSS-SoundEffect | 8B | 8 GB | Sound effect generation with duration control |

Key Specs

| Spec | Value |
| --- | --- |
| Developer | OpenMOSS Team · MOSI.AI · Fudan NLP Lab |
| Architecture | Autoregressive (Audio Tokenizer + LLM) |
| Sample rate | 48 kHz, stereo |
| Languages | 20 (zh, en, de, es, fr, ja, it, hu, ko, ru, fa, ar, pl, pt, cs, da, sv, el, tr, +1) |
| Voice cloning | Zero-shot from ~3s reference |
| Streaming | Yes, chunked decode on CPU |
| License | Apache 2.0 |
| HuggingFace | |

Why MOSS-TTS?

  • Zero-GPU deployment: Nano runs on 4 CPU cores, with no CUDA and no Triton

  • 48 kHz stereo output: broadcast-grade, and rare in models of around 100M parameters

  • 20 languages: more coverage than Kokoro (~5) at similar size

  • Zero-shot voice cloning from ~3s of reference audio

  • Torch-free ONNX/GGUF paths: ship with a 200 MB binary

  • Family scales up: same tokenizer/API from Nano to the 8B TTSD

  • Apache 2.0: commercial use is permitted

  • Research-backed: developed by Fudan NLP and MOSI.AI

Requirements

| Component | Minimum (Nano, CPU) | Recommended (Nano, CPU) | Full Family (GPU) |
| --- | --- | --- | --- |
| CPU | 4 cores (x86_64 / ARM64) | 8 cores | 8 cores |
| RAM | 4 GB | 8 GB | 16 GB |
| GPU | Not required | Optional | RTX 3060 12 GB+ |
| VRAM | 0 GB | 0 GB | 4–8 GB |
| Disk | 1 GB | 2 GB | 10 GB (8B + deps) |
| Python | 3.12 | 3.12 | 3.12 |
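
Before installing anything, it can help to check a host against the minimums above. The following is a minimal sketch using only the Python standard library. The function name `check_host` is hypothetical, the thresholds are copied from the "Minimum (Nano, CPU)" column, and the Python requirement is assumed to mean exactly 3.12. RAM is not checked because the standard library has no portable way to read it.

```python
import os
import shutil
import sys

# Thresholds copied from the "Minimum (Nano, CPU)" column above.
MIN_CORES = 4
MIN_DISK_GB = 1
REQUIRED_PYTHON = (3, 12)  # assumption: the table means exactly 3.12


def check_host(cores, free_disk_gb, python_version):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"need {MIN_CORES}+ CPU cores, found {cores}")
    if free_disk_gb < MIN_DISK_GB:
        problems.append(f"need {MIN_DISK_GB} GB free disk, found {free_disk_gb:.1f} GB")
    if tuple(python_version[:2]) != REQUIRED_PYTHON:
        found = ".".join(str(part) for part in python_version[:2])
        problems.append(f"need Python 3.12, found {found}")
    return problems


if __name__ == "__main__":
    # os.cpu_count() counts logical cores and may return None,
    # so treat the result as an upper bound on physical cores.
    cores = os.cpu_count() or 1
    free_gb = shutil.disk_usage(".").free / 1e9
    report = check_host(cores, free_gb, sys.version_info)
    for line in report or ["host meets the Nano minimums"]:
        print(line)
```

Because `os.cpu_count()` reports logical cores, a 2-core machine with hyper-threading will pass the core check while still falling short of 4 physical cores.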

Option A — Python install + quick inference

Inference from the reference audio + target text:

Or via the CLI entrypoint:

Web demo (Gradio):

Option B — Docker (CPU and GPU)

CPU-only (Nano, ~1 GB image):

GPU variant (for Realtime / TTSD / SoundEffect):

Option C — Zero-shot voice cloning (3s reference)

MOSS-TTS-Nano clones a voice from a short reference clip and handles long-form synthesis via automatic chunking.
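
This page does not describe how the automatic chunking works internally. For intuition only, here is a minimal sketch of one common approach: splitting at sentence boundaries and packing sentences into chunks under a length limit. The function `chunk_text` and the `max_chars` parameter are hypothetical and are not part of the MOSS-TTS API.

```python
import re


def chunk_text(text, max_chars=200):
    """Split text into chunks of at most max_chars, breaking at sentence ends.

    A single sentence longer than max_chars is kept whole rather than cut mid-word.
    """
    # Latin punctuation must be followed by whitespace (so "3.14" stays intact);
    # CJK full stops split immediately because no space follows them.
    pattern = r"(?<=[.!?])\s+|(?<=[。！？])"
    sentences = [s.strip() for s in re.split(pattern, text) if s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Breaking at sentence ends rather than at a fixed character count keeps prosody natural at chunk boundaries, which matters when the chunks are synthesized separately and concatenated.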

Quality tips (ported from the XTTS playbook — same principles apply):

  • Use 3–10s of clean reference (no background music, no room reverb)

  • Match the language of reference and target text when possible

  • Normalize and trim silence before passing in (librosa.effects.trim)

  • For consistent long-form narration, reuse the same reference across calls
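
The trimming and normalization tips above can be sketched without any audio library, which makes the logic easy to see. In practice `librosa.effects.trim` does the trimming step. The function `prepare_reference` and its threshold values below are illustrative assumptions, not MOSS-TTS settings.

```python
def prepare_reference(samples, sample_rate, silence_threshold=0.01, target_peak=0.95):
    """Trim leading/trailing silence, peak-normalize, and check the 3-10 s window.

    samples is a sequence of mono floats in [-1.0, 1.0].
    Returns (processed_samples, duration_seconds).
    """
    # Indices of samples loud enough to count as signal.
    voiced = [i for i, s in enumerate(samples) if abs(s) > silence_threshold]
    if not voiced:
        raise ValueError("reference clip contains only silence")
    trimmed = list(samples[voiced[0]:voiced[-1] + 1])

    # Scale so the loudest sample sits at target_peak.
    peak = max(abs(s) for s in trimmed)
    normalized = [s * target_peak / peak for s in trimmed]

    duration = len(normalized) / sample_rate
    if not 3.0 <= duration <= 10.0:
        raise ValueError(f"reference is {duration:.1f} s after trimming; aim for 3-10 s")
    return normalized, duration
```

Checking the duration after trimming, not before, catches clips that look long enough on disk but contain mostly silence.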

Option D — GGUF on llama.cpp-audio / torch-free ONNX

For edge boxes, mobile backends, or anywhere you do not want PyTorch:

This path runs on llama.cpp-compatible tooling — great for Raspberry Pi, Android, or serverless functions where a 200 MB binary matters.

Clore.ai GPU Recommendations

You do not need a GPU for Nano. That is the whole point. But if you want to batch-generate or run the bigger siblings:

| GPU | VRAM | Fits | Clore price (approx) |
| --- | --- | --- | --- |
| CPU-only instance | | Nano, Nano-ONNX, GGUF | from $0.01/hr |
| RTX 3060 12GB | 12 GB | Nano + Local-Transformer + Realtime | from $0.10/day |
| RTX 3090 24GB | 24 GB | Full TTSD-v1.0 (8B), batch serving | from $0.30/day |
| RTX 4090 24GB | 24 GB | TTSD + SoundEffect concurrent | from $0.50/day |

Use Cases

  • Audiobooks — long-form narration with consistent cloned voice, automatic chunking

  • Voice agents — sub-second TTFB on Realtime variant for conversational AI

  • IVR / phone systems — CPU-only deploy, 48 kHz stereo, 20 languages

  • Game NPCs — lightweight enough to ship inside a game client, voice design per character

  • Dubbing — multilingual cloning for localization pipelines

  • Podcast generation — MOSS-TTSD-v1.0 handles multi-speaker dialogue natively

  • Sound effects — MOSS-SoundEffect adds duration-controlled FX to the pipeline

Benchmarks / Quality

  • MOSS-TTSD-v1.0 is reported to outperform Doubao and Gemini 2.5 Pro in subjective multi-speaker dialogue evaluations

  • Nano delivers real-time factor < 1.0 on 4 CPU cores (i.e. faster than playback)

  • Realtime variant reports ~180 ms time-to-first-byte for conversational use

  • Stereo 48 kHz output is a clear step up from 24 kHz mono competitors at this param budget
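
Real-time factor (RTF) is the wall-clock synthesis time divided by the duration of the audio produced, so a value below 1.0 means generation outpaces playback. A minimal sketch for measuring it on your own hardware follows. The `synthesize` argument is a hypothetical stand-in for whatever inference call you use, assumed to return a sequence with one entry per audio frame.

```python
import time


def real_time_factor(synthesize, text, sample_rate=48_000):
    """Return wall-clock synthesis time divided by the duration of the audio produced.

    synthesize is any callable that takes text and returns a sequence with one
    entry per audio frame (a placeholder for the real inference call).
    """
    start = time.perf_counter()
    frames = synthesize(text)
    elapsed = time.perf_counter() - start

    audio_seconds = len(frames) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesize returned no audio")
    return elapsed / audio_seconds
```

Run it on a few sentences of typical length after a warm-up call, since the first inference includes one-time setup costs that would inflate the number.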

Troubleshooting

| Problem | Solution |
| --- | --- |
| pynini install fails via pip | Run `conda install -c conda-forge pynini=2.1.6.post1 -y`, then reinstall WeTextProcessing |
| Choppy audio on CPU | Ensure 4+ physical cores; disable SMT/HT oversubscription; use the ONNX build |
| Cloned voice sounds off | Reference must be 3–10s, clean, single speaker, language-matched |
| OOM on TTSD-v1.0 | Use FP16 (`model.half()`) or drop down to the 1.7B Local-Transformer |
| Model download stalls | Set `HF_HUB_ENABLE_HF_TRANSFER=1` and retry |
| Slow first run | First inference compiles kernels / downloads ~400 MB weights; subsequent runs are fast |
| Torch conflicts with other models | Use the `[llama-cpp-onnx]` extras for a torch-free environment |

Next Steps


Last updated: April 20, 2026
