MOSS-TTS (CPU-only, 100M)
Run MOSS-TTS — ultra-lightweight 100M-parameter CPU-first multilingual text-to-speech from OpenMOSS (MOSI.AI + Fudan NLP) on Clore.ai.
MOSS-TTS is an open-source speech generation family from OpenMOSS (Shanghai Innovation Institute, in collaboration with Fudan NLP and MOSI.AI, led by Prof. Xipeng Qiu). The flagship MOSS-TTS-Nano has just 100M parameters, runs in real time on a 4-core CPU with no GPU, outputs 48 kHz stereo, and supports 20 languages with zero-shot voice cloning. The full family scales up to 8B for multi-speaker dialogue, voice design, and sound-effect generation.
Released: April 10, 2026 (Nano) · ONNX CPU build April 17, 2026 · License: Apache 2.0
If Kokoro owns the 82M-param Western-English niche, MOSS-TTS-Nano owns the CPU-first multilingual niche: same tiny-model philosophy, but stereo 48 kHz, 20 languages, voice cloning, and a torch-free ONNX/GGUF path. For anyone who wants to ship TTS without paying for a GPU — this is the model.
MOSS-TTS Family
| Model | Parameters | VRAM | Best for |
| --- | --- | --- | --- |
| MOSS-TTS-Nano-100M | 100M | 0 GB (CPU, 4 cores) | Real-time, edge, IVR, on-device |
| MOSS-TTS-Nano-100M-ONNX | 100M | 0 GB (CPU) | Torch-free production serving |
| MOSS-TTS-GGUF | 100M (Q4_K_M) | 0 GB (CPU) | llama.cpp-style deployments |
| MOSS-TTS-Local-Transformer | 1.7B | 4 GB | Lightweight GPU, strong objective quality |
| MOSS-TTS-Realtime | 1.7B | 4 GB | Multi-turn voice agents, 180 ms TTFB |
| MOSS-VoiceGenerator | 1.7B | 4 GB | Voice design from text prompts |
| MOSS-TTSD-v1.0 | 8B | 8 GB | Multi-speaker dialogue, long podcasts |
| MOSS-SoundEffect | 8B | 8 GB | Sound effect generation with duration control |
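The VRAM figures listed above can be turned into a quick fit check when planning which variants to co-host on one GPU. The following is a minimal sketch, not part of any MOSS-TTS tooling: the `VRAM_GB` dictionary and the `total_vram_gb` and `fits` helpers are hypothetical names, and the check simply sums the listed figures without allowing headroom for batching or activations.

```python
# VRAM figures (GB) copied from the family listing; 0 means CPU-only.
# The helper names below are illustrative, not part of MOSS-TTS.
VRAM_GB = {
    "MOSS-TTS-Nano-100M": 0,
    "MOSS-TTS-Nano-100M-ONNX": 0,
    "MOSS-TTS-GGUF": 0,
    "MOSS-TTS-Local-Transformer": 4,
    "MOSS-TTS-Realtime": 4,
    "MOSS-VoiceGenerator": 4,
    "MOSS-TTSD-v1.0": 8,
    "MOSS-SoundEffect": 8,
}


def total_vram_gb(models: list[str]) -> int:
    """Sum the listed VRAM figures for a set of co-hosted models."""
    unknown = [m for m in models if m not in VRAM_GB]
    if unknown:
        raise KeyError(f"unknown model(s): {unknown}")
    return sum(VRAM_GB[m] for m in models)


def fits(models: list[str], gpu_vram_gb: int) -> bool:
    """True if the listed figures fit in the given VRAM (no headroom added)."""
    return total_vram_gb(models) <= gpu_vram_gb


if __name__ == "__main__":
    pair = ["MOSS-TTSD-v1.0", "MOSS-SoundEffect"]
    print(total_vram_gb(pair), fits(pair, 12), fits(pair, 24))
```

Treat the result as a lower bound: real deployments usually need extra memory beyond the listed figure, so a tight fit is a reason to pick the next tier up.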
Key Specs
| Spec | Value |
| --- | --- |
| Developer | OpenMOSS Team · MOSI.AI · Fudan NLP Lab |
| Architecture | Autoregressive (Audio Tokenizer + LLM) |
| Sample rate | 48 kHz, stereo |
| Languages | 20 (zh, en, de, es, fr, ja, it, hu, ko, ru, fa, ar, pl, pt, cs, da, sv, el, tr, +1) |
| Voice cloning | Zero-shot from ~3s reference |
| Streaming | Yes, chunked decode on CPU |
| License | Apache 2.0 |
| HuggingFace | |
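The 48 kHz stereo spec has practical consequences for storage and bandwidth. The sketch below works out the raw PCM data rate and writes a stereo WAV file with Python's standard `wave` module. It assumes 16-bit interleaved PCM, which is an assumption for illustration (the output format MOSS-TTS actually returns is not stated here), and `stereo_tone` is a stand-in for model output.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 48_000  # Hz, from the spec listing
CHANNELS = 2          # stereo
SAMPLE_WIDTH = 2      # bytes per sample (16-bit PCM is an assumption)


def pcm_bytes_per_second() -> int:
    """Raw PCM data rate for 48 kHz stereo 16-bit audio."""
    return SAMPLE_RATE * CHANNELS * SAMPLE_WIDTH


def stereo_tone(seconds: float, freq_hz: float = 440.0) -> bytes:
    """Stand-in for model output: a sine tone as interleaved L/R int16 frames."""
    n_frames = int(seconds * SAMPLE_RATE)
    frames = bytearray()
    for i in range(n_frames):
        value = int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        frames += struct.pack("<hh", value, value)  # same signal on both channels
    return bytes(frames)


def write_wav(target, pcm: bytes) -> None:
    """Write interleaved int16 PCM to a path or binary file object as WAV."""
    with wave.open(target, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)


if __name__ == "__main__":
    buffer = io.BytesIO()
    write_wav(buffer, stereo_tone(0.5))
    print(pcm_bytes_per_second(), "bytes/s;", len(buffer.getvalue()), "bytes written")
```

At this rate, one minute of uncompressed output is roughly 11.5 MB, which is worth budgeting for when storing long-form narration.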
Why MOSS-TTS?
Zero-GPU deployment — Nano runs on 4 CPU cores, no CUDA, no Triton
48 kHz stereo output, broadcast-grade and rare in ~100M-parameter models
20 languages — more coverage than Kokoro (~5) at similar size
Zero-shot voice cloning from ~3s reference audio
Torch-free ONNX/GGUF paths — ship with a 200 MB binary
Family scales up — same tokenizer/API from Nano to 8B TTSD
Apache 2.0 — commercial use, no strings
From serious research — Fudan NLP + MOSI.AI, not a hobby project
Requirements
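| Component | Minimum (Nano) | Recommended (Nano) | 1.7B / 8B variants |
| --- | --- | --- | --- |
| CPU | 4 cores (x86_64 / ARM64) | 8 cores | 8 cores |
| RAM | 4 GB | 8 GB | 16 GB |
| GPU | Not required | Optional | RTX 3060 12 GB+ |
| VRAM | 0 GB | 0 GB | 4–8 GB |
| Disk | 1 GB | 2 GB | 10 GB (8B + deps) |
| Python | 3.12 | 3.12 | 3.12 |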
Clore.ai tip: Nano does not need a GPU. If you already rent a Clore box for other work, TTS comes at no extra cost. If you want a GPU for batch throughput or to run the 1.7B/8B variants, an RTX 3060 12GB (~$0.10–0.30/day) is enough.
Option A — Python install + quick inference
Inference from the reference audio + target text:
Or via the CLI entrypoint:
Web demo (Gradio):
Option B — Docker (CPU and GPU)
CPU-only (Nano, ~1 GB image):
GPU variant (for Realtime / TTSD / SoundEffect):
Option C — Zero-shot voice cloning (3s reference)
MOSS-TTS-Nano clones a voice from a short reference clip and handles long-form synthesis via automatic chunking.
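To make the idea of chunked long-form synthesis concrete, here is a minimal sketch of sentence-aware text chunking. It is not the chunker MOSS-TTS-Nano uses internally (that logic is not described here). The `chunk_text` function and its `max_chars` limit are hypothetical, chosen only to show the general technique of splitting at sentence boundaries and synthesizing each piece in turn.

```python
import re

# Split after Latin sentence punctuation followed by whitespace,
# or directly after CJK sentence punctuation (no whitespace needed).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    sentences = [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-wrapped.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the open chunk
        else:
            chunks.append(current)  # close the open chunk, start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("One. Two! Three? Four.", max_chars=10):
        print(repr(piece))
```

Each chunk would then be synthesized with the same reference clip and the resulting audio concatenated, which keeps memory use flat regardless of script length.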
Quality tips (ported from the XTTS playbook — same principles apply):
Use 3–10s of clean reference (no background music, no room reverb)
Match the language of reference and target text when possible
Normalize and trim silence before passing in (librosa.effects.trim)
For consistent long-form narration, reuse the same reference across calls
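The reference-clip tips above can be sketched as a small preprocessing step. In practice `librosa.effects.trim` handles the silence trimming. The version below is a dependency-free stand-in that works on plain lists of float samples in the range [-1, 1], so the logic is easy to follow. The function names and the amplitude threshold are illustrative assumptions, not part of MOSS-TTS or librosa.

```python
def trim_silence(samples: list[float], threshold: float = 0.01) -> list[float]:
    """Drop leading and trailing samples whose magnitude is below threshold."""
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


def peak_normalize(samples: list[float], peak: float = 0.95) -> list[float]:
    """Scale samples so the loudest one has magnitude `peak`."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = peak / loudest
    return [s * gain for s in samples]


def prepare_reference(
    samples: list[float],
    sample_rate: int,
    min_seconds: float = 3.0,
    max_seconds: float = 10.0,
) -> list[float]:
    """Trim, check the 3-10 s duration window, then peak-normalize."""
    trimmed = trim_silence(samples)
    duration = len(trimmed) / sample_rate
    if not min_seconds <= duration <= max_seconds:
        raise ValueError(
            f"reference is {duration:.2f}s after trimming; "
            f"expected {min_seconds}-{max_seconds}s"
        )
    return peak_normalize(trimmed)
```

Checking the duration after trimming matters: a 4-second file with 2 seconds of leading silence carries only 2 seconds of usable voice.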
Option D — GGUF on llama.cpp-audio / torch-free ONNX
For edge boxes, mobile backends, or anywhere you do not want PyTorch:
This path runs on llama.cpp-compatible tooling — great for Raspberry Pi, Android, or serverless functions where a 200 MB binary matters.
Clore.ai GPU Recommendations
You do not need a GPU for Nano. That is the whole point. But if you want to batch-generate or run the bigger siblings:
| Instance | VRAM | Runs | Price |
| --- | --- | --- | --- |
| CPU-only instance | None | Nano, Nano-ONNX, GGUF | from $0.01/hr |
| RTX 3060 12GB | 12 GB | Nano + Local-Transformer + Realtime | from $0.10/day |
| RTX 3090 24GB | 24 GB | Full TTSD-v1.0 (8B), batch serving | from $0.30/day |
| RTX 4090 24GB | 24 GB | TTSD + SoundEffect concurrent | from $0.50/day |
For most production TTS workloads (voice agents, IVR, narration), a CPU-only Clore.ai box is the cheapest viable deployment. Rent it, run MOSS-TTS-Nano, and forget about GPU bills.
Use Cases
Audiobooks — long-form narration with consistent cloned voice, automatic chunking
Voice agents — sub-second TTFB on Realtime variant for conversational AI
IVR / phone systems — CPU-only deploy, 48 kHz stereo, 20 languages
Game NPCs — lightweight enough to ship inside a game client, voice design per character
Dubbing — multilingual cloning for localization pipelines
Podcast generation — MOSS-TTSD-v1.0 handles multi-speaker dialogue natively
Sound effects — MOSS-SoundEffect adds duration-controlled FX to the pipeline
Benchmarks / Quality
MOSS-TTSD-v1.0 outperformed Doubao and Gemini 2.5-pro on subjective multi-speaker dialogue evals
Nano delivers real-time factor < 1.0 on 4 CPU cores (i.e. faster than playback)
Realtime variant reports ~180 ms time-to-first-byte for conversational use
Stereo 48 kHz output is a clear step up from 24 kHz mono competitors at this param budget
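The two latency figures quoted above, real-time factor (RTF) and time-to-first-byte (TTFB), are straightforward to measure on your own hardware. The sketch below defines RTF as synthesis time divided by audio duration, so a value below 1.0 means faster than playback. It times any iterable of PCM chunks; `fake_stream` is a stand-in for a streaming synthesizer, and the 192,000 bytes/s rate assumes 48 kHz stereo 16-bit PCM. None of these names come from MOSS-TTS.

```python
import time
from typing import Iterable, Iterator


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_stream(chunks: Iterable[bytes], bytes_per_second: int) -> dict:
    """Consume a stream of PCM chunks and report TTFB, total time, and RTF."""
    start = time.perf_counter()
    first_chunk_at = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()  # first audio arrived
        total_bytes += len(chunk)
    end = time.perf_counter()
    if first_chunk_at is None or total_bytes == 0:
        raise ValueError("stream produced no audio")
    audio_seconds = total_bytes / bytes_per_second
    total_seconds = end - start
    return {
        "ttfb_seconds": first_chunk_at - start,
        "total_seconds": total_seconds,
        "audio_seconds": audio_seconds,
        "rtf": real_time_factor(total_seconds, audio_seconds),
    }


def fake_stream(n_chunks: int = 5, chunk_bytes: int = 19_200,
                delay_s: float = 0.01) -> Iterator[bytes]:
    """Stand-in synthesizer: yields silent PCM chunks after a short delay."""
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield bytes(chunk_bytes)


if __name__ == "__main__":
    stats = measure_stream(fake_stream(), bytes_per_second=192_000)
    print({key: round(value, 4) for key, value in stats.items()})
```

Run the measurement several times and discard the first pass, since the initial inference includes one-off setup cost.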
Troubleshooting
| Problem | Fix |
| --- | --- |
| pynini install fails via pip | Run conda install -c conda-forge pynini=2.1.6.post1 -y, then reinstall WeTextProcessing |
| Choppy audio on CPU | Ensure 4+ physical cores; disable SMT/HT oversubscription; use ONNX build |
| Cloned voice sounds off | Reference must be 3–10s, clean, single speaker, language-matched |
| OOM on TTSD-v1.0 | Use FP16 (model.half()) or drop down to the 1.7B Local-Transformer |
| Model download stalls | Set HF_HUB_ENABLE_HF_TRANSFER=1 and retry |
| Slow first run | First inference compiles kernels and downloads ~400 MB of weights; subsequent runs are fast |
| Torch conflicts with other models | Use the [llama-cpp-onnx] extras for a torch-free environment |
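For the choppy-audio case, one way to avoid SMT/HT oversubscription is to cap worker threads at the number of physical cores. The sketch below estimates that number from `os.cpu_count()`, which reports logical CPUs, under the assumption of 2-way SMT. That assumption does not hold on every machine, so check your hardware. `OMP_NUM_THREADS` is the standard OpenMP thread-count variable, honored by OpenMP-based runtimes such as PyTorch. Whether a given MOSS-TTS build reads it is not confirmed here, and the helper names are hypothetical.

```python
import os
from typing import Optional


def recommended_threads(logical_cpus: Optional[int] = None, smt_ways: int = 2) -> int:
    """Estimate physical cores as logical CPUs divided by the SMT width."""
    if smt_ways < 1:
        raise ValueError("smt_ways must be at least 1")
    logical = logical_cpus if logical_cpus is not None else (os.cpu_count() or 1)
    return max(1, logical // smt_ways)


def meets_minimum(threads: int, min_cores: int = 4) -> bool:
    """Check the 4-physical-core floor suggested for Nano."""
    return threads >= min_cores


def apply_thread_cap(threads: int) -> None:
    """Set OMP_NUM_THREADS unless the user already chose a value.

    This must run before importing the inference library to take effect.
    """
    os.environ.setdefault("OMP_NUM_THREADS", str(threads))


if __name__ == "__main__":
    n = recommended_threads()
    apply_thread_cap(n)
    print(f"threads={n}, meets 4-core minimum: {meets_minimum(n)}")
```

On machines without SMT, pass `smt_ways=1` so the estimate is not halved unnecessarily.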
Next Steps
Kokoro TTS — the 82M English-first alternative if you do not need multilingual
Voxtral TTS — 4B Mistral model, 9 languages, GPU-required but higher ceiling
XTTS (Coqui) — 17-language voice cloning, GPU-only, larger
Whisper Transcription — pair MOSS-TTS with Whisper for full voice pipelines
Last updated: April 20, 2026