by k2-fsa · Apache-2.0 · arXiv:2604.00688
State-of-the-art massively multilingual zero-shot TTS supporting 600+ languages — the broadest language coverage among zero-shot TTS models. Uses a diffusion language model architecture with an 8-layer hierarchical audio codebook for high-fidelity synthesis.
Supports [laughter], [breath] tags and pronunciation correction via pinyin/phonemes (`omnivoice-infer`), with a Gradio demo.

by Resemble AI · MIT License
Family of three open-source TTS models (Chatterbox, Chatterbox-Multilingual, Chatterbox-Turbo) designed for natural speech generation with voice cloning. Built-in Perth watermarking for AI audio detection.
Supports [laugh], [cough], [chuckle] tags. Install via `pip install chatterbox-tts`; available on HuggingFace.

| | Chatterbox | OmniVoice |
|---|---|---|
| Model type | Autoregressive (token-by-token) | Masked diffusion (iterative unmasking) |
| Backbone | Custom T3 decoder | Qwen3-0.6B LLM (~600M params) |
| Audio codec | Single codebook stream | 8-layer hierarchical (8×1025 tokens) |
| Streaming | True per-token streaming | No native streaming — full sequence per call |
| Voice cloning | Embedding conditioning | Reference audio tokenization + prefix |
| Languages | Polish + basic multilingual | 600+ languages native |
Chatterbox is real-time because it streams token-by-token — the autoregressive decoder emits tokens sequentially, each decoded to audio and sent immediately. TTFA = time to generate the first few tokens.
OmniVoice uses masked diffusion — all audio tokens start as [MASK] and are iteratively unmasked over N steps. Each step runs a full forward pass over the entire sequence. Tokens are revealed by confidence score, not position. Partial audio cannot be decoded mid-generation.
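The confidence-based unmasking loop described above can be sketched as a toy (illustrative only; the real model's forward pass, scheduler, and codebook handling differ, and `dummy_predict` is a stand-in for the network):

```python
import random

MASK = -1

def masked_diffusion_decode(seq_len, steps, predict):
    """Toy confidence-based iterative unmasking (illustrative only).

    `predict` maps the current partially masked sequence to a list of
    (token, confidence) pairs, one per position -- standing in for a
    full forward pass over the entire sequence at every step.
    """
    tokens = [MASK] * seq_len
    per_step = max(1, seq_len // steps)  # positions revealed per step
    for _ in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = predict(tokens)  # full forward pass each step
        # Reveal the highest-confidence masked positions, not left-to-right.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
    # Fill any stragglers on a final pass.
    preds = predict(tokens)
    return [preds[i][0] if t == MASK else t for i, t in enumerate(tokens)]

def dummy_predict(tokens):
    # Stand-in model: arbitrary token id per position plus a random confidence.
    return [(i % 1025, random.random()) for i in range(len(tokens))]

out = masked_diffusion_decode(seq_len=64, steps=16, predict=dummy_predict)
assert MASK not in out
```

Because every step touches the whole sequence, no prefix of the audio is decodable until the loop finishes, which is why streaming must happen at the text-chunk level instead.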
However, OmniVoice's raw inference speed is so fast (RTF 0.01–0.04) that text-level chunking achieves comparable or better TTFA than Chatterbox's token streaming.
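A minimal sketch of such text-level chunking, where TTFA becomes the latency of the first chunk (a dummy lambda stands in for the actual model call):

```python
import re
import time

def split_sentences(text):
    """Split at sentence boundaries, keeping each terminator."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def chunked_tts(text, generate):
    """Generate chunk-by-chunk; yield (audio, latency) so the caller can
    start playback as soon as chunk 0 is done (TTFA = first latency)."""
    for chunk in split_sentences(text):
        t0 = time.perf_counter()
        audio = generate(chunk)  # stand-in for the model's generate(...)
        yield audio, time.perf_counter() - t0

# Dummy generator: returns placeholder bytes per chunk.
fake = lambda s: b"\x00" * len(s)
chunks = list(chunked_tts("First sentence. Second one! Third?", fake))
assert len(chunks) == 3
```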
Full generation, voice cloned, no chunking. Total wall time from `generate()` to the returned tensor.
| Text | Chars | Audio | 32 steps | 16 steps | 8 steps |
|---|---|---|---|---|---|
| tiny | 12 | 1.2–1.5s | 269ms (RTF 0.222) | 136ms (RTF 0.094) | 72ms (RTF 0.054) |
| short | 66 | 4.3s | 298ms (RTF 0.069) | 151ms (RTF 0.035) | 80ms (RTF 0.019) |
| medium | 196 | 12.6s | 503ms (RTF 0.040) | 255ms (RTF 0.020) | 137ms (RTF 0.011) |
| long | 488 | 28.5–28.9s | 849ms (RTF 0.030) | 446ms (RTF 0.015) | 255ms (RTF 0.009) |
Simulates streaming: split text at sentence boundaries, generate each chunk independently, measure time to first completed chunk.
| Steps | TTFA (chunk 0) | Chunk 0 audio | Chunk 1 gen | Chunk 1 audio | Total |
|---|---|---|---|---|---|
| 32 | 335ms | 7.00s | 327ms | 5.80s | 662ms |
| 16 | 169ms | 7.00s | 165ms | 5.80s | 333ms |
| 8 | 88ms | 7.00s | 86ms | 5.80s | 173ms |
| Steps | TTFA (chunk 0) | Chunks | Total gen | Total audio | RTF |
|---|---|---|---|---|---|
| 32 | 331ms | 5 | 1605ms | 29.6s | 0.054 |
| 16 | 168ms | 5 | 813ms | 29.6s | 0.027 |
| 8 | 87ms | 5 | 423ms | 29.6s | 0.014 |
Per-chunk breakdown (long text, 16 steps):
| Chunk | Gen time | Audio | Content |
|---|---|---|---|
| 0 | 168ms | 7.00s | "Powazny blad w obiegu dokumentow..." |
| 1 | 165ms | 5.80s | "Przez pomylke dokumentacja..." |
| 2 | 150ms | 4.08s | "Incydent zostal zgloszony..." |
| 3 | 171ms | 7.44s | "Linia lotnicza przeprosila..." |
| 4 | 159ms | 5.32s | "Zwiazki zawodowe domagaja sie..." |
`create_voice_clone_prompt()` pre-encodes reference audio into reusable tokens.
| Mode | Generation time |
|---|---|
| Raw ref_audio path (re-encodes each call) | 264ms |
| Pre-cached VoiceClonePrompt | 255ms |
| Prompt creation cost | 37ms (one-time) |
| Per-call savings | 9ms (3%) |
Prompt encoding is already fast (37ms). Caching is still worthwhile for a server to avoid redundant re-encoding.
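A server-side cache for pre-encoded prompts could look like this sketch (the `encode` callable and `VoicePromptCache` name are illustrative; `encode` stands in for `create_voice_clone_prompt()`):

```python
class VoicePromptCache:
    """Cache pre-encoded voice-clone prompts keyed by reference-audio path.

    The ~37ms one-time encoding cost is paid once per voice instead of
    once per request.
    """
    def __init__(self, encode):
        self._encode = encode
        self._cache = {}

    def get(self, ref_audio_path):
        prompt = self._cache.get(ref_audio_path)
        if prompt is None:
            prompt = self._cache[ref_audio_path] = self._encode(ref_audio_path)
        return prompt

# Demo with a fake encoder that records how often it runs.
calls = []
cache = VoicePromptCache(lambda p: (calls.append(p), f"prompt:{p}")[1])
a = cache.get("weronika.wav")
b = cache.get("weronika.wav")
assert a is b and len(calls) == 1  # encoded once, reused thereafter
```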
| Mode | Wall time | Per-request | Speedup |
|---|---|---|---|
| Sequential | 767ms | 255ms each | 1.0x |
| 3x concurrent (thread pool + CUDA streams) | 462ms | 436–460ms each | 1.66x |
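The thread-pool side of that setup can be sketched as follows (a dummy 50ms task replaces the model call; on a GPU each worker would additionally run its call on its own CUDA stream so kernels from different requests can overlap):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def run_concurrent(requests, infer, workers=3):
    """Run up to `workers` inferences in parallel threads.

    `infer` is a stand-in for the model call; with a real GPU model each
    thread would wrap its call in a dedicated torch.cuda.Stream.
    """
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(infer, requests))
    return results, time.perf_counter() - t0

# Dummy "inference" that sleeps 50ms. Three concurrent calls finish in
# roughly one task's duration rather than three.
results, wall = run_concurrent(
    ["a", "b", "c"], lambda s: (time.sleep(0.05), s.upper())[1]
)
assert results == ["A", "B", "C"]
```

Per-request latency rises under concurrency (requests share the GPU), but total throughput improves, matching the 1.66x speedup measured above.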
| Request | Text | Latency | Audio |
|---|---|---|---|
| 0 | tiny (12 chars) | 408ms | 1.51s |
| 1 | medium (196 chars) | 502ms | 12.58s |
| 2 | long (488 chars) | 555ms | 28.61s |
Total wall time for all three concurrent requests: 558ms.
An `InferenceSlot` + `SlotPool` pattern would improve this further.

Isolating overhead (averaged over 10 runs, medium text, 16 steps):
| Stage | Time | % of total |
|---|---|---|
| Token generation (16 steps) | ~245ms | 92% |
| Audio decode (HiggsAudioV2) | 10ms | 4% |
| Post-process (silence removal, fade, norm) | 10ms | 4% |
Token generation dominates. Decode and post-processing are negligible. Optimization should focus on the forward pass.
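The `InferenceSlot` + `SlotPool` idea mentioned earlier could be sketched like this (a hypothetical pattern, not an OmniVoice API): N pre-initialized slots, each owning its own buffers and stream, checked out per request so concurrent inferences never contend on shared state.

```python
import queue

class SlotPool:
    """Fixed pool of pre-initialized inference slots (sketch).

    `make_slot` builds one slot's private resources (e.g. a CUDA stream
    and scratch buffers); acquire() blocks when all slots are busy, which
    doubles as admission control.
    """
    def __init__(self, make_slot, n):
        self._q = queue.Queue()
        for _ in range(n):
            self._q.put(make_slot())

    def acquire(self, timeout=None):
        return self._q.get(timeout=timeout)

    def release(self, slot):
        self._q.put(slot)

pool = SlotPool(make_slot=lambda: {"stream": object()}, n=3)
slot = pool.acquire()
try:
    pass  # run one inference using slot["stream"]
finally:
    pool.release(slot)
```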
Best strategy: split at first sentence, generate short first chunk for minimum TTFA, generate rest while first chunk plays.
| Steps | TTFA | First chunk plays | Rest gen time | Gap? |
|---|---|---|---|---|
| 32 | 335ms | 7.00s | 700ms | No — rest ready 6.3s early |
| 16 | 169ms | 7.00s | 357ms | No — rest ready 6.6s early |
| 8 | 87ms | 7.00s | 185ms | No — rest ready 6.8s early |
Even for the longest text (488 chars, 29s audio), there is zero playback gap at any step count. The first chunk produces 7s of audio, providing a massive buffer window. Margin is 6–7 seconds — enough to absorb network jitter, encoding, and client buffering.
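The gap-free condition is simple arithmetic: playback stays continuous as long as the remaining audio finishes generating before the first chunk finishes playing. A small sketch using the 16-step numbers above:

```python
def playback_gap(first_chunk_audio_s, rest_gen_s):
    """Return (gap_seconds, spare_buffer_seconds).

    Playback is gap-free when the rest of the audio generates faster
    than the first chunk plays; the difference is the safety margin.
    """
    margin = first_chunk_audio_s - rest_gen_s
    return max(0.0, -margin), margin

# 16-step row: 7.00s first chunk, 0.357s to generate the rest.
gap, margin = playback_gap(7.00, 0.357)
assert gap == 0.0 and round(margin, 2) == 6.64
```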
| Scenario | Peak VRAM |
|---|---|
| Model loaded (idle) | 5.41 GB |
| 1 inference | 5.61 GB |
| 3 concurrent inferences | 5.87 GB |
| Headroom on 32 GB | 26.1 GB free |
| Estimated max concurrent | ~5 |
Incremental cost per concurrent inference is ~150 MB. Substantial headroom for additional model instances or concurrent requests.
| Metric | Chatterbox | OmniVoice (16 steps) | Winner |
|---|---|---|---|
| TTFA | ~250ms | 169ms | OmniVoice |
| RTF (medium text) | 0.05–0.10 | 0.020 | OmniVoice |
| Streaming type | True token-level | Chunk-level | Chatterbox |
| Playback gaps | None | None (7s buffer) | Tie |
| Voice quality | Good | Excellent | OmniVoice |
| Voice cloning | Embedding conditioning | Ref audio + text | OmniVoice |
| Languages | Polish + limited | 600+ | OmniVoice |
| VRAM usage | 4–6 GB | 5.6 GB | Tie |
| Concurrent users | 3–4 with slot pool | 3–5 on single GPU | Tie |
Same text, same voice (weronika), same GPU. Chatterbox generated via production server. OmniVoice at 16 steps (recommended) and 32 steps (best quality).
Step count and chunking tradeoffs within OmniVoice.
OmniVoice supports expressive non-verbal tags embedded directly in text. All samples: Polish, voice-cloned (weronika), 16 steps.
Supported tags: [laughter], [sigh], [confirmation-en], [question-ah], [question-oh], [question-ei], [question-yi], [surprise-ah], [surprise-oh], [surprise-wa], [surprise-yo], [dissatisfaction-hnn]
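Since an unsupported tag would reach the model as literal text, a small pre-check against the supported list can catch typos before synthesis (an illustrative helper, not part of the OmniVoice API):

```python
import re

SUPPORTED_TAGS = {
    "laughter", "sigh", "confirmation-en",
    "question-ah", "question-oh", "question-ei", "question-yi",
    "surprise-ah", "surprise-oh", "surprise-wa", "surprise-yo",
    "dissatisfaction-hnn",
}

def check_tags(text):
    """Return any [tag] occurrences in `text` that are not supported."""
    return [t for t in re.findall(r"\[([a-z-]+)\]", text)
            if t not in SUPPORTED_TAGS]

assert check_tags("No coz [sigh] trzeba bylo.") == []
assert check_tags("Hello [giggle] there") == ["giggle"]
```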
| Tag | Text | Gen | Audio |
|---|---|---|---|
| [laughter] | "...to sie stalo [laughter] naprawde nie moge." | 156ms | 3.91s |
| [sigh] | "No coz [sigh] trzeba bylo to przewidziec." | 150ms | 2.41s |
| [confirmation-en] | "[confirmation-en] tak, dokladnie o to mi chodzilo." | 152ms | 2.77s |
| [question-ah] | "Naprawde tak uwazasz [question-ah] bo ja mam watpliwosci." | 153ms | 3.68s |
| [question-oh] | "[question-oh] a to ciekawe, kiedy to sie stalo?" | 151ms | 2.74s |
| [question-ei] | "Mowisz powaznie [question-ei] nie zartujesz?" | 147ms | 2.71s |
| [surprise-ah] | "[surprise-ah] nie spodziewalam sie tego!" | 146ms | 2.61s |
| [surprise-oh] | "[surprise-oh] to niesamowite co sie wydarzylo." | 151ms | 2.76s |
| [surprise-wa] | "[surprise-wa] ale rewelacja, nie do wiary!" | 150ms | 2.30s |
| [surprise-yo] | "Wygralismy konkurs [surprise-yo] fantastycznie!" | 148ms | 3.08s |
| [dissatisfaction-hnn] | "[dissatisfaction-hnn] no nie wiem, to mnie nie przekonuje." | 155ms | 3.38s |
| Mixed (4 tags) | "...co sie stalo [question-ah] ...wygrali [surprise-oh] ...stracili [sigh] ...[dissatisfaction-hnn] trzeba bylo..." | 247ms | 10.44s |
| No tags (control) | "Nie moge uwierzyc, ze to sie stalo, naprawde nie moge." | 153ms | 3.40s |
All tags produce distinct non-verbal sounds at the marked positions. The mixed sample (4 tags, 10.44s) demonstrates natural flow between speech and non-verbal expression. Generation time stays consistent (~150ms) regardless of tag count; the mixed sample takes longer (247ms) only because its text is longer.
OmniVoice is viable for real-time streaming TTS and outperforms Chatterbox on raw speed metrics. The masked-diffusion architecture prevents true token-level streaming, but sentence-level chunking achieves 169ms TTFA at 16 steps with zero playback gaps. Combined with excellent voice quality, 600+ language support, and low VRAM footprint, OmniVoice is a strong candidate for production TTS deployment.
For consulting on real-time TTS integration, streaming architecture, or AI/ML engineering, contact Folx.