I don't mind so much the size in MB, the fact that it's pure CPU and the quality...

Dayshine · 2025-08-06T07:03:27 1754463807

Nvidia's Parakeet https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 appears to be state of the art for English: 10x faster than Whisper.

My mid-range AMD CPU is multiple times faster than realtime with parakeet.

colechristensen · 2025-08-06T06:07:22 1754460442

>Aside: Are there any models for understanding voice to text, fully offline, without training?

OpenAI's Whisper is a few years old and pretty solid.

https://github.com/openai/whisper

Hackbraten · 2025-08-06T18:33:44 1754505224

Whisper tends to fill silence with random garbage from its training set. [0] [1] [2]

[0]: https://github.com/openai/whisper/discussions/679

[1]: https://github.com/openai/whisper/discussions/928

[2]: https://github.com/openai/whisper/discussions/2608

jiehong · 2025-08-06T06:39:33 1754462373

Voice to text fully offline can be done with Whisper. A few apps offer it for dictation or transcription.

blensor · 2025-08-06T05:46:22 1754459182

"The brown fox jumps over the lazy dog.."

Average duration per generation: 1.28 seconds

Characters processed per second: 30.35

--

"Um"

Average duration per generation: 0.22 seconds

Characters processed per second: 9.23

--

"The brown fox jumps over the lazy dog.. The brown fox jumps over the lazy dog.."

Average duration per generation: 2.25 seconds

Characters processed per second: 35.04

--

processor : 0

vendor_id : AuthenticAMD

cpu family : 25

model : 80

model name : AMD Ryzen 7 5800H with Radeon Graphics

stepping : 0

microcode : 0xa50000c

cpu MHz : 1397.397

cache size : 512 KB
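For anyone wanting to reproduce figures like the ones above, here is a minimal sketch of a timing harness. It is not the script that produced these numbers: `benchmark` and `fake_tts` are hypothetical names, and `fake_tts` is only a stand-in that sleeps for a fixed overhead plus a per-character cost. Swap in a call to the real model to get meaningful results.

```python
import time
from statistics import mean


def benchmark(synthesize, text, runs=5):
    """Time synthesize(text) over several runs.

    Returns (average seconds per generation, characters per second).
    """
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)
        durations.append(time.perf_counter() - start)
    avg = mean(durations)
    return avg, len(text) / avg


def fake_tts(text):
    # Hypothetical stand-in for a TTS call: a fixed overhead plus a
    # per-character cost. Replace with the real model invocation.
    time.sleep(0.001 + 0.0001 * len(text))


for sample in ["Um", "The brown fox jumps over the lazy dog.."]:
    avg, cps = benchmark(fake_tts, sample)
    print(f"{sample!r}: {avg:.4f} s per generation, {cps:.2f} chars/s")
```

A fixed per-call overhead is one plausible reason the two-character input shows a much lower characters-per-second figure than the full sentences, though the numbers above do not prove it.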

moffkalast · 2025-08-06T06:42:46 1754462566

Hmm, that actually seems extremely slow. Piper can crank out a sentence almost instantly on a Pi 4, which is like a sloth compared to that Ryzen, and the speech quality seems about the same at first glance.

I suppose it would make sense if you want to include it on top of an LLM that's already occupying most of a GPU and this could run in the limited VRAM that's left.

keyle · 2025-08-06T05:57:36 1754459856

Assuming most answers will be more than a sentence, 2.25 seconds is already long enough if you factor in the token generation in between... and imagine with reasoning!... We're not there yet.

Teever · 2025-08-06T05:06:39 1754456799

Any idea what factors play into latency in TTS models?

divamgupta · 2025-08-06T07:21:52 1754464912

Mostly model size and input size. Some models that use attention are O(N^2) in the input length.
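To make the O(N^2) point concrete, here is a rough pure-Python sketch of the score step in full self-attention: every position is compared with every other position, so the number of scores grows with the square of the sequence length. The helper names are made up for illustration, and N here means the number of tokens or frames the model sees, which is an assumption and not necessarily the character count. Real models add projections, softmax and multiple heads and layers on top.

```python
import random


def attention_scores(vectors):
    """Dot product of every position with every position: an N x N matrix."""
    return [[sum(a * b for a, b in zip(q, k)) for k in vectors] for q in vectors]


def random_sequence(n, dim=8, seed=0):
    """Hypothetical helper: n random vectors standing in for token embeddings."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]


for n in (10, 20, 40):
    scores = attention_scores(random_sequence(n))
    # Doubling the sequence length quadruples the number of scores.
    print(n, len(scores) * len(scores[0]))
# prints:
# 10 100
# 20 400
# 40 1600
```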