

I believe most people have already moved to offline engines. There's no need to send your data to some random third party like this AssemblyAI: there's Nemo Conformer from Nvidia, Robust wav2vec from Facebook, Vosk. There are a dozen options, and the cost is around $0.01 per hour, not $0.89 per hour like here.

Another advantage is that you can do more custom things: add words to the vocabulary, detect speakers with biometric features, detect emotions.
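For a sense of where a figure like $0.01 per audio hour can come from when self-hosting, here's a back-of-the-envelope sketch. The GPU instance price and the real-time throughput factor are assumptions for illustration, not benchmarks:

```python
# Rough self-hosting cost sketch. Both constants below are assumed
# values, not measured numbers -- plug in your own instance price
# and measured throughput.
GPU_INSTANCE_PER_HOUR = 0.50   # assumed cloud GPU instance price, $/hour
REALTIME_FACTOR = 50           # assumed: 50 hours of audio per wall-clock hour

def self_hosted_cost_per_audio_hour(instance_price, realtime_factor):
    """Cost to transcribe one hour of audio on a self-hosted instance."""
    return instance_price / realtime_factor

cost = self_hosted_cost_per_audio_hour(GPU_INSTANCE_PER_HOUR, REALTIME_FACTOR)
print(f"${cost:.2f} per audio hour")  # prints $0.01 per audio hour
```

The point of the sketch is that throughput dominates: a batched model running many times faster than real time amortizes the instance price across many audio hours.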


Without talking about accuracy, any comparison is meaningless.


You don't even need to compare accuracy; you can just look at the technology. Facebook's model was trained on 256 GPU cards, and you can fine-tune it to your domain in a day or two. The release was two months ago. There is no way any cloud startup can have something better in production when they have access to just 4 Titan cards.


Movies are no longer just imagination. Chances are they are shaping the world for a real superhero/superhuman arrival ;)


The DeepSpeech project was shut down by Mozilla and the developers were let go. They are now Coqui.


Thanks, I had no idea. Any idea why this happened? The devs don't appear to be incompetent in any way. Do you also have any idea how Coqui currently compares to the unmaintained DeepSpeech?



Looks like a fork of Mozilla DeepSpeech by former DeepSpeech developers. What is the relation to the original project?


tl;dr:

- Mozilla fired the developers and mothballed the project

- But wants to keep it around as a museum piece

All ongoing development is happening in the fork.


I was very sad when that happened. There were a lot of language communities organizing their efforts around that project too.


Yeah, they really dumped them in it. Fortunately the devs are keeping it all going at coqui.ai and are really supportive of any community that got abandoned by Moz.


any idea how they are financed?


Looks like after the Nvidia $1.5M grant the devs came back ;)


Irrespective of the subject, DeepSpeech is a very old architecture with suboptimal results. You'd be better off trying any recent conformer implementation (flashlight, NeMo, WeNet, etc.) or wav2vec.


I ended up with DeepSpeech since it was very easy to get started with, and it has support for fairly low-latency inference, which is very important for my project.

I will take a look at the ones you suggested though, thanks for the heads-up!


AssemblyAI is at $0.50 per hour, which is not exactly cheap these days. With open-source models like Facebook RASR or Vosk you can get a self-hosted solution with even better accuracy at a cost of $0.05 per hour, 10 times cheaper.

Once any of your customers comes to require a private video setup, you'll end up with a self-hosted solution anyway.


I'll say this clearly: Assembly AI is the hands down best speech to text transcription service on price, quality, speed and support. Hands down. It more than pays for itself time and time again.

If we were free, then I might have this concern, but our average customer value is far beyond the cost to transcribe.

Actually this is insanely cheap, especially given how much they are regularly improving. Amazon costs over $1.25/hour, Rev AI costs over $2/hour, and Google Speech-to-Text is over $2.15/hour.
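Taking the per-hour rates quoted in this thread at face value (they may well have changed since), the cost gap at moderate volume is easy to sketch. The 1,000-hour workload below is a hypothetical figure for illustration:

```python
# Per-hour transcription rates as quoted in this thread -- assumed
# accurate at the time of writing; check current pricing pages.
rates = {
    "AssemblyAI": 0.89,
    "Amazon": 1.25,
    "Rev AI": 2.00,
    "Google": 2.15,
}

hours_per_month = 1000  # hypothetical monthly workload

for provider, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{provider:12s} ${rate * hours_per_month:,.0f}/month")
```

At that volume the cheapest and most expensive quoted rates differ by more than $1,200 per month, which is why the per-hour price matters even for a mid-sized product.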

We have a shared Slack channel with their team and I can't convey how incredible they have been. Literally the moment we have a question, we get instant responses.

Also, our users will not tolerate poor transcripts. Every word that needs correcting is time/money lost for them, so our goal is to give them the highest-quality transcripts.

We want to focus on our key value props, and transcription is not one of them. We focus on user experience, design options, and speed.


Are there open source projects like this?


I have CookieTTS, where I research lots of experimental stuff. (You can see my credits in the 'Thanks' section of 15.ai.)

I can get about 90% of the quality of 15.ai currently. I think I could surpass 15.ai but not without some help.


There's Mozilla TTS https://github.com/mozilla/TTS

Here's a sample from a TTS model + vocoder I released for it. I've no wish to deter the motivated, but it'd take a bit of figuring out how to set things up and you'd need to read the docs and code to get oriented :)

https://m.soundcloud.com/user-726556259/sherlock-wavegrad-sa...

Links to the models are here: https://discourse.mozilla.org/t/creating-a-github-page-for-h...

It was originally trained on two novels read by the same narrator on LibriVox (i.e. in the public domain).


This is actually quite impressive too, significantly better than the last time I looked into Mozilla TTS. Roughly how much audio does "two novels" equate to?


Here's another sample with the same model+vocoder, this time reading from a Wikipedia article: https://m.soundcloud.com/user-726556259/q-learning-wavegrad-...


It's about 32 hours of audio.

As some of the audio is read in accents different from the main one, ideally that audio would have been removed. Doing so would be expected to help with voice quality while reducing the overall amount of audio used and, as a bonus, cutting training time too.


Is there a simple interface, like the example in this thread, for a non-developer to use Mozilla TTS yet? I can't find one...


There's the demo server, which has a simple web UI where you can input text to be spoken, but setting it up locally isn't that suited to a non-developer.

https://github.com/mozilla/TTS/tree/master/TTS/server

https://github.com/mozilla/TTS/wiki/Build-instructions-for-s...

There's also a version in docker: https://github.com/synesthesiam/docker-mozillatts

And various Colabs too, which are fairly easy to get going with: https://github.com/mozilla/TTS/wiki/TTS-Notebooks-and-Tutori...


Vosk supports both Italian and French. The French model was trained by the Linto project and is pretty good.

