I believe most people have already moved to offline engines. There's no need to send your data to a third party like Assembly: Nvidia's NeMo Conformer, Facebook's Robust wav2vec, Vosk. There are a dozen options, and the cost is $0.01 per hour, not $0.89 per hour like here.
Another advantage is that you can do more custom things: add words to the vocabulary, identify speakers by voice biometrics, detect emotions.
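To make the vocabulary point concrete: Vosk's recognizer accepts an optional JSON-encoded word list that restricts decoding to those words. A minimal sketch, assuming the vosk package and a downloaded model (the word list and model path here are placeholders, and the vosk calls are left as comments since they need a model on disk):

```python
import json

# Hypothetical domain word list; "[unk]" lets the recognizer flag
# out-of-vocabulary audio instead of forcing a wrong word.
domain_words = ["assembly", "transcribe", "vocabulary", "speaker", "[unk]"]
grammar = json.dumps(domain_words)

# With vosk installed and a model downloaded (assumption, not run here):
#   from vosk import Model, KaldiRecognizer
#   rec = KaldiRecognizer(Model("path/to/model"), 16000, grammar)
#   rec.AcceptWaveform(pcm_chunk)  # 16 kHz, 16-bit mono PCM
```

The grammar is just a JSON array of strings, so building it per customer or per domain is trivial.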
You don't even need to compare accuracy; you can just look at the technology. Facebook's model was trained on 256 GPUs, and you can fine-tune it to your domain in a day or two. The release was two months ago. There's no way a cloud startup with access to just four Titan cards can have anything better in production.
Thanks, I had no idea. Any idea why this happened? The devs don’t appear to be incompetent in any way.
Do you also have any idea how Coqui currently compares to the unmaintained DeepSpeech?
Yeah, they really dumped them in it, fortunately the devs are keeping it all going at coqui.ai and are really supportive of any community that got abandoned by Moz.
Irrespective of the subject, DeepSpeech is a very old architecture with suboptimal results. You'd be better off trying any of the recent Conformer implementations (Flashlight, NeMo, WeNet, etc.) or wav2vec.
I ended up with DeepSpeech since it was very easy to get started with, and it supports fairly low-latency inference, which is very important for my project.
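For context on the low-latency point: DeepSpeech exposes a streaming API, so you can feed small PCM chunks as they arrive and poll partial results instead of transcribing whole files. A sketch of the chunking side (the frame size is an arbitrary choice, and the commented DeepSpeech calls are assumptions since they need a model file):

```python
def frames(samples, frame_len=320):
    """Split a PCM sample buffer into fixed-size frames; the trailing
    partial frame is kept so no audio is dropped at stream end."""
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]

# With the deepspeech package and an acoustic model (assumption, not run here):
#   from deepspeech import Model
#   ds = Model("deepspeech-0.9.3-models.pbmm")
#   stream = ds.createStream()
#   for frame in frames(mic_buffer):
#       stream.feedAudioContent(frame)
#       partial = stream.intermediateDecode()  # low-latency partial text
#   final_text = stream.finishStream()

chunks = frames(list(range(700)), frame_len=320)
```

Because partials come back per frame, perceived latency is roughly one frame rather than the length of the utterance.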
I will take a look at the ones you suggested though, thanks for the heads-up!
Assembly AI is at $0.50 per hour, which is not especially reasonable these days. With open-source models like Facebook's RASR or Vosk you can get a self-hosted solution with even better accuracy at around $0.05 per hour, ten times cheaper.
And once any of your customers requires a private video setup, you'll end up with a self-hosted solution anyway.
I'll say this clearly: Assembly AI is hands down the best speech-to-text transcription service on price, quality, speed, and support. It more than pays for itself time and time again.
If our product were free, then I might have this concern, but our average customer value is far beyond the cost to transcribe.
Actually this is insanely cheap, especially given how much they are regularly improving. Amazon costs over $1.25, RevAI costs over $2/hour, and Google Speech to Text is over $2.15/hour.
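Putting the per-hour figures quoted in this thread side by side makes the gap obvious (the rates below are just the numbers stated above, and the monthly volume is a made-up example):

```python
# USD per audio-hour, as quoted in this thread
rates = {
    "AssemblyAI": 0.89,
    "Amazon Transcribe": 1.25,
    "Rev AI": 2.00,
    "Google Speech-to-Text": 2.15,
}

hours_per_month = 1000  # hypothetical monthly volume
monthly = {name: round(rate * hours_per_month, 2) for name, rate in rates.items()}
```

At 1,000 hours a month, the spread between the cheapest and priciest provider is over $1,200.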
We have a shared Slack channel with their team and I can't convey how incredible they have been. Literally the moment we have a question, we get instant responses.
Also, our users will not accept poor transcripts. Every word that needs correcting is time and money lost for them, so our goal is to give them the highest-quality transcripts.
We want to focus on our key value props, and transcription is not one of them. We focus on user experience, design options, and speed.
Here's a sample from a TTS model + vocoder I released for it. I've no wish to deter the motivated, but it'd take a bit of figuring out to set things up, and you'd need to read the docs and code to get oriented :)
This is actually quite impressive too, significantly better than the last time I looked into Mozilla TTS. Roughly how much audio does "two novels" equate to?
Some of the audio is read in accents different from the main one used, and ideally that audio would have been removed. Doing so would be expected to improve voice quality, and, since it reduces the overall amount of audio used, it would cut training time too as a bonus.
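A filtering pass like the one described can be a one-liner over the dataset metadata. The layout and column names below are purely hypothetical, just to show the shape of it:

```python
import csv
import io

# Hypothetical metadata: one row per clip, with an "accent" tag.
# In practice this would be the real metadata CSV shipped with the dataset.
metadata = io.StringIO(
    "clip,accent\n"
    "chapter01_001.wav,main\n"
    "chapter01_002.wav,other\n"
    "chapter02_001.wav,main\n"
)

target_accent = "main"
keep = [row["clip"] for row in csv.DictReader(metadata)
        if row["accent"] == target_accent]
```

The surviving list of clips then becomes the training manifest, which also shrinks the training set and hence training time.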
There's the demo server, which has a simple web UI where you can input text to be spoken, but setting it up locally is not that well suited to a non-developer.