Hacker News
Ask HN: How do you get started with adding voice commands to a computer system?
2 points by calebjosue on Nov 21, 2023 | 6 comments
Let's suppose you want to add support for voice commands to a Linux Distro.

For simplicity's sake, let's say you want to be able to tell the computer (with a terminal running): "Create XY directory", and in response the directory XY is created in the current directory.

How do you implement such a feature?

Would a software developer first need to train a system on recordings of lots of people saying "Create directory" phrases, and then run inference in production?

Are some corporations/start-ups already providing trained models for natural language - computer interaction?

How do you get started with this sort of task these days?

And of course, for accessibility purposes, text-based interaction remains unchanged.

Thanks!



Use Whisper! It's a fairly small AI speech-to-text model that's great for getting your feet wet with AI libraries. It's very accurate and easy to get working; I recommend it over pretty much everything else.

https://github.com/openai/whisper
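For anyone wondering what "easy to get working" looks like, here is a minimal sketch of local transcription with the `openai-whisper` package from the repo above. The function name `transcribe`, the file name in the usage comment, and the choice of the "base" model are just assumptions for illustration; it also assumes ffmpeg is installed, which Whisper uses to decode audio.

```python
def transcribe(path, model_name="base"):
    """Return the text Whisper hears in the audio file at `path`."""
    # Imported lazily so this file still loads where Whisper isn't installed.
    import whisper  # pip install openai-whisper (needs ffmpeg on PATH)

    # Model weights are downloaded on first use; "base" is an arbitrary pick.
    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return result["text"].strip()


# Usage (hypothetical file name):
#   print(transcribe("command.wav"))
```

Smaller models load faster but make more mistakes, so the model size is worth experimenting with for short commands.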


Thank you! Taking a quick look at the source code you linked, I realize you first need to record the audio (again, it was a quick look). I am thinking of a human-computer interaction more like the one the Apple remote provides for giving Apple TV applications a command (most likely a search keyword). You know, no "middle-man", just the audio sent by the user and an inference engine acting on the received data.


Generally, you'll have to build that layer yourself. No toolkits or SDKs that I'm aware of let you control a computer with an arbitrary prompt. That would probably be best relegated to an LLM that offers the user a suggested workflow and lets them confirm it.
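A rough sketch of the "suggest, then confirm" part, leaving out the LLM entirely: whatever produces the proposed command, nothing runs until the user says yes. The helper name `run_if_confirmed` and the `[y/N]` prompt are made up for this example.

```python
import shlex
import subprocess


def run_if_confirmed(argv, confirm=input):
    """Show the proposed command and run it only if the user answers 'y'."""
    answer = confirm(f"Run `{shlex.join(argv)}`? [y/N] ")
    if answer.strip().lower() != "y":
        return None  # declined: nothing is executed
    # argv is a list, so no shell is involved and nothing gets re-parsed.
    return subprocess.run(argv, capture_output=True, text=True, check=False)


# Usage:
#   run_if_confirmed(["mkdir", "XY"])
```

Passing the command as a list instead of a shell string means a misheard transcript can't smuggle in extra shell syntax.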

> I realize you first need to record the audio

There are implementations that let you stream audio in real time, but it is rather finicky and will require basic knowledge of audio DSP. I never set it up though, YMMV.


I would cut the task into separate problems and solve them one at a time. You could do it like this:

1. Trigger - pressing a button/terminal command

2. Record audio for 5-10 seconds (this will unfortunately lead to silence at the end of the file because it has a set audio length)

3. Run through Whisper API or run locally

4. Act on the transcribed text (Python can run local commands)
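Step 4 for the "Create XY directory" example from the question could be sketched roughly like this. It assumes the transcript arrives as a plain string; the names `parse_create_dir` and `handle` and the exact phrasings accepted are assumptions, not anything Whisper provides.

```python
import re
from pathlib import Path

# Accepts "create XY directory" or "create directory XY" (also "folder"),
# ignoring case and a trailing "." or "!". The name may contain only word
# characters and hyphens, so path separators and ".." are rejected.
_CREATE_DIR = re.compile(
    r"^\s*create\s+(?:"
    r"(?:directory|folder)\s+(?P<after>[\w-]+)"
    r"|(?P<before>[\w-]+)\s+(?:directory|folder)"
    r")\s*[.!]?\s*$",
    re.IGNORECASE,
)


def parse_create_dir(transcript):
    """Return the directory name in the transcript, or None if no match."""
    match = _CREATE_DIR.match(transcript)
    if match is None:
        return None
    return match.group("after") or match.group("before")


def handle(transcript, cwd="."):
    """Create the requested directory under `cwd`; return its path or None."""
    name = parse_create_dir(transcript)
    if name is None:
        return None
    target = Path(cwd) / name
    target.mkdir(exist_ok=True)
    return target
```

A fixed pattern like this only covers the phrasings you anticipate, and speech-to-text output may split or respell a name, so treat it as a starting point.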

I've been considering building this myself lately for some kind of home AI on a Raspberry Pi, but I don't know if the steps would look exactly like this. My first thought was to cut the audio if it's silent for too long, but I'm not sure whether you can process it at the same time as recording (probably). ffmpeg has some tools for detecting low dB.
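On the silence problem: ffmpeg's `silencedetect` filter is one option, but as a sketch of the idea, here is a plain energy-threshold trim in pure Python (not ffmpeg). It assumes signed 16-bit mono samples already in memory; the 0.1 s window and the threshold of 500 are arbitrary guesses that would need tuning for a real microphone.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of integer samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def trim_trailing_silence(samples, rate=16000, window_s=0.1, threshold=500):
    """Drop trailing windows whose RMS level is below `threshold`.

    `samples` is a sequence of signed 16-bit PCM values (mono).
    """
    window = max(1, int(rate * window_s))
    end = len(samples)
    # Walk backwards one window at a time until a loud window is found.
    while end > 0 and rms(samples[max(0, end - window):end]) < threshold:
        end = max(0, end - window)
    return samples[:end]
```

Because it works window by window, up to one window of silence can remain at the end. The same check could in principle be run on each chunk while recording, to stop early instead of always recording the full 5-10 seconds.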


Thank you for the answers, guys! I think I am going with the API version instead (sending audio to the API). Still, I am putting this on hold for the moment. Thanks!!!




