Use Whisper! It's a fairly small AI speech-to-text model that's great for gettin...

calebjosue · on Nov 21, 2023

Thank you! Taking a quick look at the source code you linked, I realize you first need to record the audio (again, it was a quick look). I am thinking of a human-computer interaction more like the one the Apple remote provides for telling Apple TV applications a command (most likely a search keyword). You know, no "middle-man", just the audio sent by the user and the engine's inference acting upon the received data.

smoldesu · on Nov 21, 2023

Generally, you'll have to build that layer yourself. No toolkits or SDKs that I'm aware of let you control a computer with an arbitrary prompt. That would probably be best relegated to an LLM that offers the user a suggested workflow and lets them confirm it.

> I realize you first need to record the audio

There are implementations that let you stream audio in realtime, but it is rather finicky and will require basic knowledge of audio DSP. I never set it up though, ymmv.

neontomo · on Nov 21, 2023

I would cut the task into separate problems and solve them one at a time. You could do it like this:

1. Trigger - pressing a button/terminal command

2. Record audio for 5-10 seconds (this will unfortunately lead to silence at the end of the file because it has a set audio length)

3. Run through Whisper API or run locally

4. Manipulate the response (Python can run local commands)
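
A rough sketch of steps 2-4, not something I've run on real hardware. It assumes Linux with ALSA's `arecord` for recording and the `openai-whisper` package (plus ffmpeg) for local transcription. The `COMMANDS` table and the function names are made up for illustration.

```python
import shlex
import subprocess

# Hypothetical phrase-to-command table; the entries are placeholders.
COMMANDS = {
    "open browser": "firefox",
    "list files": "ls -l",
}


def record(path="clip.wav", seconds=5):
    """Step 2: record a fixed-length mono clip (assumes `arecord` is installed)."""
    subprocess.run(
        ["arecord", "-d", str(seconds), "-f", "S16_LE",
         "-r", "16000", "-c", "1", path],
        check=True,
    )
    return path


def transcribe(path):
    """Step 3: run Whisper locally (assumes `pip install openai-whisper`)."""
    import whisper  # imported lazily so the rest works without it

    model = whisper.load_model("base")
    return model.transcribe(path)["text"]


def match_command(text):
    """Step 4a: strip punctuation and case, then look the phrase up."""
    key = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    key = " ".join(key.split())
    return COMMANDS.get(key)


def run_once():
    """Step 4b: record, transcribe, and run the matched command, if any."""
    text = transcribe(record())
    command = match_command(text)
    if command is None:
        print(f"No command for: {text!r}")
        return
    subprocess.run(shlex.split(command), check=False)
```

Calling `run_once()` from whatever your trigger is (step 1) ties it together. The exact-match lookup is deliberately strict so a misheard phrase runs nothing; anything fuzzier should probably ask for confirmation first.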

I've been considering building this myself lately for some kind of home AI on a Raspberry Pi, but I don't know if the steps would look exactly like this. My first thought was to cut the audio if it's silent for too long, but I'm not sure if you can process it at the same time as recording (probably). ffmpeg has some tools for detecting low dB levels.
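
For the silence part, here is a minimal pure-Python sketch that trims trailing silence after recording, as a stand-in for ffmpeg (which has a `silencedetect` filter). It assumes 16-bit mono samples already loaded into an `array('h')`; the function names, the 100 ms chunk size, and the -40 dB threshold are arbitrary choices of mine, not tuned values.

```python
import math
from array import array


def rms_dbfs(samples):
    """RMS level of 16-bit samples in dB relative to full scale."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 20 * math.log10(math.sqrt(mean_square) / 32768)


def trim_trailing_silence(samples, rate=16000, chunk_ms=100, threshold_db=-40.0):
    """Walk backwards in fixed-size chunks and drop those below the threshold."""
    chunk = max(1, rate * chunk_ms // 1000)
    end = len(samples)
    while end > 0:
        start = max(0, end - chunk)
        if rms_dbfs(samples[start:end]) > threshold_db:
            break  # found the last chunk that contains sound
        end = start
    return samples[:end]


# Example: one second of a 440 Hz tone followed by one second of silence.
tone = array("h", (int(10000 * math.sin(2 * math.pi * 440 * n / 16000))
                   for n in range(16000)))
clip = tone + array("h", [0] * 16000)
trimmed = trim_trailing_silence(clip)
```

The same per-chunk level check could in principle run while recording to stop early, but that needs a streaming audio source, which this sketch does not cover.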

calebjosue · on Nov 21, 2023

Thank you for the answers guys! I think I am going with the API version instead (sending audio to the API). Still, I am putting this on hold for the moment. Thanks!!!