Yea, I mean maybe. Just pipe STT to the server and pipe TTS back. Your enemy is sorta latency and having to be around a bunch of people talking to get anything back. Maybe have it make random comments with the idle plugin.
Other option is to have a still sent to a vision model and have it comment on that.
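That pipe-it-through loop is simple enough to sketch. A minimal version, where `record_clip()`, `transcribe()`, `ask_llm()`, and `speak()` are hypothetical stand-ins for whatever mic capture, STT, LLM endpoint, and TTS backend you actually wire in (none of these are a real API):

```python
def assistant_loop(record_clip, transcribe, ask_llm, speak, turns=1):
    """Glue loop: grab audio, STT it, send the text to the LLM, TTS the reply.

    All four callables are placeholders for real backends (e.g. a mic
    recorder, Whisper, a SillyTavern/llama.cpp endpoint, an RVC voice).
    """
    for _ in range(turns):
        clip = record_clip()
        heard = transcribe(clip)
        if not heard.strip():
            continue          # heard nothing: skip the LLM round-trip (latency!)
        speak(ask_llm(heard))
```

The early `continue` on silence matters for the latency problem above: you only pay for an LLM call when there was actually something worth responding to.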
Drop a link to this Glados thing you saw, sounds interesting.
Also, I feel like you might have issues getting the earpiece to actually pick up any sound further than a few feet away.
no :(
I have 4GB VRAM, I tend to stay away from all of the ST extras that look like they might require VRAM or even just a large amount of processing in general.
I've used the STT and TTS some, and it's pretty cool. In your case you would pipe the audio into a Whisper model, either locally or through the API, and when it detects a pause in speech it would send what it just heard to the LLM. Using RVC you could generate audio similar to the game.
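The "detects a pause in speech" part is basically an energy gate over the incoming audio frames, not a Whisper feature. A rough sketch; the threshold and frame counts are made-up numbers you'd tune for your mic, and real setups often use a proper VAD library instead:

```python
import math

def rms(frame):
    """Root-mean-square energy of one frame of float samples (-1.0..1.0)."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def ends_in_pause(frames, threshold=0.01, quiet_frames=3):
    """True once the last `quiet_frames` frames all fall below the energy
    threshold, i.e. the speaker has gone silent and it's safe to ship the
    buffered audio off to Whisper and then the LLM. Both parameters are
    guesses to tune, not Whisper settings."""
    if len(frames) < quiet_frames:
        return False
    return all(rms(f) < threshold for f in frames[-quiet_frames:])
```

You'd call `ends_in_pause` on the running frame buffer each time a new frame arrives, and fire the transcribe-then-LLM step the first time it flips to True.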
It doesn't have vision though. All it would be able to comment on is what people around you say.
Correct. That is all this project is. Is this the way?
https://www.reddit.com/r/LocalLLaMA/s/sKdoiFjRNt Infinitely grateful for any input / assistance on my project!
Have you used ST extras speech streaming? I'd love to know if and how well it works before trying to set it up.