Merged
41 changes: 41 additions & 0 deletions AGENTS.md
@@ -6,7 +6,48 @@

- Use the linux kernel guidelines for commenting insofar as they are applicable to Go (e.g. avoid stating the obvious)
- Use `any` instead of `interface{}` and in general use modern Go
- Make liberal use of line breaks; don't try to stuff structs onto one line

# Error handling

- Wrap errors when passing them back up the stack
- Do not silently ignore errors
- Be defensive with external I/O
- Do not be defensive with internal state (e.g. nil pointers, empty strings)
- It is the responsibility of the caller to make sure internal objects and parameters are initialized before calling a function that uses them.
- Let it crash or panic.

# Editing

- Do the simplest thing that works, but no simpler
- Do not mix task categories
- If asked to make a logic change, do not reorganise or refactor code
- If asked to refactor code, do not make logic or functionality changes

# Documentation

- Ensure that new command line options and env vars are documented in the CLI help and the README.md.

# Building and linting

- Build with `go build -o voxinput .`
- Lint with `go vet .`
- After updates to version.txt, the Nix flake, or go.mod, build with `nix build .`
- If vendor hash errors are found, use `fakeHash` then `nix build .` to get the correct hash

# Running

- Use the private scripts in the bringup/ directory
- Suggest creating bringup scripts if they do not exist already
- Example of a bringup script for realtime transcription
```sh
#!/bin/sh
export OPENAI_BASE_URL=http://localai-host:8081/v1 OPENAI_WS_BASE_URL=ws://localai-host:8081/v1/realtime VOXINPUT_TRANSCRIPTION_MODEL=whisper-large-turbo
export VOXINPUT_PROMPT="VoxInput LocalAI ROS2 LLM NixOS struct env var"
./voxinput listen
```
- Example of a bringup script for transcribing from a monitor device
```sh
#!/bin/sh
OPENAI_BASE_URL=http://ledbx:8081/v1 OPENAI_WS_BASE_URL=ws://ledbx:8081/v1/realtime VOXINPUT_TRANSCRIPTION_MODEL=whisper-large-turbo VOXINPUT_CAPTURE_DEVICE="Monitor of iFi (by AMR) HD USB Audio Analog Stereo" ./voxinput listen --output-file /tmp/transcript.txt
```
11 changes: 10 additions & 1 deletion README.md
@@ -69,12 +69,17 @@ Unless you don't mind running VoxInput as root, then you also need to ensure the
- `OPENAI_BASE_URL` or `VOXINPUT_BASE_URL`: The base URL of the OpenAI compatible API server: defaults to `http://localhost:8080/v1`
- `VOXINPUT_LANG` or `LANG`: Language code for transcription (defaults to empty).
- `VOXINPUT_TRANSCRIPTION_MODEL`: Transcription model (default: `whisper-1`).
- `VOXINPUT_ASSISTANT_MODEL`: Assistant model (default: `none`).
- `VOXINPUT_ASSISTANT_VOICE`: Assistant voice (default: `alloy`).
- `VOXINPUT_TRANSCRIPTION_TIMEOUT`: Timeout duration (default: `30s`).
- `VOXINPUT_SHOW_STATUS`: Show GUI notifications (`yes`/`no`, default: `yes`).
- `VOXINPUT_CAPTURE_DEVICE`: Specific audio capture device name (run `voxinput devices` to list).
- `VOXINPUT_OUTPUT_FILE`: Path to save the transcribed text to a file instead of typing it with dotool.
- `VOXINPUT_MODE`: Realtime mode (transcription|assistant, default: transcription).
- `XDG_RUNTIME_DIR` or `VOXINPUT_RUNTIME_DIR`: Used for the PID and state files, defaults to `/run/voxinput` if neither is present

**Warning**: Assistant mode is WIP and you may need a particular version of LocalAI's realtime API to run it because I am developing both in lockstep. Eventually though it should be compatible with at least OpenAI or LocalAI.
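
As a rough sketch of how a value such as `VOXINPUT_MODE` might be validated, the following assumes a hypothetical `parseMode` helper; it is not VoxInput's actual implementation. The only facts taken from the README are the two accepted values and the `transcription` default.

```go
package main

import (
	"fmt"
	"os"
)

// parseMode is a hypothetical validator used only for illustration.
// An empty value falls back to the documented default.
func parseMode(s string) (string, error) {
	switch s {
	case "":
		return "transcription", nil
	case "transcription", "assistant":
		return s, nil
	default:
		return "", fmt.Errorf("invalid mode %q: want transcription or assistant", s)
	}
}

func main() {
	mode, err := parseMode(os.Getenv("VOXINPUT_MODE"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("mode:", mode)
}
```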

### Commands

- **`listen`**: Start speech to text daemon.
@@ -83,6 +88,7 @@ Unless you don't mind running VoxInput as root, then you also need to ensure the
- `--no-show-status`: Don't show when recording has started or stopped.
- `--output-file <path>`: Save transcript to file instead of typing.
- `--prompt <text>`: Text used to condition model output. Could be previously transcribed text or uncommon words you expect to use
- `--mode <transcription|assistant>`: Realtime mode (default: transcription)

```bash
./voxinput listen
@@ -226,7 +232,10 @@ The realtime mode has a UI to display various actions being taken by VoxInput. H
- [x] GUI and system tray
- [x] Voice detection and activation (partial, see below)
- [ ] Code words to start and stop transcription
- [ ] Allow user to describe a button they want to press (requires submitting screen shot and transcription to LocalAGI)
- [ ] Assistant mode
- [x] Voice conversations with an LLM
- [ ] Submit desktop images to a VLM to allow it to click on items
- [ ] Use tool calls or MCP to allow the VLM/LLM to perform actions

## Signals

3 changes: 2 additions & 1 deletion flake.nix
@@ -46,7 +46,8 @@
# Path to the source code
src = ./.;

vendorHash = "sha256-+67Ajh+Jy5+mpYQCiUXDG5EKg72YtW0v9IUuswkmUXM="; #nixpkgs.lib.fakeHash;
vendorHash = "sha256-UH4oOSSl1Xq3aVzI+N97UrsYq/MN2M/ehud8RFHNoAg=";
# vendorHash = nixpkgs.lib.fakeHash;

nativeBuildInputs = with pkgs; [
makeWrapper
4 changes: 1 addition & 3 deletions go.mod
@@ -4,13 +4,11 @@ go 1.24.2

require (
fyne.io/fyne/v2 v2.7.1
github.com/WqyJh/go-openai-realtime v0.6.0
github.com/WqyJh/go-openai-realtime v0.6.1
github.com/gen2brain/malgo v0.11.24
github.com/sashabaranov/go-openai v1.41.2
)

replace github.com/WqyJh/go-openai-realtime v0.6.0 => github.com/richiejp/go-openai-realtime v0.6.1-fix-created-event

require (
fyne.io/systray v1.11.1-0.20250603113521-ca66a66d8b58 // indirect
github.com/BurntSushi/toml v1.5.0 // indirect
4 changes: 2 additions & 2 deletions go.sum
@@ -4,6 +4,8 @@ fyne.io/systray v1.11.1-0.20250603113521-ca66a66d8b58 h1:eA5/u2XRd8OUkoMqEv3IBlF
fyne.io/systray v1.11.1-0.20250603113521-ca66a66d8b58/go.mod h1:RVwqP9nYMo7h5zViCBHri2FgjXF7H2cub7MAq4NSoLs=
github.com/BurntSushi/toml v1.5.0 h1:W5quZX/G/csjUnuI8SUYlsHs9M38FC7znL0lIO+DvMg=
github.com/BurntSushi/toml v1.5.0/go.mod h1:ukJfTF/6rtPPRCnwkur4qwRxa8vTRFBF0uk2lLoLwho=
github.com/WqyJh/go-openai-realtime v0.6.1 h1:QhWRrbx9ZeixN/kVH/5VyEgjoMei7xisHqw7ybbef2E=
github.com/WqyJh/go-openai-realtime v0.6.1/go.mod h1:BCN7J7AUbfSFkLLVnhGWF2OkvoQ7GqTWrU/w+d+QwR4=
github.com/WqyJh/jsontools v0.3.1 h1:zKT+DvxUSTji06ZcjsbQzZ48PycFZDI0OGATmmFhJ+U=
github.com/WqyJh/jsontools v0.3.1/go.mod h1:Gk2OlyXjAJmYNZ0aUbEXGHq4I5ihGRjXxVuUprWtkss=
github.com/coder/websocket v1.8.12 h1:5bUXkEPPIbewrnkU8LTCLVaxi4N4J8ahufH2vlo4NAo=
@@ -61,8 +63,6 @@ github.com/pkg/profile v1.7.0 h1:hnbDkaNWPCLMO9wGLdBFTIZvzDrDfBM2072E1S9gJkA=
github.com/pkg/profile v1.7.0/go.mod h1:8Uer0jas47ZQMJ7VD+OHknK4YDY07LPUC6dEvqDjvNo=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/richiejp/go-openai-realtime v0.6.1-fix-created-event h1:Q9cdsuGl6cWnQ8ceFLNYBcUe/C1ZX/6dJ10qttlrZbk=
github.com/richiejp/go-openai-realtime v0.6.1-fix-created-event/go.mod h1:BCN7J7AUbfSFkLLVnhGWF2OkvoQ7GqTWrU/w+d+QwR4=
github.com/rymdport/portal v0.4.2 h1:7jKRSemwlTyVHHrTGgQg7gmNPJs88xkbKcIL3NlcmSU=
github.com/rymdport/portal v0.4.2/go.mod h1:kFF4jslnJ8pD5uCi17brj/ODlfIidOxlgUDTO5ncnC4=
github.com/sashabaranov/go-openai v1.41.2 h1:vfPRBZNMpnqu8ELsclWcAvF19lDNgh1t6TVfFFOPiSM=
107 changes: 78 additions & 29 deletions internal/audio/audio.go
@@ -17,7 +17,6 @@ type StreamConfig struct {
Format malgo.FormatType
Channels int
SampleRate int
DeviceType malgo.DeviceType
MalgoContext malgo.Context
CaptureDeviceID *malgo.DeviceID
}
@@ -35,10 +34,7 @@ func (config StreamConfig) asDeviceConfig(deviceType malgo.DeviceType) malgo.Dev
if config.SampleRate != 0 {
deviceConfig.SampleRate = uint32(config.SampleRate)
}
if config.DeviceType != 0 {
deviceConfig.DeviceType = config.DeviceType
}
if config.CaptureDeviceID != nil {
if config.CaptureDeviceID != nil && (deviceType == malgo.Capture || deviceType == malgo.Duplex) {
deviceConfig.Capture.DeviceID = config.CaptureDeviceID.Pointer()
}
return deviceConfig
@@ -64,8 +60,8 @@ func (c *StreamConfig) SetCaptureDeviceByName(mctx *malgo.Context, name string)
return false, nil
}

func stream(ctx context.Context, abortChan chan error, config StreamConfig, deviceCallbacks malgo.DeviceCallbacks) error {
deviceConfig := config.asDeviceConfig(malgo.Capture)
func stream(ctx context.Context, abortChan chan error, config StreamConfig, deviceType malgo.DeviceType, deviceCallbacks malgo.DeviceCallbacks) error {
deviceConfig := config.asDeviceConfig(deviceType)
device, err := malgo.InitDevice(config.MalgoContext, deviceConfig, deviceCallbacks)
if err != nil {
return err
@@ -121,12 +117,10 @@ func ListCaptureDevices() error {
}

// Capture records incoming samples into the provided writer.
// The function initializes a capture device in the default context using
// provide stream configuration.
// Capturing will commence writing the samples to the writer until either the
// writer returns an error, or the context signals done.
// The function initializes a capture device in the default context using the
// provided stream configuration.
// XXX: Capture, Duplex and Playback are mutually exclusive, only use one at a time
func Capture(ctx context.Context, w io.Writer, config StreamConfig) error {
config.DeviceType = malgo.Capture
abortChan := make(chan error)
defer close(abortChan)
aborted := false
@@ -137,24 +131,25 @@ func Capture(ctx context.Context, w io.Writer, config StreamConfig) error {
return
}

_, err := w.Write(inputSamples)
if err != nil {
aborted = true
abortChan <- err
if len(inputSamples) > 0 {
_, err := w.Write(inputSamples)
if err != nil {
aborted = true
abortChan <- err
}
}
},
}

return stream(ctx, abortChan, config, deviceCallbacks)
return stream(ctx, abortChan, config, malgo.Capture, deviceCallbacks)
}

// Playback streams samples from a reader to the sound device.
// The function initializes a playback device in the default context using
// provide stream configuration.
// Playback will commence playing the samples provided from the reader until either the
// reader returns an error, or the context signals done.
func Playback(ctx context.Context, r io.Reader, config StreamConfig) error {
config.DeviceType = malgo.Playback
// Duplex streams audio from a reader to the playback device and captures audio
// from the capture device to a writer.
// It initializes a duplex device in the default context using the provided stream configuration.
// It expects both r and w to be non-nil.
// XXX: Capture, Duplex and Playback are mutually exclusive, only use one at a time
func Duplex(ctx context.Context, r io.Reader, w io.Writer, config StreamConfig) error {
abortChan := make(chan error)
defer close(abortChan)
aborted := false
@@ -164,20 +159,74 @@ func Playback(ctx context.Context, r io.Reader, config StreamConfig) error {
if aborted {
return
}
if frameCount == 0 {
return

if len(inputSamples) > 0 {
_, err := w.Write(inputSamples)
if err != nil {
aborted = true
abortChan <- err
return
}
}

read, err := io.ReadFull(r, outputSamples)
if read <= 0 {
if len(outputSamples) > 0 {
if frameCount == 0 {
return
}

read, err := r.Read(outputSamples)
if err != nil {
if err == io.EOF {
for i := read; i < len(outputSamples); i++ {
outputSamples[i] = 0
}
aborted = true
abortChan <- io.EOF
return
}
aborted = true
abortChan <- err
return
}
}
},
}

return stream(ctx, abortChan, config, malgo.Duplex, deviceCallbacks)
}

// Playback streams samples from the provided reader to the playback device.
// The function initializes a playback device in the default context using the
// provided stream configuration.
// XXX: Capture, Duplex and Playback are mutually exclusive, only use one at a time
func Playback(ctx context.Context, r io.Reader, config StreamConfig) error {
abortChan := make(chan error)
defer close(abortChan)
aborted := false

deviceCallbacks := malgo.DeviceCallbacks{
Data: func(outputSamples, inputSamples []byte, frameCount uint32) {
if aborted {
return
}

if len(outputSamples) > 0 {
if frameCount == 0 {
return
}

read, err := r.Read(outputSamples)
if err != nil {
aborted = true
abortChan <- err
return
}
for i := read; i < len(outputSamples); i++ {
outputSamples[i] = 0
}
}
},
}

return stream(ctx, abortChan, config, deviceCallbacks)
return stream(ctx, abortChan, config, malgo.Playback, deviceCallbacks)
}
11 changes: 9 additions & 2 deletions internal/gui/gui.go
@@ -20,14 +20,17 @@ type Msg interface {
type ShowListeningMsg struct{}
type ShowSpeechDetectedMsg struct{}
type ShowTranscribingMsg struct{}
type ShowGeneratingResponseMsg struct{}
type HideMsg struct{}
type ShowStoppingMsg struct{}

func (m *ShowListeningMsg) IsMsg() bool { return true }
func (m *ShowSpeechDetectedMsg) IsMsg() bool { return true }
func (m *ShowTranscribingMsg) IsMsg() bool { return true }
func (m *HideMsg) IsMsg() bool { return true }
func (m *ShowStoppingMsg) IsMsg() bool { return true }

func (m *ShowGeneratingResponseMsg) IsMsg() bool { return true }
func (m *HideMsg) IsMsg() bool { return true }
func (m *ShowStoppingMsg) IsMsg() bool { return true }

type GUI struct {
a fyne.App
@@ -77,6 +80,10 @@ func New(ctx context.Context, showStatus string) *GUI {
if showStatus != "" {
ui.showStatus("Transcribing...", theme.FileTextIcon())
}
case *ShowGeneratingResponseMsg:
if showStatus != "" {
ui.showStatus("Generating response...", theme.FileAudioIcon())
}
case *HideMsg:
if ui.cancelTimer != nil {
ui.cancelTimer()