# Implementation Plan: Toak (Linux Dictation System)
Based on `PROJECT_PLAN.md`, this implementation plan breaks the project down into concrete, sequential steps.
## Phase 1: Project Setup & Core CLI

**Goal:** Initialize the project, set up configuration storage, and handle cross-process state (to support the "toggle" argument).

1. **Initialize Project:**
   * Run `dotnet new console -n Toak -o src`, or initialize the project in the root directory. Ensure it targets .NET 10.
2. **Configuration Management:**
   * Create a `ConfigManager` to load and save user settings (Groq API key, enabled prompt modules) to `~/.config/toak/config.json`.
3. **CLI Argument Parsing:**
   * Parse the `toggle` argument to start or stop the recording workflow.
   * Add a `setup` argument that launches an interactive CLI wizard to collect the Groq API key and the preferred typing backend (`wtype` vs. `xdotool`).
4. **State Management (The Toggle):**
   * Because `toggle` is invoked from a hotkey, every press starts a new process. Implement a state file (e.g., `/tmp/toak.pid`) or a local socket to share the toggle state between invocations. If a recording is already in progress, the second toggle should signal the existing recording process to stop and then proceed to Phase 3.
5. **Notifications:**
   * Implement a simple wrapper that calls `notify-send "Toak" "Message"` to alert the user to state changes ("Recording Started", "Transcribing...", "Error").
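The `ConfigManager` step could look roughly like the sketch below. It assumes `System.Text.Json` for serialization; the `ToakConfig` class and its property names are hypothetical, since the plan only says that the Groq API key and the enabled prompt modules are stored.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Hypothetical settings shape; the property names are assumptions, not from the plan.
public sealed class ToakConfig
{
    public string GroqApiKey { get; set; } = "";
    public string TypingBackend { get; set; } = "wtype";
    public List<string> EnabledPromptModules { get; set; } = new();
}

public static class ConfigManager
{
    private static readonly JsonSerializerOptions Options = new() { WriteIndented = true };

    // Resolves ~/.config/toak/config.json from the current user's home directory.
    public static string DefaultPath => Path.Combine(
        Environment.GetFolderPath(Environment.SpecialFolder.UserProfile),
        ".config", "toak", "config.json");

    // Returns defaults when the file does not exist yet (e.g. before `setup` has run).
    public static ToakConfig Load(string path)
    {
        if (!File.Exists(path)) return new ToakConfig();
        return JsonSerializer.Deserialize<ToakConfig>(File.ReadAllText(path), Options)
               ?? new ToakConfig();
    }

    // Creates the parent directory if needed, then writes the settings as JSON.
    public static void Save(ToakConfig config, string path)
    {
        var dir = Path.GetDirectoryName(path);
        if (!string.IsNullOrEmpty(dir)) Directory.CreateDirectory(dir);
        File.WriteAllText(path, JsonSerializer.Serialize(config, Options));
    }
}
```

The `setup` wizard would call `ConfigManager.Save(config, ConfigManager.DefaultPath)`, and every later invocation would start with `ConfigManager.Load(ConfigManager.DefaultPath)`.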
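As a minimal sketch of the PID-file variant of the toggle, the helper below decides whether a `toggle` invocation should start or stop a recording. The `ToggleState` and `ToggleAction` names are hypothetical, and the socket-based alternative is not shown.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public enum ToggleAction { StartRecording, StopRecording }

public static class ToggleState
{
    // Decides what a `toggle` invocation should do, based on the PID file.
    // A missing, unparsable, or stale PID file (process no longer running) means "start".
    public static ToggleAction Decide(string pidFile, out int recorderPid)
    {
        recorderPid = 0;
        if (File.Exists(pidFile)
            && int.TryParse(File.ReadAllText(pidFile).Trim(), out var pid)
            && IsRunning(pid))
        {
            recorderPid = pid;
            return ToggleAction.StopRecording;
        }
        return ToggleAction.StartRecording;
    }

    // Records the current process as the active recorder.
    public static void MarkRecording(string pidFile) =>
        File.WriteAllText(pidFile, Environment.ProcessId.ToString());

    // Removes the PID file; deleting a file that does not exist is not an error.
    public static void Clear(string pidFile) => File.Delete(pidFile);

    private static bool IsRunning(int pid)
    {
        try
        {
            using var process = Process.GetProcessById(pid);
            return true;
        }
        catch (ArgumentException)
        {
            return false; // no process with this ID is running
        }
    }
}
```

A PID alone cannot distinguish the recorder from an unrelated process that later reuses the same ID, so a real implementation may also want to check the process name or prefer the socket approach.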
## Phase 2: Audio Capture

**Goal:** Safely record audio from the active microphone.

1. **AudioRecorder Class:**
   * Implement a method to start an `ffmpeg` (or `arecord`) process that saves to `/tmp/toak_recording.wav`.
   * For example: `ffmpeg -f alsa -i default -y /tmp/toak_recording.wav`.
2. **Process Management:**
   * Ensure the recording process can be terminated gracefully when the "toggle stop" is received. Prefer sending `SIGINT` so that `ffmpeg` can finalize the WAV file. The standard .NET `Process.Kill` ends the process immediately on Linux and may leave the file incomplete, so treat it as a fallback.
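The recorder described above might be sketched as follows. .NET has no built-in call for sending `SIGINT` to another process, so this version assumes the standard `kill` utility is available and falls back to `Process.Kill` after a timeout. The method names are hypothetical.

```csharp
using System;
using System.Diagnostics;

public sealed class AudioRecorder
{
    private Process? _ffmpeg;

    // Builds the command from the plan: ffmpeg -f alsa -i default -y <outputPath>
    public static ProcessStartInfo BuildStartInfo(string outputPath)
    {
        var psi = new ProcessStartInfo("ffmpeg") { UseShellExecute = false };
        foreach (var arg in new[] { "-f", "alsa", "-i", "default", "-y", outputPath })
            psi.ArgumentList.Add(arg);
        return psi;
    }

    public void Start(string outputPath)
    {
        _ffmpeg = Process.Start(BuildStartInfo(outputPath))
                  ?? throw new InvalidOperationException("Could not start ffmpeg.");
    }

    // Sends SIGINT through the `kill` utility (an assumption of this sketch) so that
    // ffmpeg can finish writing the file, then force-kills it if it does not exit in time.
    public void Stop(TimeSpan timeout)
    {
        if (_ffmpeg is null || _ffmpeg.HasExited) return;

        using (var kill = Process.Start("kill", $"-s INT {_ffmpeg.Id}"))
            kill?.WaitForExit();

        if (!_ffmpeg.WaitForExit((int)timeout.TotalMilliseconds))
            _ffmpeg.Kill();

        _ffmpeg.Dispose();
        _ffmpeg = null;
    }
}
```

Passing the arguments through `ArgumentList` rather than a single string avoids quoting issues if the output path ever contains spaces.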
## Phase 3: The Groq STT & LLM Pipeline

**Goal:** Send the audio to Groq Whisper and refine the transcript using Llama 3.1.

1. **GroqApiClient:**
   * Initialize a generic `HttpClient` wrapper tailored for the Groq API.
2. **Transcription (Whisper):**
   * Implement `TranscribeAsync(string filePath)`.
   * Use `MultipartFormDataContent` to upload the `.wav` file for transcription with the `whisper-large-v3-turbo` model.
   * Parse the returned text.
3. **Dynamic Prompt Builder:**
   * Build the `PromptBuilder` class.
   * Read the settings from `ConfigManager` to conditionally append instructions (Punctuation, SAP/HANA rules, Style Modes) to the base system prompt.
   * Enforce the prompt-injection safeguard: `"Output ONLY the corrected text for the data inside the <transcript> tags."`
4. **Refinement (Llama 3.1):**
   * Implement `RefineTextAsync(string rawTranscript, string systemPrompt)`.
   * Call `llama-3.1-8b-instant` with **Temperature = 0.0**.
   * Wrap the user input in `<transcript>{rawTranscript}</transcript>`.
   * Extract the cleaned text from the response.
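A possible shape for the transcription call is sketched below. The base URL, the `audio/transcriptions` path, the `file` and `model` form fields, and the `text` response field are assumptions based on Groq's OpenAI-compatible API and should be checked against the current Groq documentation.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text.Json;
using System.Threading.Tasks;

public sealed class GroqApiClient
{
    private readonly HttpClient _http;

    public GroqApiClient(HttpClient http, string apiKey)
    {
        _http = http;
        _http.BaseAddress ??= new Uri("https://api.groq.com/openai/v1/"); // assumed base URL
        _http.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", apiKey);
    }

    // Uploads the recording as multipart form data and returns the transcript text.
    public async Task<string> TranscribeAsync(string filePath)
    {
        using var form = new MultipartFormDataContent();
        var audio = new ByteArrayContent(await File.ReadAllBytesAsync(filePath));
        audio.Headers.ContentType = new MediaTypeHeaderValue("audio/wav");
        form.Add(audio, "file", Path.GetFileName(filePath));
        form.Add(new StringContent("whisper-large-v3-turbo"), "model");

        using var response = await _http.PostAsync("audio/transcriptions", form);
        response.EnsureSuccessStatusCode();
        return ParseText(await response.Content.ReadAsStringAsync());
    }

    // Extracts the "text" field from a response body such as {"text":"hello world"}.
    public static string ParseText(string json)
    {
        using var doc = JsonDocument.Parse(json);
        return doc.RootElement.GetProperty("text").GetString() ?? "";
    }
}
```

Keeping `ParseText` separate from the HTTP call makes the response handling testable without network access.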
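The prompt assembly and the refinement request body could be sketched like this. The base prompt, the module keys, and the module texts are hypothetical placeholders; only the safeguard sentence, the model name, the temperature, and the `<transcript>` wrapper come from the plan. The JSON layout assumes an OpenAI-compatible chat completions schema.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.Json;

public static class PromptBuilder
{
    // Hypothetical base prompt and module texts, for illustration only.
    private const string BasePrompt = "You clean up dictated text.";
    private const string Safeguard =
        "Output ONLY the corrected text for the data inside the <transcript> tags.";

    private static readonly Dictionary<string, string> Modules = new()
    {
        ["punctuation"] = "Add punctuation and capitalization.",
        ["sap"] = "Spell SAP/HANA terms correctly.",
    };

    // Appends the rule of every enabled, known module, then the safeguard sentence.
    public static string Build(IEnumerable<string> enabledModules)
    {
        var sb = new StringBuilder(BasePrompt);
        foreach (var name in enabledModules)
            if (Modules.TryGetValue(name, out var rule))
                sb.Append(' ').Append(rule);
        return sb.Append(' ').Append(Safeguard).ToString();
    }

    // Builds the JSON request body for the refinement call (assumed chat schema).
    public static string BuildRequestJson(string rawTranscript, string systemPrompt) =>
        JsonSerializer.Serialize(new
        {
            model = "llama-3.1-8b-instant",
            temperature = 0.0,
            messages = new[]
            {
                new { role = "system", content = systemPrompt },
                new { role = "user", content = $"<transcript>{rawTranscript}</transcript>" },
            },
        });
}
```

`RefineTextAsync` would post this body to the chat endpoint and read the first choice's message content. Placing the safeguard last keeps it as the final instruction no matter which modules are enabled.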
## Phase 4: Text Injection

**Goal:** Inject the final string into the active Linux window.

1. **Injector Class:**
   * Build a utility class with an `Inject(string text)` method.
   * Branch based on the user's display server (Wayland vs. X11).
   * **Wayland:** Execute `wtype "text"` (or `ydotool`).
   * **X11:** Execute `xdotool type --clearmodifiers --delay 0 "text"`.
   * *Alternative:* Copy the text to the clipboard and simulate `Ctrl+V`.
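One way to sketch the injector is shown below. Detecting the session through the `XDG_SESSION_TYPE` environment variable is an assumption of this example; the plan may instead rely on the backend chosen during `setup`. The clipboard alternative is not covered.

```csharp
using System;
using System.Diagnostics;

public static class Injector
{
    // Assumption: the session type comes from XDG_SESSION_TYPE ("wayland" or "x11").
    public static bool IsWayland(string? sessionType) =>
        string.Equals(sessionType, "wayland", StringComparison.OrdinalIgnoreCase);

    // Builds `wtype <text>` on Wayland, or
    // `xdotool type --clearmodifiers --delay 0 <text>` on X11.
    public static ProcessStartInfo BuildCommand(string text, bool isWayland)
    {
        var psi = new ProcessStartInfo(isWayland ? "wtype" : "xdotool")
        {
            UseShellExecute = false,
        };
        if (!isWayland)
            foreach (var arg in new[] { "type", "--clearmodifiers", "--delay", "0" })
                psi.ArgumentList.Add(arg);
        psi.ArgumentList.Add(text); // a single argument, so spaces and quotes need no escaping
        return psi;
    }

    public static void Inject(string text)
    {
        var wayland = IsWayland(Environment.GetEnvironmentVariable("XDG_SESSION_TYPE"));
        using var process = Process.Start(BuildCommand(text, wayland));
        process?.WaitForExit();
    }
}
```

Because the text travels as its own argument and never through a shell, dictated content containing quotes or `$` characters is typed literally.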
## Phase 5: Integration & Polish

**Goal:** Tie it all together and ensure performance and robustness.

1. **Workflow Orchestrator:**
   * Combine the phases: `Toggle Stop` -> `Stop ffmpeg` -> `TranscribeAsync` -> `RefineTextAsync` -> `Inject`.
2. **Dependency Checking:**
   * On startup, verify that `ffmpeg`, `notify-send`, and the chosen typing utility (`wtype`/`xdotool`) are available on the system `PATH`.
3. **Performance Tuning:**
   * Ensure the STT and LLM HTTP calls are asynchronous and do not block.
   * Target < 1.5s total latency from the stop toggle to keystroke injection.
4. **Error Handling:**
   * Add a graceful fallback for when the STT returns an empty result or network connectivity is lost. Notify the user via `notify-send`.
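The dependency check could be sketched as a plain `PATH` scan, as below. The `DependencyChecker` name is hypothetical, and the check only tests that a file with the tool's name exists, not that it is executable.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class DependencyChecker
{
    // Returns the tools that are not found in any directory of the given PATH value.
    public static List<string> FindMissing(IEnumerable<string> tools, string? pathVariable)
    {
        var dirs = (pathVariable ?? "")
            .Split(Path.PathSeparator, StringSplitOptions.RemoveEmptyEntries);

        return tools
            .Where(tool => !dirs.Any(dir => File.Exists(Path.Combine(dir, tool))))
            .ToList();
    }
}
```

At startup this might be called as `DependencyChecker.FindMissing(new[] { "ffmpeg", "notify-send", "xdotool" }, Environment.GetEnvironmentVariable("PATH"))`, with any returned names reported to the user before recording begins.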