Relevant source files
The following files were used as context for generating this wiki page:
- [src/omega13/transcription.py](https://github.com/b08x/omega-13/blob/main/src/omega13/transcription.py)
- [src/omega13/app.py](https://github.com/b08x/omega-13/blob/main/src/omega13/app.py)
- [src/omega13/session.py](https://github.com/b08x/omega-13/blob/main/src/omega13/session.py)
- [src/omega13/config.py](https://github.com/b08x/omega-13/blob/main/src/omega13/config.py)
- [README.md](https://github.com/b08x/omega-13/blob/main/README.md)
- [src/omega13/ui.py](https://github.com/b08x/omega-13/blob/main/src/omega13/ui.py)

Transcription & Whisper Integration
The transcription system in Omega-13 functions as a decoupled, asynchronous pipeline that bridges local audio capture with an externalized AI inference engine. It relies on a containerized whisper-server communicating via an HTTP API to transform .wav recordings into text, which is then processed for deduplication and session persistence.
1. System Architecture and Data Flow
The integration is built around the TranscriptionService class, which manages the lifecycle of transcription requests. The flow begins when the Omega13App triggers a recording stop, resulting in a saved audio file that is then dispatched to the service.
Transcription Pipeline Flow
The following diagram illustrates the sequence from audio finalization to text delivery.
Sources: [src/omega13/transcription.py:#L83-L110], [src/omega13/app.py:#L145-L160], [README.md]
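The same sequence can be sketched in code. This is a minimal illustration under stated assumptions, not the project's implementation: `fake_inference`, `on_recording_stopped`, and the `(text, error)` callback shape are hypothetical names chosen for the example, and the HTTP call to the whisper-server is replaced by a local stand-in so the sketch runs without a server.

```python
import threading
from pathlib import Path
from typing import Callable, Optional


def fake_inference(wav_path: Path) -> str:
    """Hypothetical stand-in for the HTTP request to the whisper-server."""
    return f"transcript of {wav_path.name}"


def transcribe_async(
    wav_path: Path,
    infer: Callable[[Path], str],
    on_done: Callable[[Optional[str], Optional[str]], None],
) -> threading.Thread:
    """Run inference off the caller's thread and report back via a callback."""

    def worker() -> None:
        try:
            on_done(infer(wav_path), None)
        except Exception as exc:  # surface failures as an error string
            on_done(None, str(exc))

    thread = threading.Thread(target=worker, daemon=False)
    thread.start()
    return thread


def on_recording_stopped(wav_path: Path, results: list) -> threading.Thread:
    """Hypothetical app-side hook: dispatch the saved file, collect the result."""
    return transcribe_async(
        wav_path,
        fake_inference,
        lambda text, err: results.append((text, err)),
    )
```

The caller gets the thread back immediately, so the UI stays responsive while the result arrives later through the callback.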
2. Core Components and Mechanisms
TranscriptionService
This class encapsulates the HTTP logic and threading required to interact with the Whisper backend. It uses a threading.Thread with daemon=False to ensure that transcription tasks are not abruptly killed during application exit, though it implements a _shutdown_event for cooperative termination.
Key Attributes:
| Attribute | Description | Source |
|---|---|---|
| server_url | Base URL for the whisper-server (default: http://localhost:8080) | [src/omega13/transcription.py:#L36] |
| endpoint | Concatenation of URL and /inference path | [src/omega13/transcription.py:#L43] |
| timeout | 600-second limit for inference requests | [src/omega13/transcription.py:#L38] |
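The threading behavior described above can be approximated as follows. This is a rough sketch, not the actual TranscriptionService: the class name `ServiceSketch`, the `steps` loop, the `join_timeout` parameter, and the trailing-slash handling are assumptions made for illustration. Only the attribute names, the defaults, `daemon=False`, and the `_shutdown_event` come from the description above.

```python
import threading


class ServiceSketch:
    """Hypothetical stand-in for TranscriptionService; not the project's code."""

    def __init__(self, server_url: str = "http://localhost:8080", timeout: int = 600):
        # Trailing-slash handling is an assumption added for this sketch.
        self.server_url = server_url.rstrip("/")
        self.endpoint = f"{self.server_url}/inference"
        self.timeout = timeout
        self._shutdown_event = threading.Event()
        self._threads = []

    def transcribe_async(self, job, steps: int = 50) -> threading.Thread:
        def worker() -> None:
            for _ in range(steps):
                # Cooperative check: the worker stops only when it looks at the event.
                if self._shutdown_event.is_set():
                    return
                job()
                # Event.wait() doubles as a sleep that ends early on shutdown.
                self._shutdown_event.wait(0.01)

        # daemon=False: the interpreter waits for this thread at exit.
        thread = threading.Thread(target=worker, daemon=False)
        self._threads.append(thread)
        thread.start()
        return thread

    def shutdown(self, join_timeout: float = 2.0) -> None:
        self._shutdown_event.set()
        for thread in self._threads:
            thread.join(timeout=join_timeout)
```

Because the check is cooperative, a worker that is stuck inside a single long blocking call cannot notice the event until that call returns.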
Data Structures
The system utilizes a TranscriptionResult dataclass to pass structured information back to the UI.
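```python
from dataclasses import dataclass
from typing import Optional

# TranscriptionStatus is defined elsewhere in the project (not shown here).
@dataclass
class TranscriptionResult:
    text: str
    status: TranscriptionStatus
    error: Optional[str] = None
    segments: Optional[list[dict]] = None
    language: Optional[str] = None
    duration: Optional[float] = None
```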
Sources: [src/omega13/transcription.py:#L24-L31]
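To make the shape of this structure concrete, the sketch below maps a JSON-style response onto a TranscriptionResult. Several parts are assumptions: the enum member `COMPLETED` and the enum values are hypothetical (only an ERROR status is mentioned on this page), the helper `result_from_response` is invented for illustration, and the `segments`, `language`, and `duration` response keys are assumed to mirror the field names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TranscriptionStatus(Enum):
    COMPLETED = "completed"  # hypothetical member name
    ERROR = "error"


@dataclass
class TranscriptionResult:
    text: str
    status: TranscriptionStatus
    error: Optional[str] = None
    segments: Optional[list[dict]] = None
    language: Optional[str] = None
    duration: Optional[float] = None


def result_from_response(payload: dict) -> TranscriptionResult:
    """Hypothetical helper: build a result from a decoded server response."""
    text = payload.get("text")
    if not isinstance(text, str):
        # A response without a usable "text" key is treated as a failure.
        return TranscriptionResult(
            text="",
            status=TranscriptionStatus.ERROR,
            error="response missing 'text'",
        )
    return TranscriptionResult(
        text=text.strip(),
        status=TranscriptionStatus.COMPLETED,
        segments=payload.get("segments"),
        language=payload.get("language"),
        duration=payload.get("duration"),
    )
```

Keeping the optional fields defaulted to None lets the UI treat a bare `text` response and a richer one the same way.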
3. Session Integration and Deduplication
A critical, if somewhat awkward, architectural detail is how the system handles overlapping audio. Because Omega-13 captures “13 seconds before” the trigger, consecutive recordings often contain redundant speech. The Session class in session.py addresses this by performing word-based suffix-prefix matching.
Deduplication Logic
The add_transcription method compares the new text against the last five entries in the session history. It identifies the longest overlapping word sequence and strips it from the new segment before saving.
```python
history_context = " ".join(self.transcriptions[-5:]).split()
new_words = new_text.split()

# Find the longest suffix of history that matches the prefix of new_words
max_overlap = 0
for i in range(1, min(len(history_context), len(new_words)) + 1):
    if history_context[-i:] == new_words[:i]:
        max_overlap = i

unique_segment = " ".join(new_words[max_overlap:])
```
Sources: [src/omega13/session.py:#L22-L55]
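A standalone version of this logic makes the behavior easier to try out. The function name `strip_overlap` and the `window` parameter are hypothetical. The matching itself follows the excerpt above, with the history passed in explicitly rather than read from the session.

```python
def strip_overlap(history: list[str], new_text: str, window: int = 5) -> str:
    """Remove the prefix of new_text that repeats the tail of recent history."""
    history_words = " ".join(history[-window:]).split()
    new_words = new_text.split()

    # Keep the largest i where the last i history words equal the first i new words.
    max_overlap = 0
    for i in range(1, min(len(history_words), len(new_words)) + 1):
        if history_words[-i:] == new_words[:i]:
            max_overlap = i

    return " ".join(new_words[max_overlap:])


history = ["we should ship the fix", "before the release on friday"]
new_text = "the release on friday and then update the docs"
print(strip_overlap(history, new_text))  # and then update the docs
```

Here the four words "the release on friday" close the history and open the new text, so only the remainder is kept. A recording that repeats the history tail exactly yields an empty string.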
4. Configuration and Environment
The transcription behavior is governed by the ConfigManager. While the system defaults to a local server, it is hardcoded to expect specific response keys like text.
| Config Field | Default Value | Description |
|---|---|---|
| enabled | True | Global toggle for transcription |
| server_url | http://localhost:8080 | Endpoint for the Docker container |
| model_size | large-v3-turbo | The specific Whisper model requested |
| copy_to_clipboard | False (UI Toggle) | Automatic sync to system clipboard |
Sources: [src/omega13/config.py:#L32-L40], [src/omega13/app.py:#L100]
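As an illustration of how such defaults might combine with user settings, the sketch below merges a JSON document over the values in the table. It does not reflect the real ConfigManager API: the function name `merge_transcription_config`, the JSON format, and the `transcription` section key are all assumptions. Only the field names and defaults are taken from the table.

```python
import json

# Field names and defaults as listed in the table above.
DEFAULTS = {
    "enabled": True,
    "server_url": "http://localhost:8080",
    "model_size": "large-v3-turbo",
    "copy_to_clipboard": False,
}


def merge_transcription_config(user_json: str) -> dict:
    """Hypothetical helper: overlay user settings on the defaults."""
    user = json.loads(user_json) if user_json.strip() else {}
    # The "transcription" section name is an assumption for this sketch.
    section = user.get("transcription", {})
    # Ignore keys that are not known fields instead of passing them through.
    overrides = {k: v for k, v in section.items() if k in DEFAULTS}
    return {**DEFAULTS, **overrides}


cfg = merge_transcription_config(
    '{"transcription": {"server_url": "http://gpu-box:8080"}}'
)
print(cfg["server_url"], cfg["model_size"])
```

Building a new dict on each call leaves DEFAULTS untouched, so an empty or partial user file always falls back to the documented values.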
5. Structural Observations and Contradictions
The architecture presents a few interesting operational tendencies:
- Dependency Paradox: The application is designed as a TUI, yet it is functionally useless for its primary purpose without a heavy Docker-based CUDA backend. If the whisper-server is down, the TranscriptionService returns an ERROR status, but the audio is still saved to the session, creating a “silent” session metadata file.
- Thread Management: The switch from daemon=True to daemon=False in transcribe_async indicates a move toward “cooperative shutdown.” However, the code still uses a _shutdown_event that requires the worker thread to check it manually, which could still lead to hangs if the requests.post call is blocked mid-inference.
- Deduplication Sensitivity: The deduplication relies entirely on exact word matches. Any slight variation in Whisper’s output for the same audio (due to hallucinations or temperature) will cause the deduplication to fail, resulting in stuttered text in the final session file.
Sources: [src/omega13/transcription.py:#L107], [src/omega13/session.py:#L45], [CHANGELOG.md]
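The deduplication sensitivity is easy to reproduce. In the sketch below, `overlap_len` mirrors the exact-match comparison described earlier. The `key` parameter and the `loose` normalizer are hypothetical additions, included only to show how a difference in case or punctuation changes the outcome; nothing here suggests the project normalizes words this way.

```python
import string


def overlap_len(history_words, new_words, key=lambda w: w):
    """Length of the longest history suffix that equals a prefix of new_words."""
    h = [key(w) for w in history_words]
    n = [key(w) for w in new_words]
    best = 0
    for i in range(1, min(len(h), len(n)) + 1):
        if h[-i:] == n[:i]:
            best = i
    return best


def loose(word: str) -> str:
    """Hypothetical normalizer: ignore surrounding punctuation and case."""
    return word.strip(string.punctuation).casefold()


history = "send the report by Friday.".split()
new = "by friday and copy the team".split()

print(overlap_len(history, new))             # 0: "Friday." != "friday"
print(overlap_len(history, new, key=loose))  # 2: "by friday" matches
```

With exact matching, the repeated phrase survives into the session text because one token differs by a capital letter and a period.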
Conclusion
The Transcription & Whisper Integration is the primary data-consumer of the Omega-13 system. It transforms transient audio buffers into persistent, deduplicated text through an asynchronous HTTP-based bridge to a containerized inference engine. Its structural significance lies in its role as the final stage of the “retroactive” pipeline, ensuring that captured thoughts are not just recorded, but immediately actionable via the clipboard and session storage.