Speech Recognition (STT)¶
Turning spoken audio into text is a problem on its own, useful even if you never touch the rest of S.T.A.R.K. This page covers S.T.A.R.K.'s speech-recognition layer: the protocol it's built around, ready-to-use implementations, and how to plug in your own engine.
S.T.A.R.K. defines the protocol so that any backend is a drop-in replacement inside S.T.A.R.K. itself, but the protocol is deliberately framework-agnostic: it does not depend on the rest of S.T.A.R.K. Wrap any STT engine behind it and you get the same swappability in any other Python project, with a couple of ready-made implementations included out of the box.
Use it standalone¶
SpeechRecognizer implementations don't require CommandsManager or any other part of the framework; they're a thin, swappable interface around an STT engine.
from stark.interfaces.vosk import VoskSpeechRecognizer

recognizer = VoskSpeechRecognizer(model_url='...')  # pick a model: https://alphacephei.com/vosk/models

class PrintResults:
    async def speech_recognizer_did_receive_final_result(self, result: str):
        print('Final:', result)

    async def speech_recognizer_did_receive_partial_result(self, result: str):
        print('Partial:', result)

    async def speech_recognizer_did_receive_empty_result(self):
        pass  # nothing was heard; ignore

recognizer.delegate = PrintResults()

# Call this from inside an async function: `await` is not valid at the top level of a script.
await recognizer.start_listening()  # transcribes the microphone, no commands involved
Want it wired into a full assistant? See How to Run.
Ready Implementations¶
VoskSpeechRecognizer¶
Offline speech recognition via the Vosk library. Downloads and caches the chosen model on first use; no internet required after that.
VoskSpeechRecognizer(model_url: str, language_code: str | None = None, speaker_model_url: str | None = None)
Pass speaker_model_url to enable speaker-embedding extraction. The embeddings are used for cross-recognizer matching when running multiple languages simultaneously. They are not yet used for diarization, but the data is captured.
The Protocol¶
from typing import Protocol, runtime_checkable

@runtime_checkable
class SpeechRecognizerDelegate(Protocol):
    async def speech_recognizer_did_receive_final_result(self, result: str): pass
    async def speech_recognizer_did_receive_partial_result(self, result: str): pass
    async def speech_recognizer_did_receive_empty_result(self): pass

@runtime_checkable
class SpeechRecognizer(Protocol):
    is_recognizing: bool
    delegate: SpeechRecognizerDelegate | None

    async def start_listening(self): pass
    def stop_listening(self): pass
SpeechRecognizerDelegate receives the results: a final transcript, an interim/partial transcript (useful for live captions or barge-in detection), or a signal that nothing was heard.
SpeechRecognizer is the input side: start_listening / stop_listening control the mic (or whatever audio source you point it at), and is_recognizing reports the current state.
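Because both protocols are marked @runtime_checkable, conformance can be checked with isinstance, with no inheritance needed. The sketch below redeclares the delegate protocol locally so it runs without S.T.A.R.K. installed; TranscriptCollector and Incomplete are hypothetical names used only for illustration. Note that a runtime check only verifies that the members exist, not their signatures or whether they are async.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SpeechRecognizerDelegate(Protocol):
    # Local copy of the protocol shown above, so the example is self-contained.
    async def speech_recognizer_did_receive_final_result(self, result: str): ...
    async def speech_recognizer_did_receive_partial_result(self, result: str): ...
    async def speech_recognizer_did_receive_empty_result(self): ...


class TranscriptCollector:
    """Hypothetical delegate: keeps final transcripts, ignores everything else."""

    def __init__(self):
        self.finals = []

    async def speech_recognizer_did_receive_final_result(self, result: str):
        self.finals.append(result)

    async def speech_recognizer_did_receive_partial_result(self, result: str):
        pass

    async def speech_recognizer_did_receive_empty_result(self):
        pass


class Incomplete:
    """Hypothetical class that is missing two of the three delegate methods."""

    async def speech_recognizer_did_receive_final_result(self, result: str):
        pass


# Structural check: all three methods present, so this conforms.
is_ok = isinstance(TranscriptCollector(), SpeechRecognizerDelegate)
# Two methods are missing, so this does not conform.
is_bad = isinstance(Incomplete(), SpeechRecognizerDelegate)
print(is_ok, is_bad)
```

A check like this can be a cheap guard when accepting a delegate from user code, before any audio is processed.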
Implementing Your Own¶
Any class that satisfies the SpeechRecognizer protocol is a drop-in replacement: pass it to run() exactly like VoskSpeechRecognizer. Use the actual VoskSpeechRecognizer source as a reference implementation.
This is exactly where a cloud STT backend (Whisper API, Google Speech-to-Text, Azure) or a different offline engine would plug in, and a great first contribution if one doesn't exist yet. See Roadmap.
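As a rough illustration of what a custom backend has to provide, here is a minimal sketch of a recognizer that satisfies the protocol without any audio at all. ScriptedSpeechRecognizer and Recorder are hypothetical names, not part of S.T.A.R.K.; the protocols are redeclared locally so the example runs on its own; and asyncio.run is used only to drive the demo. A real engine would replace the loop over canned utterances with reading audio and calling its STT library.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class SpeechRecognizerDelegate(Protocol):  # local copy of the protocol above
    async def speech_recognizer_did_receive_final_result(self, result: str): ...
    async def speech_recognizer_did_receive_partial_result(self, result: str): ...
    async def speech_recognizer_did_receive_empty_result(self): ...


@runtime_checkable
class SpeechRecognizer(Protocol):  # local copy of the protocol above
    is_recognizing: bool
    delegate: "SpeechRecognizerDelegate | None"
    async def start_listening(self): ...
    def stop_listening(self): ...


class ScriptedSpeechRecognizer:
    """Hypothetical stand-in engine that 'hears' a fixed list of utterances."""

    def __init__(self, utterances):
        self.utterances = list(utterances)
        self.is_recognizing = False
        self.delegate = None

    async def start_listening(self):
        self.is_recognizing = True
        for utterance in self.utterances:
            if not self.is_recognizing:  # stop_listening() was called
                break
            await self._emit(utterance)
            await asyncio.sleep(0)  # yield so other tasks can run
        self.is_recognizing = False

    def stop_listening(self):
        self.is_recognizing = False

    async def _emit(self, utterance):
        if self.delegate is None:
            return
        words = utterance.split()
        if not words:  # silence: report an empty result
            await self.delegate.speech_recognizer_did_receive_empty_result()
            return
        # Growing prefixes play the role of interim (partial) transcripts.
        for i in range(1, len(words)):
            partial = " ".join(words[:i])
            await self.delegate.speech_recognizer_did_receive_partial_result(partial)
        await self.delegate.speech_recognizer_did_receive_final_result(" ".join(words))


class Recorder:
    """Hypothetical delegate that records every callback it receives."""

    def __init__(self):
        self.events = []

    async def speech_recognizer_did_receive_final_result(self, result: str):
        self.events.append(("final", result))

    async def speech_recognizer_did_receive_partial_result(self, result: str):
        self.events.append(("partial", result))

    async def speech_recognizer_did_receive_empty_result(self):
        self.events.append(("empty", None))


async def demo():
    recognizer = ScriptedSpeechRecognizer(["turn on the lights", ""])
    recognizer.delegate = Recorder()
    await recognizer.start_listening()
    return recognizer, recognizer.delegate.events


recognizer, events = asyncio.run(demo())
conforms = isinstance(recognizer, SpeechRecognizer)
print(conforms, events)
```

A scripted recognizer like this is also handy in tests: it lets you exercise delegate logic deterministically, without a microphone or a model download.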