
Speech Synthesis (TTS)

Turning text into spoken audio is a self-contained problem, and solving it is useful even if you never touch the rest of S.T.A.R.K. This page covers S.T.A.R.K.'s speech-synthesis layer: the protocol it's built around, the ready-to-use implementations, and how to plug in your own engine.

S.T.A.R.K. defines the protocol so that any backend is a drop-in replacement inside the framework, but the protocol itself is deliberately framework-agnostic. Wrap any TTS engine behind it and you get the same swappability in any other Python project. A couple of ready-made implementations are included out of the box.

Use it standalone

SpeechSynthesizer implementations don't require CommandsManager or any other part of the framework; they're a thin, swappable interface around a TTS engine.

import asyncio

from stark.interfaces.silero import SileroSpeechSynthesizer

async def main():
    synthesizer = SileroSpeechSynthesizer(model_url='...')  # pick a model: https://github.com/snakers4/silero-models
    result = await synthesizer.synthesize('Hello, Stark!')
    await result.play()  # plays through your default audio output, no commands involved

asyncio.run(main())  # synthesize() and play() are coroutines, so they need an event loop

Want it wired into a full assistant? See How to Run.

Ready Implementations

SileroSpeechSynthesizer

Offline synthesis via Silero models.

SileroSpeechSynthesizer(
    model_url: str,
    speaker: str = 'baya',
    threads: int = 4,
    device: str = 'cpu',
    torch_backends_quantized_engine: str = 'qnnpack',
)

GCloudSpeechSynthesizer

Cloud synthesis via Google Cloud Text-to-Speech. Requires credentials configured ahead of time.

GCloudSpeechSynthesizer(voice_name: str, language_code: str, json_key_path: str)

The Protocol

from typing import Protocol, runtime_checkable

@runtime_checkable
class SpeechSynthesizerResult(Protocol):
    async def play(self): pass

@runtime_checkable
class SpeechSynthesizer(Protocol):
    async def synthesize(self, text: str) -> SpeechSynthesizerResult: pass
  • SpeechSynthesizer takes text and returns a result object. Synthesis and playback are separate steps, so you can synthesize ahead of time, queue results, or route audio elsewhere instead of playing immediately.
  • SpeechSynthesizerResult wraps whatever the backend produced and knows how to play itself via play().
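Because synthesis and playback are separate calls, one possible pattern is to synthesize several lines up front and play them back in order afterwards. The sketch below is only an illustration under stated assumptions: it repeats the two protocols so it runs without S.T.A.R.K. installed, and PrintSynthesizer, PrintResult, and speak_all are hypothetical names that record text instead of producing audio.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class SpeechSynthesizerResult(Protocol):
    async def play(self): ...


@runtime_checkable
class SpeechSynthesizer(Protocol):
    async def synthesize(self, text: str) -> SpeechSynthesizerResult: ...


class PrintResult:
    """Hypothetical result: records its text instead of producing audio."""

    def __init__(self, text: str, played: list):
        self.text = text
        self._played = played

    async def play(self):
        self._played.append(self.text)


class PrintSynthesizer:
    """Hypothetical stand-in backend so the sketch runs without a TTS engine."""

    def __init__(self):
        self.played = []

    async def synthesize(self, text: str) -> PrintResult:
        return PrintResult(text, self.played)


async def speak_all(synthesizer: SpeechSynthesizer, lines: list) -> None:
    # Step 1: synthesize every line up front; gather() keeps the input order.
    results = await asyncio.gather(*(synthesizer.synthesize(line) for line in lines))
    # Step 2: play the prepared results one at a time.
    for result in results:
        await result.play()


demo = PrintSynthesizer()
asyncio.run(speak_all(demo, ['First.', 'Second.', 'Third.']))
```

Since speak_all depends only on the protocol, swapping the stand-in for a real backend should not require changing it.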

Implementing Your Own

Any class that satisfies the SpeechSynthesizer protocol is a drop-in replacement: pass it to run() exactly as you would SileroSpeechSynthesizer. The SileroSpeechSynthesizer source is a good reference implementation.

This is exactly where another cloud TTS backend (ElevenLabs, Azure, Amazon Polly) or a different offline engine would plug in, and a great first contribution if one doesn't exist yet. See Roadmap.
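As a rough starting point, a custom backend might look like the sketch below. It is not taken from the S.T.A.R.K. source. FakeEngineSynthesizer, BufferedResult, and the _render helper are hypothetical, and the "engine" just encodes the text to bytes so the example runs anywhere. Running the blocking engine call in a worker thread via asyncio.to_thread is an assumption about how you might keep the event loop responsive, not something the protocol requires. The protocols are repeated so the block is self-contained.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@runtime_checkable
class SpeechSynthesizerResult(Protocol):
    async def play(self): ...


@runtime_checkable
class SpeechSynthesizer(Protocol):
    async def synthesize(self, text: str) -> SpeechSynthesizerResult: ...


@dataclass
class BufferedResult:
    """Hypothetical result holding raw audio bytes and a place to send them."""
    audio: bytes
    sink: list  # stands in for a real audio output device

    async def play(self):
        # A real backend would hand the bytes to an audio player here.
        self.sink.append(self.audio)


class FakeEngineSynthesizer:
    """Hypothetical backend wrapping a blocking 'engine' call."""

    def __init__(self, voice: str = 'demo'):
        self.voice = voice
        self.sink = []

    def _render(self, text: str) -> bytes:
        # Placeholder for a blocking TTS engine or HTTP call.
        return f'{self.voice}:{text}'.encode()

    async def synthesize(self, text: str) -> BufferedResult:
        # Run the blocking call in a worker thread (Python 3.9+).
        audio = await asyncio.to_thread(self._render, text)
        return BufferedResult(audio, self.sink)


synth = FakeEngineSynthesizer()
result = asyncio.run(synth.synthesize('hi'))
asyncio.run(result.play())
```

Because both protocols are decorated with runtime_checkable, isinstance(synth, SpeechSynthesizer) can serve as a quick sanity check. It only verifies that the methods exist, not their signatures or that they are coroutines.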