Voice Clone (TTS)
RapBotRito can generate spoken-word audio from its own responses. The TTS pipeline is built as a provider adapter so vendors can be swapped, but the current production implementation uses ElevenLabs for the Rito Rhymes voice clone.
User Flow
- The assistant bubble renders a “Play Audio” button (AssistantAudioButton).
- Clicking the button unlocks audio on the current gesture and either replays cached audio or requests TTS for the current assistant message.
- New audio responses are cached in memory and routed through the global Music Bar for playback.
Request + Playback Pipeline
- AssistantAudioButton.tsx builds a speakable string via extractSpeakableText, which removes inline tool tags and normalizes whitespace so the TTS only receives words meant to be spoken.
- When no cached audio exists, the button POSTs { text, messageId } to ttsPublicConfig.apiPath and includes Authorization: Bearer when JWT gating is enabled.
- /api/tts verifies JWTs (if configured), resolves the provider from ttsServerConfig, and calls synthesizeElevenLabs for ElevenLabs.
- The audio ArrayBuffer is stored in useTtsAudioStore (keyed by messageId plus a text hash) and sent to MusicProvider.loadArrayBuffer for playback.
- If the message text changes, the cached entry is cleared and regenerated.
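The first two steps of the pipeline can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the <tool> tag syntax, the Sketch-suffixed names, and the jwt parameter are hypothetical, since the real tag format and token source are not shown in this document.

```typescript
// Assumed tag syntax: the real inline tool tag format is not documented here.
// Self-closing tags are matched first so they cannot swallow later text.
const TOOL_TAG_PATTERN = /<tool\b[^>]*\/>|<tool\b[^>]*>[\s\S]*?<\/tool>/gi;

// Strip (assumed) tool tags, then collapse whitespace runs to single spaces.
export function extractSpeakableTextSketch(raw: string): string {
  return raw.replace(TOOL_TAG_PATTERN, " ").replace(/\s+/g, " ").trim();
}

export interface TtsRequestSketch {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build the POST described above. Pass `jwt` only when JWT gating is enabled.
export function buildTtsRequestSketch(
  apiPath: string,
  text: string,
  messageId: string,
  jwt?: string,
): TtsRequestSketch {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (jwt) {
    headers["Authorization"] = `Bearer ${jwt}`;
  }
  return {
    url: apiPath,
    method: "POST",
    headers,
    body: JSON.stringify({ text, messageId }),
  };
}
```

The returned object maps directly onto a fetch(url, { method, headers, body }) call, and the response's arrayBuffer() would yield the bytes to cache.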
Playback Behavior
- If the current track id matches the message id, the button seeks to 0 and plays.
- If audio exists in useTtsAudioStore but is not loaded, it loads the cached ArrayBuffer with MusicProvider.loadArrayBuffer and plays.
- If no cached audio exists (or the text hash changed), it requests /api/tts, stores the result, then plays.
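The three cases above amount to a small decision function. The sketch below is one possible reading, with hypothetical names throughout; in particular, it assumes a missing or stale cache entry is checked before the track id comparison, which the description does not spell out.

```typescript
export type PlaybackActionSketch = "replay" | "load-cached" | "request";

export interface CachedEntrySketch {
  status: "idle" | "generating" | "ready";
  textHash: string;
  arrayBuffer?: ArrayBuffer;
}

// Decide what a click on the button should do.
// Assumption: a missing or stale entry always wins over a matching track id.
export function decidePlaybackSketch(
  messageId: string,
  textHash: string,
  currentTrackId: string | null,
  entry: CachedEntrySketch | undefined,
): PlaybackActionSketch {
  const usable =
    entry !== undefined &&
    entry.status === "ready" &&
    entry.textHash === textHash &&
    entry.arrayBuffer !== undefined;

  if (!usable) {
    return "request"; // no cached audio, or the text hash changed
  }
  if (currentTrackId === messageId) {
    return "replay"; // already loaded: seek to 0 and play
  }
  return "load-cached"; // cached but not loaded: hand the bytes to the player
}
```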
Local Audio Caching
The useTtsAudioStore keeps a per-message record:
- status: idle, generating, or ready.
- textHash: invalidates cached audio if the assistant message changes.
- arrayBuffer: raw bytes used by MusicProvider for immediate playback.
This keeps audio local and avoids re-fetching when a user hits Play again.
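A dependency-free sketch of such a per-message cache is shown below. It is not the actual useTtsAudioStore: the class, its method names, and the FNV-1a text hash are stand-ins chosen for illustration, and the real store may use a different state library and hash.

```typescript
export type TtsStatusSketch = "idle" | "generating" | "ready";

export interface TtsAudioRecordSketch {
  status: TtsStatusSketch;
  textHash: string;
  arrayBuffer?: ArrayBuffer;
}

// FNV-1a over UTF-16 code units. A stand-in hash, not necessarily the project's.
export function hashTextSketch(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

export class TtsAudioCacheSketch {
  private records = new Map<string, TtsAudioRecordSketch>();

  markGenerating(messageId: string, text: string): void {
    this.records.set(messageId, { status: "generating", textHash: hashTextSketch(text) });
  }

  markReady(messageId: string, text: string, arrayBuffer: ArrayBuffer): void {
    this.records.set(messageId, { status: "ready", textHash: hashTextSketch(text), arrayBuffer });
  }

  // Return cached bytes only if the entry is ready and the text is unchanged.
  getPlayable(messageId: string, text: string): ArrayBuffer | undefined {
    const record = this.records.get(messageId);
    if (!record || record.status !== "ready") {
      return undefined;
    }
    if (record.textHash !== hashTextSketch(text)) {
      this.records.delete(messageId); // stale: the message text changed
      return undefined;
    }
    return record.arrayBuffer;
  }

  remove(messageId: string): void {
    this.records.delete(messageId);
  }

  clear(): void {
    this.records.clear();
  }
}
```

Because the record lives only in memory, nothing here survives a page refresh, which matches the behavior described for the real store.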
Cleanup & Reset
- useTtsAudioStore.clear() runs when the conversation is reset (messages length returns to 0 in dapp/components/chatBot/index.tsx).
- The store is in-memory, so a page refresh clears cached voice clips.
- Errors clear the per-message entry so the next click re-attempts generation.
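The reset and error rules can be sketched with a plain Map standing in for the store. All names here are hypothetical, and the reset check assumes the component can compare the previous and current message counts; the real wiring in the chat component may differ.

```typescript
export interface EntrySketch {
  status: "generating" | "ready";
  arrayBuffer?: ArrayBuffer;
}

// Run a synthesis call; on failure, drop the entry so the next click retries.
export async function generateWithCleanupSketch(
  entries: Map<string, EntrySketch>,
  messageId: string,
  synthesize: () => Promise<ArrayBuffer>,
): Promise<ArrayBuffer | null> {
  entries.set(messageId, { status: "generating" });
  try {
    const arrayBuffer = await synthesize();
    entries.set(messageId, { status: "ready", arrayBuffer });
    return arrayBuffer;
  } catch {
    entries.delete(messageId); // leave no half-finished record behind
    return null;
  }
}

// Clear everything when the conversation goes from non-empty to empty.
export function clearOnResetSketch(
  entries: Map<string, EntrySketch>,
  previousLength: number,
  nextLength: number,
): boolean {
  if (previousLength > 0 && nextLength === 0) {
    entries.clear();
    return true;
  }
  return false;
}
```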
Provider Adapter
TTS providers live under dapp/app/lib/tts/providers. The registry selects the active provider from TTS_PROVIDER.
- elevenlabs: current production provider.
- disabled: returns a 503 from /api/tts.
The adapter layer is intentionally thin so we can add new providers without changing the UI.
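A thin registry of this kind might look like the sketch below. The interface, the stub provider, and the fallback for unknown names are assumptions made for illustration; the stub returns placeholder bytes rather than calling any vendor API, and the provider name is passed in as a parameter instead of being read from TTS_PROVIDER directly.

```typescript
export interface TtsProviderSketch {
  readonly name: string;
  synthesize(text: string): Promise<ArrayBuffer>;
}

// Stub standing in for the real ElevenLabs adapter. It returns placeholder
// bytes (not valid audio) so the sketch runs without network access.
const elevenLabsStub: TtsProviderSketch = {
  name: "elevenlabs",
  async synthesize(text: string): Promise<ArrayBuffer> {
    return new ArrayBuffer(text.length);
  },
};

// A null value marks the "disabled" provider.
const providers = new Map<string, TtsProviderSketch | null>([
  ["elevenlabs", elevenLabsStub],
  ["disabled", null],
]);

// Assumption: unset or unknown provider names are treated as disabled.
export function resolveProviderSketch(name: string | undefined): TtsProviderSketch | null {
  const key = (name ?? "disabled").trim().toLowerCase();
  return providers.get(key) ?? null;
}

export interface TtsResultSketch {
  status: number;
  audio: ArrayBuffer | null;
}

// Route handler core: 503 when disabled, otherwise delegate to the provider.
export async function handleTtsSketch(
  providerName: string | undefined,
  text: string,
): Promise<TtsResultSketch> {
  const provider = resolveProviderSketch(providerName);
  if (provider === null) {
    return { status: 503, audio: null };
  }
  return { status: 200, audio: await provider.synthesize(text) };
}
```

Adding a vendor under this shape means registering one more entry in the map; callers of handleTtsSketch, and therefore the UI, stay unchanged.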
Files
Storybook Demo
The TTS pipeline is designed to be multi-provider. ElevenLabs powers the current production voice clone, but the API shape and registry support additional providers without UI changes.