Code Documentation

This document provides an overview of the code in the tts package.

TextToSpeechNode

The TextToSpeechNode class is a ROS2 node that acts as a client to an OpenAI-compatible TTS server. We use OrpheusTTS, whose models are distributed in the GGUF format, so they can be served by Llama.CPP. Llama.CPP is usually used for LLMs, but it can also act as a TTS server. Note: Because Llama.CPP does not support TTS natively, the node leverages the SNAC (Scalable Neural Audio Codec) model for audio decoding and supports streaming audio generation for real-time text-to-speech conversion.

Parameters

The node exposes the following ROS2 parameters:

| Parameter | Type | Description | Default Value |
| --- | --- | --- | --- |
| server_url | string | The URL of the Llama.CPP server's completions endpoint for TTS inference. | http://localhost:8080/v1/completions |
| en_model | string | Model identifier for English TTS. | en |
| en_voice | string | Voice profile to use for English text-to-speech. | leah |
| en_max_tokens | integer | Maximum number of tokens to generate for English TTS. | 10240 |
| en_temperature | double | Controls randomness in English TTS generation. Higher values increase creativity. | 0.6 |
| en_top_p | double | Nucleus sampling parameter for English TTS. Controls diversity of token selection. | 0.9 |
| en_repeat_penalty | double | Penalty for token repetition in English TTS to encourage more varied output. | 1.1 |
| de_model | string | Model identifier for German TTS. | de |
| de_voice | string | Voice profile to use for German text-to-speech. | max |
| de_max_tokens | integer | Maximum number of tokens to generate for German TTS. | 10240 |
| de_temperature | double | Controls randomness in German TTS generation. Higher values increase creativity. | 0.6 |
| de_top_p | double | Nucleus sampling parameter for German TTS. Controls diversity of token selection. | 0.9 |
| de_repeat_penalty | double | Penalty for token repetition in German TTS to encourage more varied output. | 1.1 |
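
For illustration, a node could declare the parameters in the table above with rclpy roughly as follows (a minimal sketch, not necessarily the node's actual code):

```python
from rclpy.node import Node


class TextToSpeechNode(Node):
    def __init__(self):
        super().__init__("text_to_speech_node")
        # Endpoint of the Llama.CPP server's OpenAI-compatible completions API
        self.declare_parameter("server_url", "http://localhost:8080/v1/completions")
        # One parameter set per supported language prefix ("en", "de")
        for lang, voice in (("en", "leah"), ("de", "max")):
            self.declare_parameter(f"{lang}_model", lang)
            self.declare_parameter(f"{lang}_voice", voice)
            self.declare_parameter(f"{lang}_max_tokens", 10240)
            self.declare_parameter(f"{lang}_temperature", 0.6)
            self.declare_parameter(f"{lang}_top_p", 0.9)
            self.declare_parameter(f"{lang}_repeat_penalty", 1.1)
```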

SNAC Model Initialization

The node initializes the SNAC (Scalable Neural Audio Codec) model during startup. SNAC is used to decode the audio tokens generated by the TTS model into raw audio data. The model automatically selects CUDA if available, otherwise falls back to CPU processing.

Services

The node provides one main service:

/tts

  • Type: ric_messages/srv/TextToAudioBytes
  • Description: This is the main service for converting text to audio. It takes text input and a language specification, then returns the generated audio as WAV-formatted bytes.
  • Request:
    • text (string): The text to convert to speech.
    • language (string): The target language for synthesis. Supports "english"/"en" and "german"/"de".
  • Response:
    • audio (bytes): The generated audio data in WAV format, ready for playback or further processing.
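
A minimal client sketch, assuming the ric_messages package is built and sourced (the node and function names here are illustrative, not part of the package):

```python
import rclpy
from rclpy.node import Node
from ric_messages.srv import TextToAudioBytes


class TtsClient(Node):
    def __init__(self):
        super().__init__("tts_client")
        self.client = self.create_client(TextToAudioBytes, "/tts")

    def synthesize(self, text: str, language: str) -> bytes:
        self.client.wait_for_service()
        request = TextToAudioBytes.Request()
        request.text = text
        request.language = language  # "english"/"en" or "german"/"de"
        future = self.client.call_async(request)
        rclpy.spin_until_future_complete(self, future)
        return future.result().audio  # WAV-formatted bytes


def main():
    rclpy.init()
    node = TtsClient()
    wav_bytes = node.synthesize("Hello world", "en")
    with open("speech.wav", "wb") as f:
        f.write(bytes(wav_bytes))
    node.destroy_node()
    rclpy.shutdown()
```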

How it Works

  1. Initialization: The node starts, declares its parameters for both English and German TTS, initializes the SNAC model, and creates the TTS service.

  2. Service Call: Another ROS2 node calls the /tts service with text and language parameters.

  3. Language Processing: The text_to_speech_callback is triggered. It normalizes the language parameter (converting "english" to "en" and "german" to "de") and validates that the language is supported.

  4. Parameter Retrieval: The node retrieves the appropriate model parameters based on the requested language (model name, voice, temperature, etc.).

  5. Prompt Building: The text is formatted using the build_prompt helper function, which wraps the input text with the appropriate voice tags and special tokens required by the OrpheusTTS model.

  6. Streaming Generation: The node sends a streaming request to the Llama.CPP server via _generate_response() (see the sketch after this list):
     • Sends an HTTP POST request with the formatted prompt and generation parameters
     • Processes the server-sent events (SSE) stream response
     • Filters the response to extract only tokens containing audio data (custom tokens)

  7. Real-time Audio Decoding: As audio tokens are generated:
     • The tokens_decoder_sync function processes the token stream
     • Tokens are converted to audio codes and passed to the SNAC model
     • The SNAC model decodes the codes into raw audio samples
     • Audio samples are converted to 16-bit PCM format and yielded as byte chunks

  8. WAV File Assembly:
     • A WAV header is created using create_wav_header()
     • Audio byte chunks are collected and combined with the header
     • The complete WAV file is returned as the service response

  9. Error Handling: The node includes comprehensive error handling for network issues, invalid responses, unsupported languages, and audio generation failures.
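
The streaming request in step 6 might look roughly like this (a sketch, assuming the Llama.CPP server's OpenAI-compatible SSE format; the helper names come from this document, the rest is illustrative):

```python
import json

import requests

from tts.helper import build_prompt, string_contains_token


def _generate_response(server_url, text, voice, model, max_tokens,
                       temperature, top_p, repeat_penalty):
    payload = {
        "model": model,
        "prompt": build_prompt(voice, text),
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "repeat_penalty": repeat_penalty,
        "stream": True,  # request server-sent events
    }
    with requests.post(server_url, json=payload, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines(decode_unicode=True):
            # SSE payload lines are prefixed with "data: "
            if not line or not line.startswith("data: "):
                continue
            data = line[len("data: "):]
            if data == "[DONE]":  # end-of-stream sentinel
                break
            chunk = json.loads(data)["choices"][0]["text"]
            # Keep only chunks that carry audio (custom) tokens
            if string_contains_token(chunk):
                yield chunk
```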

Key Features

  • Streaming Audio Generation: Audio is generated and returned in real-time as tokens are produced, enabling low-latency TTS.
  • Multi-language Support: Supports both English and German with separate parameter sets for each language; other languages available in OrpheusTTS can be configured as well.
  • Flexible Voice Selection: Different voice profiles can be configured for each language.
  • SNAC Audio Decoding: Uses state-of-the-art neural audio codec for high-quality audio synthesis.
  • WAV Format Output: Returns standard WAV-formatted audio compatible with most audio systems.
  • Robust Error Handling: Comprehensive error checking and logging throughout the pipeline.

Helper Module (helper.py)

The helper module provides essential utility functions for the TTS system, handling prompt formatting, WAV file creation, and token validation.

Functions

string_contains_token(string: str) -> bool

  • Description: Checks if a string contains any custom audio token using regex pattern matching.
  • Parameters:
    • string (str): The input string to check for custom tokens
  • Returns: bool - True if the string contains custom tokens, False otherwise
  • Usage: Used to filter streaming responses and identify chunks containing audio data
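
A minimal sketch of what this check amounts to, based on the AUDIO_TOKENS_REGEX constant listed under Constants below:

```python
import re

# Matches OrpheusTTS custom audio tokens, e.g. "<custom_token_1234>"
AUDIO_TOKENS_REGEX = re.compile(r"<custom_token_(\d+)>")


def string_contains_token(string: str) -> bool:
    # True if the chunk carries at least one audio token
    return AUDIO_TOKENS_REGEX.search(string) is not None
```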

build_prompt(voice: str, prompt: str) -> str

  • Description: Constructs the properly formatted prompt string required by the OrpheusTTS model, wrapping the input text with voice tags and special tokens.
  • Parameters:
    • voice (str): The voice profile to use (e.g., "leah", "max")
    • prompt (str): The text content to be converted to speech
  • Returns: str - The formatted prompt string with OrpheusTTS-specific tokens
  • Format: <custom_token_3>{voice}: {prompt}<|eot_id|><custom_token_4><custom_token_5><custom_token_1>
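
Based on the format string above, a minimal sketch:

```python
def build_prompt(voice: str, prompt: str) -> str:
    # Wrap the text in the voice tag and the OrpheusTTS control tokens
    return (
        f"<custom_token_3>{voice}: {prompt}"
        "<|eot_id|><custom_token_4><custom_token_5><custom_token_1>"
    )
```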

create_wav_header(sample_rate=24000, bits_per_sample=16, channels=1)

  • Description: Creates a standard WAV file header with the specified audio parameters. This function is adapted from the OrpheusTTS project.
  • Parameters:
    • sample_rate (int): Audio sample rate in Hz (default: 24000)
    • bits_per_sample (int): Bit depth of audio samples (default: 16)
    • channels (int): Number of audio channels (default: 1 for mono)
  • Returns: bytes - The WAV header as a byte string
  • Technical Details: Uses struct.pack to create a proper RIFF/WAVE header format
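
A sketch of such a header builder; the data_size parameter is an addition here for illustration, since the final size is not known up front when streaming (a common convention is to leave it at 0 or patch it afterwards):

```python
import struct


def create_wav_header(sample_rate=24000, bits_per_sample=16, channels=1,
                      data_size=0):
    # Derived header fields
    byte_rate = sample_rate * channels * bits_per_sample // 8
    block_align = channels * bits_per_sample // 8
    # Standard 44-byte RIFF/WAVE header for uncompressed PCM
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + data_size,   # RIFF chunk size
        b"WAVE",
        b"fmt ", 16,               # fmt subchunk size (16 for PCM)
        1,                         # audio format: 1 = PCM
        channels, sample_rate, byte_rate, block_align, bits_per_sample,
        b"data", data_size,        # data subchunk size
    )
```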

Constants

  • AUDIO_TOKENS_REGEX: Regular expression pattern r"<custom_token_(\d+)>" used to identify custom audio tokens in the streaming response

Decoder Module (decoder.py)

The decoder module handles the conversion of TTS model tokens into actual audio using the SNAC (Scalable Neural Audio Codec) model.

This module is adapted from the OrpheusTTS project and serves as a temporary solution until Llama.CPP gains native TTS support.

Global Variables

  • model: The global SNAC model instance used for audio decoding
  • snac_device: The device (CPU/CUDA) where the SNAC model is loaded

Functions

initialize_snac_model()

  • Description: Initializes the global SNAC model for audio decoding. Automatically detects and uses CUDA if available, otherwise falls back to CPU.
  • Device Selection: Uses the SNAC_DEVICE environment variable or auto-detects the best available device
  • Model: Loads the pre-trained SNAC model from "hubertsiuzdak/snac_24khz"
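
A minimal sketch of this initialization, assuming the snac package's published SNAC.from_pretrained API:

```python
import os

import torch
from snac import SNAC

model = None
snac_device = None


def initialize_snac_model():
    global model, snac_device
    # Honor SNAC_DEVICE if set, otherwise prefer CUDA when available
    snac_device = os.environ.get(
        "SNAC_DEVICE", "cuda" if torch.cuda.is_available() else "cpu"
    )
    model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
    model = model.to(snac_device)
```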

convert_to_audio(multiframe)

  • Description: Converts a sequence of audio codes into raw audio bytes using the SNAC model.
  • Parameters:
    • multiframe: List of audio codes representing frames to be decoded
  • Returns: bytes - Raw audio data in 16-bit PCM format, or None if conversion fails
  • Process:
    • Validates that the multiframe contains at least 7 codes
    • Organizes codes into three hierarchical levels (codes_0, codes_1, codes_2)
    • Performs bounds checking to ensure all codes are within the valid range (0-4096)
    • Uses the SNAC model to decode the codes into an audio waveform
    • Converts float audio to 16-bit integer format and returns as bytes
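
A sketch of the process above, following the 7-codes-per-frame split used by OrpheusTTS-style decoders (the exact level layout and output slicing are assumptions; model and snac_device are the module globals described earlier):

```python
import torch


def convert_to_audio(multiframe):
    if len(multiframe) < 7:
        return None
    frames = len(multiframe) // 7
    codes_0, codes_1, codes_2 = [], [], []
    for j in range(frames):
        f = multiframe[7 * j : 7 * j + 7]
        codes_0.append(f[0])                      # coarsest level: 1 code per frame
        codes_1.extend([f[1], f[4]])              # middle level: 2 codes per frame
        codes_2.extend([f[2], f[3], f[5], f[6]])  # finest level: 4 codes per frame
    codes = [
        torch.tensor(c, device=snac_device).unsqueeze(0)
        for c in (codes_0, codes_1, codes_2)
    ]
    # Bounds check: reject out-of-range codes before decoding
    if any(torch.any(c < 0) or torch.any(c > 4096) for c in codes):
        return None
    with torch.inference_mode():
        audio = model.decode(codes)  # float waveform in [-1, 1]
    # Keep only the newest samples of the overlapping window (slice assumed),
    # then convert float samples to 16-bit PCM bytes
    audio = audio[:, :, 2048:4096]
    return (audio.squeeze().cpu().numpy() * 32767).astype("int16").tobytes()
```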

turn_token_into_id(token_string, index)

  • Description: Extracts and converts custom tokens from the streaming response into audio code IDs.
  • Parameters:
    • token_string (str): String containing the custom token
    • index (int): Current position in the token sequence
  • Returns: int - The audio code ID, or None if parsing fails
  • Logic: Extracts the numeric part from custom tokens and applies mathematical transformation based on the index position
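
A sketch of this logic; the specific offsets (subtracting 10 plus a 4096-wide band per frame position) follow the OrpheusTTS reference decoder and are assumptions here:

```python
from tts.helper import AUDIO_TOKENS_REGEX


def turn_token_into_id(token_string, index):
    match = AUDIO_TOKENS_REGEX.search(token_string.strip())
    if match is None:
        return None  # chunk carried no audio token
    # Shift the raw token number into codebook space; each of the 7
    # positions in a frame occupies its own 4096-wide ID band.
    return int(match.group(1)) - 10 - ((index % 7) * 4096)
```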

tokens_decoder(token_gen) (Async)

  • Description: Asynchronous generator that processes a stream of tokens and yields audio chunks in real-time.
  • Parameters:
    • token_gen: Async generator yielding token strings
  • Yields: bytes - Audio data chunks as they become available
  • Buffering Strategy:
    • Maintains a buffer of audio codes
    • Processes codes in groups of 7 (representing one audio frame)
    • Yields audio when buffer contains at least 28 codes (4 frames)
    • Uses overlapping windows for smooth audio generation
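
A sketch of this buffering strategy as an async generator (details such as the token-validity check are assumptions):

```python
async def tokens_decoder(token_gen):
    buffer = []
    count = 0
    async for token_text in token_gen:
        token = turn_token_into_id(token_text, count)
        if token is None or token <= 0:
            continue  # skip chunks without a usable audio code
        buffer.append(token)
        count += 1
        # Once a full frame has arrived and at least 4 frames are buffered,
        # decode the trailing 28-code window for smooth, overlapping output
        if count % 7 == 0 and count > 27:
            audio = convert_to_audio(buffer[-28:])
            if audio is not None:
                yield audio
```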

tokens_decoder_sync(syn_token_gen)

  • Description: Synchronous wrapper around the async tokens_decoder function, enabling integration with synchronous code.
  • Parameters:
    • syn_token_gen: Synchronous generator yielding token strings
  • Yields: bytes - Audio data chunks
  • Implementation:
    • Converts synchronous generator to async generator
    • Runs async decoder in a separate thread
    • Uses a queue to bridge async and sync worlds
    • Returns audio chunks as they become available
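
A sketch of this bridge (the thread, queue, and sentinel handling are assumptions about the exact implementation):

```python
import asyncio
import queue
import threading


def tokens_decoder_sync(syn_token_gen):
    audio_queue = queue.Queue()

    async def async_token_gen():
        # Lift the synchronous generator into an async one
        for token in syn_token_gen:
            yield token

    async def producer():
        # Run the async decoder and hand chunks to the sync side
        async for chunk in tokens_decoder(async_token_gen()):
            audio_queue.put(chunk)
        audio_queue.put(None)  # sentinel: stream finished

    thread = threading.Thread(target=lambda: asyncio.run(producer()))
    thread.start()
    while (chunk := audio_queue.get()) is not None:
        yield chunk
    thread.join()
```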