# Code Documentation

This document provides an overview of the code of the `tts` package.
## TextToSpeechNode

The `TextToSpeechNode` class is a ROS2 node that acts as a client to an OpenAI-compatible TTS server.
We use OrpheusTTS, whose models are distributed in the GGUF format. This makes them usable with Llama.CPP, which is typically used for LLMs but can also serve as a TTS server here.

Note: Because Llama.CPP does not support TTS natively, the node leverages the SNAC (Multi-Scale Neural Audio Codec) model for audio decoding and supports streaming audio generation for real-time text-to-speech conversion.
### Parameters

The node exposes the following ROS2 parameters:
| Parameter | Type | Description | Default Value |
|---|---|---|---|
| `server_url` | string | The URL of the Llama.CPP server's completions endpoint for TTS inference. | `http://localhost:8080/v1/completions` |
| `en_model` | string | Model identifier for English TTS. | `en` |
| `en_voice` | string | Voice profile to use for English text-to-speech. | `leah` |
| `en_max_tokens` | integer | Maximum number of tokens to generate for English TTS. | `10240` |
| `en_temperature` | double | Controls randomness in English TTS generation. Higher values increase creativity. | `0.6` |
| `en_top_p` | double | Nucleus sampling parameter for English TTS. Controls diversity of token selection. | `0.9` |
| `en_repeat_penalty` | double | Penalty for token repetition in English TTS to encourage more varied output. | `1.1` |
| `de_model` | string | Model identifier for German TTS. | `de` |
| `de_voice` | string | Voice profile to use for German text-to-speech. | `max` |
| `de_max_tokens` | integer | Maximum number of tokens to generate for German TTS. | `10240` |
| `de_temperature` | double | Controls randomness in German TTS generation. Higher values increase creativity. | `0.6` |
| `de_top_p` | double | Nucleus sampling parameter for German TTS. Controls diversity of token selection. | `0.9` |
| `de_repeat_penalty` | double | Penalty for token repetition in German TTS to encourage more varied output. | `1.1` |
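For reference, a minimal sketch of how such parameters are typically declared in an rclpy node. The node name and the exact declaration style are assumptions for illustration, not taken from the package; only the English parameters are shown, as the German `de_*` parameters follow the same pattern:

```python
import rclpy
from rclpy.node import Node


class TextToSpeechNode(Node):
    def __init__(self):
        super().__init__("text_to_speech_node")  # node name is an assumption
        # Server endpoint used for all TTS requests.
        self.declare_parameter("server_url", "http://localhost:8080/v1/completions")
        # Per-language generation settings (English shown here).
        self.declare_parameter("en_model", "en")
        self.declare_parameter("en_voice", "leah")
        self.declare_parameter("en_max_tokens", 10240)
        self.declare_parameter("en_temperature", 0.6)
        self.declare_parameter("en_top_p", 0.9)
        self.declare_parameter("en_repeat_penalty", 1.1)
```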
### SNAC Model Initialization

The node initializes the SNAC (Multi-Scale Neural Audio Codec) model during startup. SNAC is used to decode the audio tokens generated by the TTS model into raw audio data. The model automatically selects CUDA if available, otherwise falls back to CPU processing.
### Services

The node provides one main service:

#### `/tts`

- Type: `ric_messages/srv/TextToAudioBytes`
- Description: This is the main service for converting text to audio. It takes text input and a language specification, then returns the generated audio as WAV-formatted bytes. A client sketch follows this list.
- Request:
  - `text` (string): The text to convert to speech.
  - `language` (string): The target language for synthesis. Supports "english"/"en" and "german"/"de".
- Response:
  - `audio` (bytes): The generated audio data in WAV format, ready for playback or further processing.
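A minimal sketch of calling the service from another rclpy node; the client node name and output path are illustrative:

```python
import rclpy
from rclpy.node import Node
from ric_messages.srv import TextToAudioBytes

rclpy.init()
node = Node("tts_client_example")  # illustrative node name
client = node.create_client(TextToAudioBytes, "/tts")
client.wait_for_service()

request = TextToAudioBytes.Request()
request.text = "Hello from the TTS node!"
request.language = "en"

future = client.call_async(request)
rclpy.spin_until_future_complete(node, future)

# The response carries a complete WAV file as bytes.
with open("/tmp/output.wav", "wb") as f:
    f.write(bytes(future.result().audio))

node.destroy_node()
rclpy.shutdown()
```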
### How it Works

1. Initialization: The node starts, declares its parameters for both English and German TTS, initializes the SNAC model, and creates the TTS service.
2. Service Call: Another ROS2 node calls the `/tts` service with text and language parameters.
3. Language Processing: The `text_to_speech_callback` is triggered. It normalizes the language parameter (converting "english" to "en" and "german" to "de") and validates that the language is supported.
4. Parameter Retrieval: The node retrieves the appropriate model parameters based on the requested language (model name, voice, temperature, etc.).
5. Prompt Building: The text is formatted using the `build_prompt` helper function, which wraps the input text with the appropriate voice tags and special tokens required by the OrpheusTTS model.
6. Streaming Generation: The node sends a streaming request to the Llama.CPP server via `_generate_response()` (see the sketch after this list):
   - Sends an HTTP POST request with the formatted prompt and generation parameters
   - Processes the server-sent events (SSE) stream response
   - Filters the response to extract only tokens containing audio data (custom tokens)
7. Real-time Audio Decoding: As audio tokens are generated:
   - The `tokens_decoder_sync` function processes the token stream
   - Tokens are converted to audio codes and passed to the SNAC model
   - The SNAC model decodes the codes into raw audio samples
   - Audio samples are converted to 16-bit PCM format and yielded as byte chunks
8. WAV File Assembly:
   - A WAV header is created using `create_wav_header()`
   - Audio byte chunks are collected and combined with the header
   - The complete WAV file is returned as the service response
9. Error Handling: The node includes comprehensive error handling for network issues, invalid responses, unsupported languages, and audio generation failures.
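A sketch of what the streaming request in step 6 might look like. The actual signature of `_generate_response()` is not shown in this document, so the function below is an illustrative stand-in using the `requests` library and the generation fields named in the parameters table:

```python
import json
import requests


def generate_response(server_url, prompt, model, max_tokens,
                      temperature, top_p, repeat_penalty):
    """Stream completion tokens from a Llama.CPP completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "repeat_penalty": repeat_penalty,
        "stream": True,
    }
    with requests.post(server_url, json=payload, stream=True, timeout=600) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # SSE frames look like: "data: {...json...}" or "data: [DONE]"
            if not line or not line.startswith("data: "):
                continue
            data = line[len("data: "):]
            if data.strip() == "[DONE]":
                break
            token_text = json.loads(data)["choices"][0]["text"]
            # Only forward chunks that actually carry audio tokens.
            if "<custom_token_" in token_text:
                yield token_text
```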
### Key Features

- Streaming Audio Generation: Audio is generated and returned in real-time as tokens are produced, enabling low-latency TTS.
- Multi-language Support: Supports both English and German with separate parameter sets for each language; other languages available in OrpheusTTS can be configured as well.
- Flexible Voice Selection: Different voice profiles can be configured for each language.
- SNAC Audio Decoding: Uses a state-of-the-art neural audio codec for high-quality audio synthesis.
- WAV Format Output: Returns standard WAV-formatted audio compatible with most audio systems.
- Robust Error Handling: Comprehensive error checking and logging throughout the pipeline.
## Helper Module (`helper.py`)

The helper module provides essential utility functions for the TTS system, handling prompt formatting, WAV file creation, and token validation.
### Functions
#### `string_contains_token(string: str) -> bool`

- Description: Checks whether a string contains any custom audio token using regex pattern matching.
- Parameters:
  - `string` (str): The input string to check for custom tokens
- Returns: `bool` - True if the string contains custom tokens, False otherwise
- Usage: Used to filter streaming responses and identify chunks containing audio data
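Given the `AUDIO_TOKENS_REGEX` pattern documented under Constants below, a sketch of this check:

```python
import re

# Pattern documented under "Constants" below.
AUDIO_TOKENS_REGEX = re.compile(r"<custom_token_(\d+)>")


def string_contains_token(string: str) -> bool:
    """Return True if the string contains at least one custom audio token."""
    return AUDIO_TOKENS_REGEX.search(string) is not None
```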
#### `build_prompt(voice: str, prompt: str) -> str`

- Description: Constructs the properly formatted prompt string required by the OrpheusTTS model, wrapping the input text with voice tags and special tokens.
- Parameters:
  - `voice` (str): The voice profile to use (e.g., "leah", "max")
  - `prompt` (str): The text content to be converted to speech
- Returns: `str` - The formatted prompt string with OrpheusTTS-specific tokens
- Format: `<custom_token_3>{voice}: {prompt}<|eot_id|><custom_token_4><custom_token_5><custom_token_1>`
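Since the format is fully specified above, the implementation is likely close to this one-liner:

```python
def build_prompt(voice: str, prompt: str) -> str:
    """Wrap the text with the voice tag and OrpheusTTS special tokens."""
    return (
        f"<custom_token_3>{voice}: {prompt}<|eot_id|>"
        "<custom_token_4><custom_token_5><custom_token_1>"
    )
```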
#### `create_wav_header(sample_rate=24000, bits_per_sample=16, channels=1)`

- Description: Creates a standard WAV file header with the specified audio parameters. This function is adapted from the OrpheusTTS project.
- Parameters:
  - `sample_rate` (int): Audio sample rate in Hz (default: 24000)
  - `bits_per_sample` (int): Bit depth of audio samples (default: 16)
  - `channels` (int): Number of audio channels (default: 1 for mono)
- Returns: `bytes` - The WAV header as a byte string
- Technical Details: Uses `struct.pack` to create a proper RIFF/WAVE header format
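A sketch of such a header builder. Leaving the chunk sizes at zero is an assumption for this sketch, since the total length of the streamed audio is not known when the header is written; many players tolerate this and read until EOF:

```python
import struct


def create_wav_header(sample_rate=24000, bits_per_sample=16, channels=1):
    """Build a 44-byte RIFF/WAVE header for PCM audio."""
    byte_rate = sample_rate * channels * bits_per_sample // 8
    block_align = channels * bits_per_sample // 8
    data_size = 0  # unknown at streaming time (assumption for this sketch)
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + data_size, b"WAVE",
        b"fmt ", 16,              # fmt chunk size for PCM
        1,                        # audio format: PCM
        channels, sample_rate, byte_rate, block_align, bits_per_sample,
        b"data", data_size,
    )
```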
### Constants

- `AUDIO_TOKENS_REGEX`: Regular expression pattern `r"<custom_token_(\d+)>"` used to identify custom audio tokens in the streaming response
## Decoder Module (`decoder.py`)

The decoder module handles the conversion of TTS model tokens into actual audio using the SNAC (Multi-Scale Neural Audio Codec) model. This module is adapted from the OrpheusTTS project and serves as a temporary solution until Llama.CPP gains native TTS support.
### Global Variables

- `model`: The global SNAC model instance used for audio decoding
- `snac_device`: The device (CPU/CUDA) where the SNAC model is loaded
### Functions
#### `initialize_snac_model()`

- Description: Initializes the global SNAC model for audio decoding. Automatically detects and uses CUDA if available, otherwise falls back to CPU.
- Device Selection: Uses the `SNAC_DEVICE` environment variable or auto-detects the best available device
- Model: Loads the pre-trained SNAC model from "hubertsiuzdak/snac_24khz"
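A sketch of this initialization, assuming the `snac` Python package (which provides `SNAC.from_pretrained`) is used to load the checkpoint:

```python
import os

import torch
from snac import SNAC

model = None
snac_device = None


def initialize_snac_model():
    """Load the pre-trained SNAC codec onto the best available device."""
    global model, snac_device
    # Honor an explicit SNAC_DEVICE override, otherwise prefer CUDA.
    snac_device = os.environ.get(
        "SNAC_DEVICE", "cuda" if torch.cuda.is_available() else "cpu"
    )
    model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to(snac_device)
```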
#### `convert_to_audio(multiframe)`

- Description: Converts a sequence of audio codes into raw audio bytes using the SNAC model.
- Parameters:
  - `multiframe`: List of audio codes representing frames to be decoded
- Returns: `bytes` - Raw audio data in 16-bit PCM format, or None if conversion fails
- Process:
  - Validates that the multiframe contains at least 7 codes
  - Organizes codes into three hierarchical levels (codes_0, codes_1, codes_2)
  - Performs bounds checking to ensure all codes are within the valid range (0-4096)
  - Uses the SNAC model to decode the codes into an audio waveform
  - Converts the float audio to 16-bit integer format and returns it as bytes
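A sketch of the decode step, following the process described above. The exact interleaving of the 7 codes per frame across SNAC's three codebook levels is taken from the OrpheusTTS decoder and should be treated as an assumption here; any trimming of frame edges is omitted for clarity:

```python
import torch


def convert_to_audio(multiframe):
    """Decode groups of 7 SNAC codes into 16-bit PCM audio bytes."""
    if len(multiframe) < 7:
        return None
    num_frames = len(multiframe) // 7
    codes_0, codes_1, codes_2 = [], [], []
    for i in range(num_frames):
        f = multiframe[7 * i : 7 * i + 7]
        # One coarse code, two mid-level codes, four fine codes per frame
        # (interleaving per the OrpheusTTS decoder).
        codes_0.append(f[0])
        codes_1.extend([f[1], f[4]])
        codes_2.extend([f[2], f[3], f[5], f[6]])
    codes = [
        torch.tensor(level, dtype=torch.int32, device=snac_device).unsqueeze(0)
        for level in (codes_0, codes_1, codes_2)
    ]
    # Reject out-of-range codes rather than feeding them to the decoder.
    if any(((c < 0) | (c > 4096)).any() for c in codes):
        return None
    with torch.inference_mode():
        audio = model.decode(codes)  # float waveform in [-1, 1]
    audio_int16 = (audio.squeeze() * 32767).to(torch.int16)
    return audio_int16.cpu().numpy().tobytes()
```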
#### `turn_token_into_id(token_string, index)`

- Description: Extracts and converts custom tokens from the streaming response into audio code IDs.
- Parameters:
  - `token_string` (str): String containing the custom token
  - `index` (int): Current position in the token sequence
- Returns: `int` - The audio code ID, or None if parsing fails
- Logic: Extracts the numeric part from custom tokens and applies a mathematical transformation based on the index position
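A sketch of this transformation. The specific offsets (the constant `10` and the per-position shift of `4096`, i.e. the codebook size) follow the OrpheusTTS decoder and are assumptions here:

```python
import re


def turn_token_into_id(token_string, index):
    """Map a <custom_token_N> string to an audio code ID, or None on failure."""
    match = re.search(r"<custom_token_(\d+)>", token_string.strip())
    if match is None:
        return None
    number = int(match.group(1))
    # Undo the token offset and the per-position codebook shift:
    # each of the 7 positions in a frame occupies its own 4096-code block.
    return number - 10 - ((index % 7) * 4096)
```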
#### `tokens_decoder(token_gen)` (Async)

- Description: Asynchronous generator that processes a stream of tokens and yields audio chunks in real-time.
- Parameters:
  - `token_gen`: Async generator yielding token strings
- Yields: `bytes` - Audio data chunks as they become available
- Buffering Strategy:
  - Maintains a buffer of audio codes
  - Processes codes in groups of 7 (representing one audio frame)
  - Yields audio once the buffer contains at least 28 codes (4 frames)
  - Uses overlapping windows for smooth audio generation
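A sketch of the buffering loop described above, assuming the sliding window is the last 28 codes (per the OrpheusTTS decoder) and that non-positive code IDs are skipped:

```python
async def tokens_decoder(token_gen):
    """Yield PCM audio chunks as enough codes accumulate in the buffer."""
    buffer = []
    count = 0
    async for token_text in token_gen:
        token = turn_token_into_id(token_text, count)
        if token is None or token <= 0:
            continue
        buffer.append(token)
        count += 1
        # Decode on frame boundaries once at least 4 frames are buffered.
        if count % 7 == 0 and count > 27:
            audio = convert_to_audio(buffer[-28:])
            if audio is not None:
                yield audio
```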
#### `tokens_decoder_sync(syn_token_gen)`

- Description: Synchronous wrapper around the async `tokens_decoder` function, enabling integration with synchronous code.
- Parameters:
  - `syn_token_gen`: Synchronous generator yielding token strings
- Yields: `bytes` - Audio data chunks
- Implementation:
  - Converts the synchronous generator to an async generator
  - Runs the async decoder in a separate thread
  - Uses a queue to bridge the async and sync worlds
  - Returns audio chunks as they become available
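A sketch of this bridge, using a sentinel value on the queue to signal the end of the stream:

```python
import asyncio
import queue
import threading


def tokens_decoder_sync(syn_token_gen):
    """Run the async decoder in a worker thread and yield chunks synchronously."""
    audio_queue = queue.Queue()

    async def async_token_gen():
        # Lift the synchronous generator into an async one.
        for token in syn_token_gen:
            yield token

    async def producer():
        async for chunk in tokens_decoder(async_token_gen()):
            audio_queue.put(chunk)
        audio_queue.put(None)  # sentinel: no more audio

    thread = threading.Thread(target=lambda: asyncio.run(producer()), daemon=True)
    thread.start()
    while True:
        chunk = audio_queue.get()
        if chunk is None:
            break
        yield chunk
    thread.join()
```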