Code Documentation

This document provides an overview of the code in the tts package.

TextToSpeechNode

The TextToSpeechNode class is a ROS2 node that acts as a client to an OpenAI-compatible TTS server. We use OrpheusTTS, whose models are distributed in the GGUF format, so they can be served by Llama.CPP. Llama.CPP is usually used for LLMs, but it can also act as a TTS server. Note: Because Llama.CPP does not support TTS natively, the node leverages the SNAC (Scalable Neural Audio Codec) model for audio decoding and supports streaming audio generation for real-time text-to-speech conversion.

Parameters

The node exposes the following ROS2 parameters:

| Parameter | Type | Description | Default Value |
| --- | --- | --- | --- |
| server_url | string | The URL of the Llama.CPP server's completions endpoint for TTS inference. | http://localhost:8080/v1/completions |
| en_model | string | Model identifier for English TTS. | en |
| en_voice | string | Voice profile to use for English text-to-speech. | leah |
| en_max_tokens | integer | Maximum number of tokens to generate for English TTS. | 10240 |
| en_temperature | double | Controls randomness in English TTS generation. Higher values increase creativity. | 0.6 |
| en_top_p | double | Nucleus sampling parameter for English TTS. Controls diversity of token selection. | 0.9 |
| en_repeat_penalty | double | Penalty for token repetition in English TTS to encourage more varied output. | 1.1 |
| de_model | string | Model identifier for German TTS. | de |
| de_voice | string | Voice profile to use for German text-to-speech. | max |
| de_max_tokens | integer | Maximum number of tokens to generate for German TTS. | 10240 |
| de_temperature | double | Controls randomness in German TTS generation. Higher values increase creativity. | 0.6 |
| de_top_p | double | Nucleus sampling parameter for German TTS. Controls diversity of token selection. | 0.9 |
| de_repeat_penalty | double | Penalty for token repetition in German TTS to encourage more varied output. | 1.1 |
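
For illustration, a node could declare the parameters in the table above with rclpy roughly as follows (a minimal sketch, not necessarily the node's actual code):

```python
from rclpy.node import Node


class TextToSpeechNode(Node):
    def __init__(self):
        super().__init__("text_to_speech_node")
        # Endpoint of the Llama.CPP server's OpenAI-compatible completions API
        self.declare_parameter("server_url", "http://localhost:8080/v1/completions")
        # One parameter set per supported language prefix ("en", "de")
        for lang, voice in (("en", "leah"), ("de", "max")):
            self.declare_parameter(f"{lang}_model", lang)
            self.declare_parameter(f"{lang}_voice", voice)
            self.declare_parameter(f"{lang}_max_tokens", 10240)
            self.declare_parameter(f"{lang}_temperature", 0.6)
            self.declare_parameter(f"{lang}_top_p", 0.9)
            self.declare_parameter(f"{lang}_repeat_penalty", 1.1)
```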

SNAC Model Initialization

The node initializes the SNAC (Scalable Neural Audio Codec) model during startup. SNAC is used to decode the audio tokens generated by the TTS model into raw audio data. The model automatically selects CUDA if available, otherwise falls back to CPU processing.

Services

The node provides one main service:

/tts

  • Type: ric_messages/srv/TextToAudioBytes
  • Description: This is the main service for converting text to audio. It takes text input and a language specification, then returns the generated audio as WAV-formatted bytes.
  • Request:
    • text (string): The text to convert to speech.
    • language (string): The target language for synthesis. Supports "english"/"en" and "german"/"de".
  • Response:
    • audio (bytes): The generated audio data in WAV format, ready for playback or further processing.
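
A minimal client sketch, assuming the ric_messages package is built and sourced (the node and function names here are illustrative, not part of the package):

```python
import rclpy
from rclpy.node import Node
from ric_messages.srv import TextToAudioBytes


class TtsClient(Node):
    def __init__(self):
        super().__init__("tts_client")
        self.client = self.create_client(TextToAudioBytes, "/tts")

    def synthesize(self, text: str, language: str) -> bytes:
        self.client.wait_for_service()
        request = TextToAudioBytes.Request()
        request.text = text
        request.language = language  # "english"/"en" or "german"/"de"
        future = self.client.call_async(request)
        rclpy.spin_until_future_complete(self, future)
        return future.result().audio  # WAV-formatted bytes


def main():
    rclpy.init()
    node = TtsClient()
    wav_bytes = node.synthesize("Hello world", "en")
    with open("speech.wav", "wb") as f:
        f.write(bytes(wav_bytes))
    node.destroy_node()
    rclpy.shutdown()
```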

How it Works

  1. Initialization: The node starts, declares its parameters for both English and German TTS, initializes the SNAC model, and creates the TTS service.

  2. Service Call: Another ROS2 node calls the /tts service with text and language parameters.

  3. Language Processing: The text_to_speech_callback is triggered. It normalizes the language parameter (converting "english" to "en" and "german" to "de") and validates that the language is supported.

  4. Parameter Retrieval: The node retrieves the appropriate model parameters based on the requested language (model name, voice, temperature, etc.).

  5. Prompt Building: The text is formatted using the build_prompt helper function, which wraps the input text with the appropriate voice tags and special tokens required by the OrpheusTTS model.

  6. Streaming Generation: The node sends a streaming request to the Llama.CPP server via _generate_response() (see the sketch after this list):
     • Sends an HTTP POST request with the formatted prompt and generation parameters
     • Processes the server-sent events (SSE) stream response
     • Filters the response to extract only tokens containing audio data (custom tokens)

  7. Real-time Audio Decoding: As audio tokens are generated:
     • The tokens_decoder_sync function processes the token stream
     • Tokens are converted to audio codes and passed to the SNAC model
     • The SNAC model decodes the codes into raw audio samples
     • Audio samples are converted to 16-bit PCM format and yielded as byte chunks

  8. WAV File Assembly:
     • A WAV header is created using create_wav_header()
     • Audio byte chunks are collected and combined with the header
     • The complete WAV file is returned as the service response

  9. Error Handling: The node includes comprehensive error handling for network issues, invalid responses, unsupported languages, and audio generation failures.
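
The streaming request in step 6 might look roughly like this (a sketch, assuming the Llama.CPP server's OpenAI-compatible SSE format; the helper names come from this document, the rest is illustrative):

```python
import json

import requests

from tts.helper import build_prompt, string_contains_token


def _generate_response(server_url, text, voice, model, max_tokens,
                       temperature, top_p, repeat_penalty):
    payload = {
        "model": model,
        "prompt": build_prompt(voice, text),
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "repeat_penalty": repeat_penalty,
        "stream": True,  # request server-sent events
    }
    with requests.post(server_url, json=payload, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines(decode_unicode=True):
            # SSE payload lines are prefixed with "data: "
            if not line or not line.startswith("data: "):
                continue
            data = line[len("data: "):]
            if data == "[DONE]":  # end-of-stream sentinel
                break
            chunk = json.loads(data)["choices"][0]["text"]
            # Keep only chunks that carry audio (custom) tokens
            if string_contains_token(chunk):
                yield chunk
```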

Key Features

  • Streaming Audio Generation: Audio is generated and returned in real-time as tokens are produced, enabling low-latency TTS.
  • Multi-language Support: Supports both English and German with separate parameter sets for each language; other languages available in OrpheusTTS can be configured as well.
  • Flexible Voice Selection: Different voice profiles can be configured for each language.
  • SNAC Audio Decoding: Uses state-of-the-art neural audio codec for high-quality audio synthesis.
  • WAV Format Output: Returns standard WAV-formatted audio compatible with most audio systems.
  • Robust Error Handling: Comprehensive error checking and logging throughout the pipeline.

Helper Module (helper.py)

The helper module provides essential utility functions for the TTS system, handling prompt formatting, WAV file creation, and token validation.

Functions

string_contains_token(string: str) -> bool

  • Description: Checks if a string contains any custom audio token using regex pattern matching.
  • Parameters:
    • string (str): The input string to check for custom tokens
  • Returns: bool - True if the string contains custom tokens, False otherwise
  • Usage: Used to filter streaming responses and identify chunks containing audio data
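
A minimal sketch of what this check amounts to, based on the AUDIO_TOKENS_REGEX constant listed under Constants below:

```python
import re

# Matches OrpheusTTS custom audio tokens, e.g. "<custom_token_1234>"
AUDIO_TOKENS_REGEX = re.compile(r"<custom_token_(\d+)>")


def string_contains_token(string: str) -> bool:
    # True if the chunk carries at least one audio token
    return AUDIO_TOKENS_REGEX.search(string) is not None
```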

build_prompt(voice: str, prompt: str) -> str

  • Description: Constructs the properly formatted prompt string required by the OrpheusTTS model, wrapping the input text with voice tags and special tokens.
  • Parameters:
    • voice (str): The voice profile to use (e.g., "leah", "max")
    • prompt (str): The text content to be converted to speech
  • Returns: str - The formatted prompt string with OrpheusTTS-specific tokens
  • Format: <custom_token_3>{voice}: {prompt}<|eot_id|><custom_token_4><custom_token_5><custom_token_1>
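
Based on the format string above, a minimal sketch:

```python
def build_prompt(voice: str, prompt: str) -> str:
    # Wrap the text in the voice tag and the OrpheusTTS control tokens
    return (
        f"<custom_token_3>{voice}: {prompt}"
        "<|eot_id|><custom_token_4><custom_token_5><custom_token_1>"
    )
```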

create_wav_header(sample_rate=24000, bits_per_sample=16, channels=1)

  • Description: Creates a standard WAV file header with the specified audio parameters. This function is adapted from the OrpheusTTS project.
  • Parameters:
    • sample_rate (int): Audio sample rate in Hz (default: 24000)
    • bits_per_sample (int): Bit depth of audio samples (default: 16)
    • channels (int): Number of audio channels (default: 1 for mono)
  • Returns: bytes - The WAV header as a byte string
  • Technical Details: Uses struct.pack to create a proper RIFF/WAVE header format
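
A sketch of such a header builder; the data_size parameter is an addition here for illustration, since the final size is not known up front when streaming (a common convention is to leave it at 0 or patch it afterwards):

```python
import struct


def create_wav_header(sample_rate=24000, bits_per_sample=16, channels=1,
                      data_size=0):
    # Derived header fields
    byte_rate = sample_rate * channels * bits_per_sample // 8
    block_align = channels * bits_per_sample // 8
    # Standard 44-byte RIFF/WAVE header for uncompressed PCM
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + data_size,   # RIFF chunk size
        b"WAVE",
        b"fmt ", 16,               # fmt subchunk size (16 for PCM)
        1,                         # audio format: 1 = PCM
        channels, sample_rate, byte_rate, block_align, bits_per_sample,
        b"data", data_size,        # data subchunk size
    )
```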

Constants

  • AUDIO_TOKENS_REGEX: Regular expression pattern r"<custom_token_(\d+)>" used to identify custom audio tokens in the streaming response

Decoder Module (decoder.py)

The decoder module handles the conversion of TTS model tokens into actual audio using the SNAC (Scalable Neural Audio Codec) model.

This module is adapted from the OrpheusTTS project and serves as a temporary solution until Llama.CPP gains native TTS support.

Global Variables

  • model: The global SNAC model instance used for audio decoding
  • snac_device: The device (CPU/CUDA) where the SNAC model is loaded

Functions

initialize_snac_model()

  • Description: Initializes the global SNAC model for audio decoding. Automatically detects and uses CUDA if available, otherwise falls back to CPU.
  • Device Selection: Uses the SNAC_DEVICE environment variable or auto-detects the best available device
  • Model: Loads the pre-trained SNAC model from "hubertsiuzdak/snac_24khz"
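
A minimal sketch of this initialization, assuming the snac package's published SNAC.from_pretrained API:

```python
import os

import torch
from snac import SNAC

model = None
snac_device = None


def initialize_snac_model():
    global model, snac_device
    # Honor SNAC_DEVICE if set, otherwise prefer CUDA when available
    snac_device = os.environ.get(
        "SNAC_DEVICE", "cuda" if torch.cuda.is_available() else "cpu"
    )
    model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
    model = model.to(snac_device)
```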

convert_to_audio(multiframe)

  • Description: Converts a sequence of audio codes into raw audio bytes using the SNAC model.
  • Parameters:
    • multiframe: List of audio codes representing frames to be decoded
  • Returns: bytes - Raw audio data in 16-bit PCM format, or None if conversion fails
  • Process:
    • Validates that the multiframe contains at least 7 codes
    • Organizes codes into three hierarchical levels (codes_0, codes_1, codes_2)
    • Performs bounds checking to ensure all codes are within the valid range (0-4096)
    • Uses the SNAC model to decode the codes into an audio waveform
    • Converts float audio to 16-bit integer format and returns as bytes
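
A sketch of the process above, following the 7-codes-per-frame split used by OrpheusTTS-style decoders (the exact level layout and output slicing are assumptions; model and snac_device are the module globals described earlier):

```python
import torch


def convert_to_audio(multiframe):
    if len(multiframe) < 7:
        return None
    frames = len(multiframe) // 7
    codes_0, codes_1, codes_2 = [], [], []
    for j in range(frames):
        f = multiframe[7 * j : 7 * j + 7]
        codes_0.append(f[0])                      # coarsest level: 1 code per frame
        codes_1.extend([f[1], f[4]])              # middle level: 2 codes per frame
        codes_2.extend([f[2], f[3], f[5], f[6]])  # finest level: 4 codes per frame
    codes = [
        torch.tensor(c, device=snac_device).unsqueeze(0)
        for c in (codes_0, codes_1, codes_2)
    ]
    # Bounds check: reject out-of-range codes before decoding
    if any(torch.any(c < 0) or torch.any(c > 4096) for c in codes):
        return None
    with torch.inference_mode():
        audio = model.decode(codes)  # float waveform in [-1, 1]
    # Keep only the newest samples of the overlapping window (slice assumed),
    # then convert float samples to 16-bit PCM bytes
    audio = audio[:, :, 2048:4096]
    return (audio.squeeze().cpu().numpy() * 32767).astype("int16").tobytes()
```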

turn_token_into_id(token_string, index)

  • Description: Extracts and converts custom tokens from the streaming response into audio code IDs.
  • Parameters:
    • token_string (str): String containing the custom token
    • index (int): Current position in the token sequence
  • Returns: int - The audio code ID, or None if parsing fails
  • Logic: Extracts the numeric part from custom tokens and applies mathematical transformation based on the index position
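
A sketch of this logic; the specific offsets (subtracting 10 plus a 4096-wide band per frame position) follow the OrpheusTTS reference decoder and are assumptions here:

```python
from tts.helper import AUDIO_TOKENS_REGEX


def turn_token_into_id(token_string, index):
    match = AUDIO_TOKENS_REGEX.search(token_string.strip())
    if match is None:
        return None  # chunk carried no audio token
    # Shift the raw token number into codebook space; each of the 7
    # positions in a frame occupies its own 4096-wide ID band.
    return int(match.group(1)) - 10 - ((index % 7) * 4096)
```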

tokens_decoder(token_gen) (Async)

  • Description: Asynchronous generator that processes a stream of tokens and yields audio chunks in real-time.
  • Parameters:
    • token_gen: Async generator yielding token strings
  • Yields: bytes - Audio data chunks as they become available
  • Buffering Strategy:
    • Maintains a buffer of audio codes
    • Processes codes in groups of 7 (representing one audio frame)
    • Yields audio when buffer contains at least 28 codes (4 frames)
    • Uses overlapping windows for smooth audio generation
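
A sketch of this buffering strategy as an async generator (details such as the token-validity check are assumptions):

```python
async def tokens_decoder(token_gen):
    buffer = []
    count = 0
    async for token_text in token_gen:
        token = turn_token_into_id(token_text, count)
        if token is None or token <= 0:
            continue  # skip chunks without a usable audio code
        buffer.append(token)
        count += 1
        # Once a full frame has arrived and at least 4 frames are buffered,
        # decode the trailing 28-code window for smooth, overlapping output
        if count % 7 == 0 and count > 27:
            audio = convert_to_audio(buffer[-28:])
            if audio is not None:
                yield audio
```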

tokens_decoder_sync(syn_token_gen)

  • Description: Synchronous wrapper around the async tokens_decoder function, enabling integration with synchronous code.
  • Parameters:
    • syn_token_gen: Synchronous generator yielding token strings
  • Yields: bytes - Audio data chunks
  • Implementation:
    • Converts synchronous generator to async generator
    • Runs async decoder in a separate thread
    • Uses a queue to bridge async and sync worlds
    • Returns audio chunks as they become available
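
A sketch of this bridge (the thread, queue, and sentinel handling are assumptions about the exact implementation):

```python
import asyncio
import queue
import threading


def tokens_decoder_sync(syn_token_gen):
    audio_queue = queue.Queue()

    async def async_token_gen():
        # Lift the synchronous generator into an async one
        for token in syn_token_gen:
            yield token

    async def producer():
        # Run the async decoder and hand chunks to the sync side
        async for chunk in tokens_decoder(async_token_gen()):
            audio_queue.put(chunk)
        audio_queue.put(None)  # sentinel: stream finished

    thread = threading.Thread(target=lambda: asyncio.run(producer()))
    thread.start()
    while (chunk := audio_queue.get()) is not None:
        yield chunk
    thread.join()
```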