Code Documentation

This document provides an overview of the stt_node.py script, which is the core of the ros_stt package.

SpeechToTextNode

The SpeechToTextNode class is a ROS2 node that acts as a client to an OpenAI-compatible STT server. It exposes a ROS2 service to transcribe audio into text.

Parameters

The node exposes the following ROS2 parameter:

Parameter  | Type   | Description                                 | Default Value
server_url | string | The URL of the whisper.cpp server endpoint. | http://localhost:8080/inference
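
As an illustration of how such a parameter is typically declared and read in rclpy, here is a minimal sketch. The helper name declare_server_url and the constant DEFAULT_SERVER_URL are hypothetical and are not taken from stt_node.py; only the parameter name and its default come from the table above.

```python
# Default taken from the parameter table; the constant name is hypothetical.
DEFAULT_SERVER_URL = "http://localhost:8080/inference"


def declare_server_url(node) -> str:
    """Declare server_url on an rclpy node and return its effective value.

    `node` is expected to be an rclpy.node.Node. declare_parameter() returns
    a Parameter object whose .value is either the default or an override
    supplied at startup.
    """
    return node.declare_parameter("server_url", DEFAULT_SERVER_URL).value
```

With a standard ROS2 setup, the default can be overridden at startup by appending `--ros-args -p server_url:=http://<host>:<port>/inference` to the command that runs the node.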

Services

The node provides one main service:

/stt

  • Type: ric_messages/srv/AudioBytesToText
  • Description: This service takes a raw audio byte array and returns the transcribed text along with the detected language.
  • Request:
    • audio (uint8[]): The raw audio data to be transcribed.
  • Response:
    • text (string): The transcribed text from the audio.
    • language (string): The language automatically detected by the server.
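
A caller might invoke the service roughly as follows. This is a sketch under assumptions rather than code from the package: it assumes ric_messages is built and sourced, and the node name stt_client_example, the helper to_uint8_array, and the 5-second wait are hypothetical choices made for illustration.

```python
import array
import sys


def to_uint8_array(data: bytes) -> array.array:
    """Convert raw bytes to array('B'), the type rclpy uses for uint8[] fields."""
    return array.array("B", data)


def main(path: str) -> None:
    # ROS imports live inside main() so the helper above can be used
    # without a sourced ROS2 environment.
    import rclpy
    from ric_messages.srv import AudioBytesToText

    rclpy.init()
    node = rclpy.create_node("stt_client_example")  # hypothetical node name
    client = node.create_client(AudioBytesToText, "/stt")
    try:
        if not client.wait_for_service(timeout_sec=5.0):
            node.get_logger().error("/stt service is not available")
            return

        # Fill the request with the raw bytes of an audio file.
        request = AudioBytesToText.Request()
        with open(path, "rb") as f:
            request.audio = to_uint8_array(f.read())

        # Call the service and block until the response arrives.
        future = client.call_async(request)
        rclpy.spin_until_future_complete(node, future)
        result = future.result()
        if result is not None:
            node.get_logger().info(f"[{result.language}] {result.text}")
    finally:
        node.destroy_node()
        rclpy.shutdown()


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Passing an array('B') rather than a plain list avoids a per-element type check when rclpy assigns the field, which matters for audio buffers of any real size.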

How it Works

  1. Initialization: The node starts, declares its server_url parameter, and creates the /stt service.
  2. Service Call: Another ROS2 node calls the /stt service with a request containing the raw audio data as a uint8 array.
  3. Callback Execution: The speech_to_text_callback method is triggered.
  4. Data Preparation: The incoming uint8 array is wrapped in an io.BytesIO object to be sent as a file in an HTTP request.
  5. API Request: The node sends the audio data in a multipart/form-data POST request to the whisper.cpp server URL. It requests the verbose_json response format so that the reply includes the detected language as well as the text.
  6. Response Handling:
    • If the server returns a successful response (HTTP 200), the node parses the JSON payload.
    • It extracts the text and language fields from the response.
    • The extracted data is populated into the ROS service response object.
    • If the server returns an error or the transcription is empty, an appropriate error or warning is logged.
  7. Return to Caller: The ROS service response, containing the text and language, is returned to the original caller.
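
Steps 4 to 6 can be sketched as a single function. This is not the code from stt_node.py: the function name transcribe, the form field name file, the audio.wav filename and MIME type, the 30-second timeout, the injectable post argument, and the choice to return empty strings on failure are all assumptions made for illustration.

```python
import io
import logging

logger = logging.getLogger("stt_sketch")


def transcribe(audio, server_url, post=None):
    """Send audio to the STT server and return (text, language).

    `audio` is any sequence of uint8 values (bytes, list, array('B')).
    `post` defaults to requests.post; it is injectable so the function
    can be exercised without a running server.
    """
    if post is None:
        import requests  # third-party dependency, needed only for real calls
        post = requests.post

    # Step 4: wrap the uint8 data in a file-like object.
    audio_file = io.BytesIO(bytes(audio))

    # Step 5: multipart/form-data POST requesting verbose_json.
    response = post(
        server_url,
        files={"file": ("audio.wav", audio_file, "audio/wav")},  # assumed field name
        data={"response_format": "verbose_json"},
        timeout=30,
    )

    # Step 6: handle the response.
    if response.status_code != 200:
        logger.error("STT server returned HTTP %s", response.status_code)
        return "", ""

    payload = response.json()
    text = payload.get("text", "").strip()
    language = payload.get("language", "")
    if not text:
        logger.warning("Transcription is empty")
    return text, language
```

In a service callback, the two returned values would be copied into the response's text and language fields before the response is returned to the caller.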