Code Documentation

This document provides an overview of the stt_node.py script, which is the core of the ros_stt package.

`SpeechToTextNode`

The SpeechToTextNode class is a ROS2 node that acts as a client to an OpenAI-compatible speech-to-text (STT) server such as whisper.cpp. It exposes a ROS2 service that transcribes audio into text.

Parameters

The node exposes the following ROS2 parameter:

| Parameter | Type | Description | Default Value |
| --- | --- | --- | --- |
| `server_url` | string | The URL of the whisper.cpp server endpoint. | `http://localhost:8080/inference` |
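
As a minimal sketch of overriding this parameter at startup, the helper below assembles a `ros2 run` command line with a `server_url` override. The helper name `stt_node_command` and the executable name `stt_node` are assumptions for illustration; the actual executable name depends on the package's entry points.

```python
from typing import List


def stt_node_command(
    server_url: str,
    package: str = "ros_stt",
    executable: str = "stt_node",  # assumed executable name, not confirmed by this document
) -> List[str]:
    """Build a `ros2 run` command that overrides the server_url parameter."""
    return [
        "ros2", "run", package, executable,
        "--ros-args", "-p", f"server_url:={server_url}",
    ]


# Example: point the node at a server on another host (hypothetical address).
cmd = stt_node_command("http://192.168.1.20:8080/inference")
```

The resulting list can be passed to `subprocess.run(cmd)` or joined with spaces and pasted into a shell.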

Services

The node provides one service:

`/stt`

Type: ric_messages/srv/AudioBytesToText
Description: This service takes a raw audio byte array and returns the transcribed text along with the detected language.
Request:
- audio (uint8[]): The raw audio data to be transcribed.
Response:
- text (string): The transcribed text from the audio.
- language (string): The language automatically detected by the server.
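
A caller might use the service roughly as sketched below. This is an illustration under assumptions, not code from the package: the helper names `read_audio_bytes` and `call_stt`, the client node name, and the timeouts are hypothetical, and `call_stt` only runs in an environment where `rclpy` and `ric_messages` are installed.

```python
from pathlib import Path
from typing import List, Tuple


def read_audio_bytes(path: str) -> List[int]:
    """Read a file and return its content as a list of ints in 0..255 (uint8[])."""
    return list(Path(path).read_bytes())


def call_stt(audio: List[int], timeout_sec: float = 30.0) -> Tuple[str, str]:
    """Call /stt once and return (text, language). Requires a ROS2 environment."""
    # ROS2 imports are kept local so read_audio_bytes works without ROS2 installed.
    import rclpy
    from ric_messages.srv import AudioBytesToText

    rclpy.init()
    node = rclpy.create_node("stt_client_example")  # hypothetical node name
    try:
        client = node.create_client(AudioBytesToText, "/stt")
        if not client.wait_for_service(timeout_sec=5.0):
            raise RuntimeError("/stt service is not available")

        request = AudioBytesToText.Request()
        request.audio = audio

        future = client.call_async(request)
        rclpy.spin_until_future_complete(node, future, timeout_sec=timeout_sec)
        response = future.result()
        if response is None:
            raise RuntimeError("/stt call did not complete in time")
        return response.text, response.language
    finally:
        node.destroy_node()
        rclpy.shutdown()
```

With the node running, `call_stt(read_audio_bytes("sample.wav"))` would return the transcription and the detected language; `sample.wav` is a placeholder file name.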

How it Works

Initialization: The node starts, declares its server_url parameter, and creates the /stt service.
Service Call: Another ROS2 node calls the /stt service with a request containing the raw audio data as a uint8 array.
Callback Execution: The speech_to_text_callback method is triggered.
Data Preparation: The incoming uint8 array is wrapped in an io.BytesIO object to be sent as a file in an HTTP request.
API Request: The node sends the audio data in a multipart/form-data POST request to the URL configured in server_url. It requests the verbose_json response format so that the response includes the detected language in addition to the text.
Response Handling:
- If the server returns a successful response (HTTP 200), the node parses the JSON payload.
- It extracts the text and language fields from the response.
- The extracted data is populated into the ROS service response object.
- If the server returns an error or the transcription is empty, an appropriate error or warning is logged.
Return to Caller: The ROS service response, containing the text and language, is returned to the original caller.
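
The data preparation, request, and response handling steps above can be sketched with two small helpers. This is a sketch under assumptions rather than the node's actual code: the function names, the form field name `file`, the file name `audio.wav`, the `audio/wav` content type, and the whitespace stripping are all illustrative choices. Only `io.BytesIO`, `verbose_json`, and the `text` and `language` fields come from the description above.

```python
import io
import json
from typing import Dict, Iterable, Optional, Tuple


def build_multipart_payload(audio: Iterable[int]) -> Tuple[Dict, Dict]:
    """Wrap a uint8 sequence in BytesIO and build multipart form parts."""
    audio_file = io.BytesIO(bytes(audio))
    # Assumed field name, file name, and content type.
    files = {"file": ("audio.wav", audio_file, "audio/wav")}
    # verbose_json is requested so the reply carries the detected language.
    data = {"response_format": "verbose_json"}
    return files, data


def parse_transcription(status_code: int, body: str) -> Optional[Tuple[str, str]]:
    """Return (text, language) for an HTTP 200 JSON reply, otherwise None."""
    if status_code != 200:
        return None
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    # Stripping surrounding whitespace is an assumption, not documented behavior.
    text = str(payload.get("text", "")).strip()
    language = str(payload.get("language", ""))
    return text, language
```

If the node uses the `requests` library (not stated here), the two dictionaries would map onto `requests.post(server_url, files=files, data=data)`. A `None` result or an empty `text` corresponds to the cases in which the node logs an error or a warning.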