# Code Documentation

This document provides an overview of the `stt_node.py` script, which is the core of the `ros_stt` package.
## SpeechToTextNode

The `SpeechToTextNode` class is a ROS2 node that acts as a client to an OpenAI-compatible STT server. It exposes a ROS2 service to transcribe audio into text.
### Parameters
The node exposes the following ROS2 parameter:
| Parameter | Type | Description | Default Value |
|---|---|---|---|
| `server_url` | string | The URL of the whisper.cpp server endpoint. | `http://localhost:8080/inference` |
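For reference, the snippet below is a minimal sketch of how `server_url` might be declared and read with the standard rclpy API; the node name and attribute names other than `server_url` are illustrative and not taken from `stt_node.py`.

```python
import rclpy
from rclpy.node import Node


class SpeechToTextNode(Node):
    def __init__(self):
        # The node name here is illustrative; the real name is set in stt_node.py.
        super().__init__('speech_to_text_node')
        # Declare the parameter with the documented default value...
        self.declare_parameter('server_url', 'http://localhost:8080/inference')
        # ...and resolve whatever value was supplied at launch time.
        self.server_url = self.get_parameter('server_url').get_parameter_value().string_value
```

Assuming the executable is registered as `stt_node`, the parameter can be overridden at launch time with `ros2 run ros_stt stt_node --ros-args -p server_url:=http://my-server:8080/inference`.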
### Services
The node provides one main service:
#### `/stt`
- Type: `ric_messages/srv/AudioBytesToText`
- Description: This service takes a raw audio byte array and returns the transcribed text along with the detected language. A minimal example client is shown after this list.
- Request:
  - `audio` (`uint8[]`): The raw audio data to be transcribed.
- Response:
  - `text` (`string`): The transcribed text from the audio.
  - `language` (`string`): The language automatically detected by the server.
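As an illustration, a minimal rclpy client for this service could look like the following sketch. It assumes only the service name and message type listed above; the client node name, the helper function, and the input file are hypothetical.

```python
import rclpy
from rclpy.node import Node

from ric_messages.srv import AudioBytesToText


def transcribe(audio_bytes: bytes) -> None:
    """Call /stt once with raw audio bytes and log the result."""
    rclpy.init()
    node = Node('stt_client_example')  # hypothetical node name
    client = node.create_client(AudioBytesToText, '/stt')

    if not client.wait_for_service(timeout_sec=5.0):
        node.get_logger().error('/stt service is not available')
    else:
        request = AudioBytesToText.Request()
        request.audio = list(audio_bytes)  # uint8[] is a sequence of ints in rclpy
        future = client.call_async(request)
        rclpy.spin_until_future_complete(node, future)
        result = future.result()
        node.get_logger().info(f'text={result.text!r} language={result.language!r}')

    node.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    # speech.wav is a placeholder; use any raw audio the server accepts.
    with open('speech.wav', 'rb') as f:
        transcribe(f.read())
```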
### How it Works
- Initialization: The node starts, declares its `server_url` parameter, and creates the `/stt` service.
- Service Call: Another ROS2 node calls the `/stt` service with a request containing the raw audio data as a `uint8` array.
- Callback Execution: The `speech_to_text_callback` method is triggered; a sketch of this callback appears after the list.
- Data Preparation: The incoming `uint8` array is wrapped in an `io.BytesIO` object to be sent as a file in an HTTP request.
- API Request: The node sends the audio data in a `multipart/form-data` POST request to the whisper.cpp server URL. It specifically requests a `verbose_json` response to ensure it receives the detected language in addition to the text.
- Response Handling:
  - If the server returns a successful response (HTTP 200), the node parses the JSON payload.
  - It extracts the `text` and `language` fields from the response.
  - The extracted data is populated into the ROS service response object.
  - If the server returns an error or the transcription is empty, an appropriate error or warning is logged.
- Return to Caller: The ROS service response, containing the text and language, is returned to the original caller.
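The flow above can be summarized in the hedged sketch below, written as a method of the node and using the `requests` library. The multipart field names, the filename and MIME type, the timeout, and the log messages are assumptions for illustration, not a copy of `stt_node.py`.

```python
import io

import requests


def speech_to_text_callback(self, request, response):
    # Wrap the raw uint8[] payload so it can be attached as a file upload.
    audio_buffer = io.BytesIO(bytes(request.audio))

    try:
        # whisper.cpp's /inference endpoint accepts multipart/form-data;
        # verbose_json is requested so the detected language is returned too.
        # The filename, MIME type, and timeout below are assumptions.
        http_response = requests.post(
            self.server_url,
            files={'file': ('audio.wav', audio_buffer, 'audio/wav')},
            data={'response_format': 'verbose_json'},
            timeout=30.0,
        )
    except requests.RequestException as error:
        self.get_logger().error(f'STT request failed: {error}')
        return response

    if http_response.status_code == 200:
        payload = http_response.json()
        response.text = payload.get('text', '').strip()
        response.language = payload.get('language', '')
        if not response.text:
            self.get_logger().warning('Server returned an empty transcription')
    else:
        self.get_logger().error(
            f'STT server returned HTTP {http_response.status_code}')

    return response
```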