## Running with Docker
This project can be run using Docker and Docker Compose; install them first if they are not already available.
There are two separate configurations available: one for running with NVIDIA GPU support and another for CPU-only execution.
IMPORTANT: Make sure to also clone the `ric-messages` git submodule located in the `src` folder.
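A typical way to do this (assuming the standard git submodule setup; `--recursive` also covers any nested submodules) is:

```bash
# Fetch and check out all registered submodules, including ric-messages under src/
git submodule update --init --recursive
```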
### With GPU Support
To run the application with GPU acceleration, you will need to have the NVIDIA Container Toolkit installed on your system.
Once you have the toolkit installed, you can run the application using the following command:
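This assumes the GPU configuration is the default compose file in the repository root; if it lives in a differently named file, pass it explicitly with `-f`:

```bash
# Build the images if necessary and start the llm and llm-node services with GPU support
docker compose up --build
```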
This will build and run the `llm` and `llm-node` services.
The `llm` service will automatically download the specified model and start the llama.cpp server with GPU support.
Important: the ROS2 node uses `rmw_zenoh` for ROS2 communication. Use the provided zenoh_router for this purpose.
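If the provided zenoh_router is not already running, `rmw_zenoh` also ships a standalone router daemon that can be started on the host instead (this assumes a ROS2 installation with `rmw_zenoh_cpp` available there):

```bash
# Start a standalone Zenoh router on the host; it listens on tcp/7447 by default
ros2 run rmw_zenoh_cpp rmw_zenohd
```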
### CPU-Only
If you do not have a compatible NVIDIA GPU, you can run the application in CPU-only mode.
To do this, use the `compose.cpu.yaml` file:
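For example (assuming the file sits in the repository root; drop `--build` if the images are already built):

```bash
# Start the services using the CPU-only configuration
docker compose -f compose.cpu.yaml up --build
```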
This will start the same services, but the `llm` service will be configured to run entirely on the CPU.
Note that inference in CPU-only mode will be considerably slower.
### Services
The Docker Compose configurations define two main services: `llm` and `llm-node`.
#### The `llm` Service
This service is responsible for running the llama.cpp server, which provides the core language model inference capabilities.
- The `llm` service uses a pre-built Docker image from `ghcr.io/ggml-org/llama.cpp` (`server-cuda` for GPU, `server` for CPU).
- It mounts the local `.models/llm` directory to `/root/.cache/llama.cpp` in the container, so all auto-downloaded models are stored on the local file system and don't need to be re-downloaded when a container is recreated.
- The server exposes an OpenAI-compatible API endpoint, which the `llm-node` service communicates with (see the example after this list).
- A healthcheck runs every 30 seconds to ensure `llm-node` starts only after the server is running.
- Check the `llama-server` documentation for all available arguments.
- With the default settings, we use the quantized version of Gemma 3 from Unsloth with the recommended settings for llama.cpp.
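As a quick sanity check of the OpenAI-compatible endpoint, you can query the server directly with `curl`. The sketch below assumes the compose file publishes the server's port 8080 to the host; from inside the Docker network the same request would target `http://llm:8080` instead.

```bash
# Minimal chat-completion request against the llama.cpp server
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Say hello in one short sentence."}
        ]
      }'
```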
##### Environment

| Variable | Description | Default Value |
|---|---|---|
| `LLAMACPP_MODEL_NAME` | The name of the model to download from Hugging Face. | `unsloth/gemma-3-12b-it-qat-GGUF:Q4_K_M` |
| `LLAMACPP_CONTEXT_LENGTH` | The context length of the LLM. | `16384` |
| `LLAMACPP_N_GPU_LAYERS` | The number of layers to offload to the GPU. | `49` |
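If the compose files read these variables from the shell environment (variable substitution is an assumption here; otherwise edit the compose file directly), a different model or context length can be selected at startup. The model name below is purely illustrative:

```bash
# Hypothetical override: smaller context window and a different GGUF model
LLAMACPP_MODEL_NAME="unsloth/gemma-3-4b-it-qat-GGUF:Q4_K_M" \
LLAMACPP_CONTEXT_LENGTH=8192 \
docker compose up
```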
#### The `llm-node` Service
This service runs the ROS2 client node that acts as a bridge between the ROS2 ecosystem and the `llm` service.
- Uses `harbor.hb.dfki.de/helloric/ros_llm:latest` (VPN required) or builds from the local Dockerfile.
- The node provides a ROS2 service at `/llm` that allows other ROS2 nodes to send prompts and receive completions from the language model (see the sketch after this list).
- It also offers a `/clear_history` service to reset the conversation.
- It communicates with the `llm` service over the internal Docker network.
- It is configured to start only after the `llm` service is healthy and running.
- It uses Zenoh as the RMW implementation by default. To change it, refer to the `zenoh_router` documentation.
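To exercise the `/llm` service from the host or another container, something like the following can be used. The service type and request field shown here are hypothetical; the real definitions live in the `ric-messages` submodule, so check them with `ros2 service type` first.

```bash
# Inspect the actual service types (the names used below are assumptions)
ros2 service type /llm
ros2 service type /clear_history

# Send a prompt (hypothetical type and field name)
ros2 service call /llm ric_messages/srv/LLM "{prompt: 'What is ROS2?'}"
```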
##### Environment

| Variable | Description | Default Value |
|---|---|---|
| `LLAMACPP_URL` | URL of the llama.cpp server. | `http://llm:8080/v1/chat/completions` |
| `PYTHONUNBUFFERED` | Prevents Python from buffering stdout and stderr. | `1` |
| `RMW_IMPLEMENTATION` | ROS2 middleware implementation. | `rmw_zenoh_cpp` |
| `ROS_AUTOMATIC_DISCOVERY_RANGE` | Disables automatic discovery in ROS2. | `OFF` |
| `ZENOH_ROUTER_CHECK_ATTEMPTS` | Number of attempts to check for the Zenoh router; `0` means wait indefinitely. | `0` |
| `ZENOH_CONFIG_OVERRIDE` | Zenoh configuration override, see rmw_zenoh. | `mode="client";connect/endpoints=["tcp/host.docker.internal:7447"]` |