
Getting Started

Installation

Get the latest LLM-Server version and set it up as follows:

  1. Clone the current repository or download it as a .zip file and extract it.
  2. Install Docker (see Installing Docker below).
  3. Inside the repository, run docker compose --profile=prod up -d (a consolidated sketch follows below).
    1. The first startup will take a while.
  4. Stop the server by running docker compose stop in the same directory.
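
Put together, a typical first-time setup looks roughly like this (the repository URL is a placeholder; substitute the actual clone URL):

    # clone the repository (URL is a placeholder) and enter it
    git clone https://example.com/your-org/llm-server.git
    cd llm-server

    # build and start the production profile; the first startup takes a while
    docker compose --profile=prod up -d

    # stop the server again when you are done
    docker compose stop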

Installing Docker

If you haven't already, you're going to need to install Docker for this next step.

On Windows:

https://www.docker.com/products/docker-desktop/

or

https://rancherdesktop.io/

On Linux (Ubuntu):

https://docs.docker.com/desktop/install/ubuntu/

Make sure that you have your host environment set up for container GPU passthrough (needed for CUDA operations).
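
One common way to check that passthrough works (assuming an NVIDIA GPU with the NVIDIA Container Toolkit installed; the CUDA image tag is only an example and may need adjusting):

    # should print the same GPU table as running nvidia-smi directly on the host
    docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi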

The services and their purpose

If you have cloned the repository, you'll likely be overwhelmed by the sheer number of services in the compose file. Here is what each of them is for:

| Service Name | Profile | Description |
| --- | --- | --- |
| tts-training | training | Used to train a TTS model. You can find more on it here. |
| llm-server | prod | The LLM-Server production build. Used for actually running the server. |
| llm-server | prod-cpu | The LLM-Server production build, running on the CPU only. This makes STT, LLM and TTS very slow. !!! DOES NOT WORK CURRENTLY !!! OpenVoice currently assumes CUDA is available (see annotation 1). |
| llm-server-dev | nvidia | The LLM-Server development environment. Benefits from mounting the source code as volumes instead of copying it into the image. Used for developing the LLM-Server. |
| llm-server-ogpu | ogpu | Another LLM-Server development environment. Supports GPUs other than NVIDIA's. !!! DOES NOT WORK CURRENTLY !!! OpenVoice currently assumes CUDA is available (see annotation 1). |
| llm-server-cpu | cpu | Another LLM-Server development environment. Uses the CPU instead of a GPU. This makes STT, LLM and TTS very slow! !!! DOES NOT WORK CURRENTLY !!! OpenVoice currently assumes CUDA is available (see annotation 1). |
| llm-server-testing | testing | Internal testing environment that runs on the CPU. Will always fail 2 tests because CUDA is missing. |
| llm-server-testing-gpu | testing-gpu | Internal testing environment that requires an NVIDIA GPU. |

For using the LLM-Server itself, we are only interested in the prod profile.
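
If you are ever unsure which services a given profile enables, you can ask Compose directly from the repository directory:

    # list the services that the prod profile would start
    docker compose --profile=prod config --services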

Installing the Docker Container for the first time

If you have cloned the repository, change into the repository's directory.

On the command line, in the directory where the compose.yml resides, run one of:

docker compose --profile=prod up -d
docker compose --profile=prod-cpu up -d

You can change the values in the args section of the compose file (except for the DOCKER_BUILDKIT arg!) to your liking. There are also other values, which are explained in more detail here.
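
The first build and startup can take a while. To check on progress and confirm that the container came up, you can run the following in the same directory:

    # show the state of the services in the prod profile
    docker compose --profile=prod ps

    # follow the server logs
    docker compose logs -f llm-server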

Setting Ports

If you have cloned the repository, the compose file has two sections for this. Ideally you'll only need the prod profile, but you never know which GPU you have available. In any case, you can set the ports like this:

    version: "3"
    services:
      # ...
      llm-server:
        # ...
        build:
          # ...
          args:
            # ...
            OLLAMA_PORT: 25565
            HTTP_PORT: 5000
            # ...
        ports:
          - 25565:25565 # Ollama
          - 5000:5000 # HTTP
        # ...

The Dockerfile has built-in build arguments for the Ollama and HTTP ports, and we can set those arguments here as needed. As a second step, however, we also have to map those ports in the ports section of the compose file. You can read more on the Dockerfile variables/internals here.

Note

You can technically leave the ports as they are inside the Docker image and remap them on the host machine instead. To do that, change the first value of the relevant entry in the ports section: the value on the left side of the colon is the port exposed on the host machine, while the value on the right side of the colon is the port inside the container.
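
For example, to keep the container's HTTP port at 5000 but reach it on host port 8080 (8080 is an arbitrary free port chosen for illustration):

    ports:
      - 8080:5000   # host port 8080 -> container HTTP port 5000
      - 25565:25565 # Ollama, unchanged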

Other GPU (+ CPU-only) support?

If you do not have an NVIDIA GPU, you can use either the othergpu profile (currently experimental) to use another GPU (such as AMD's Radeon cards, via ROCm) or the cpu profile for CPU-only inference. (Warning: this makes inference much slower and is intended for testing purposes only.)

Simply replace any prod profile mentions with one of the other two.
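
For example, to start the CPU-only variant instead of the prod build:

    docker compose --profile=cpu up -d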

Running your container

Starting the LLM server

You start an already initialized instance by running:

docker compose start llm-server
or by simply running
docker compose --profile=prod up -d
inside the installation directory again. This will automatically start an HTTP server at the desired port. To change the port, change the HTTP_PORT variable in the Dockerfile (see Setting Ports above). By default, the server will be accessible at port 5000. The Ollama server will run at port 25565.
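
Assuming the default ports, you can sanity-check that the bundled Ollama server is reachable via Ollama's standard model-list endpoint (adjust the port if you changed it):

    # should return a JSON list of the models Ollama currently has available
    curl http://localhost:25565/api/tags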

Nice to know

The dev container mounts the installation directory's ollama/models directory in read-write mode at the container path /root/build/models. It also mounts user-defined scripts from ollama/scripts (relative to the installation directory) at /root/build/scripts. Beware: the container currently writes as root (user ID 0), meaning that as a normal user you do not have the privileges to edit files created by the container.
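
As a rough sketch, the corresponding volume mounts for the dev service in the compose file would look something like this (paths taken from the description above; the actual file may differ):

    services:
      llm-server-dev:
        # ...
        volumes:
          - ./ollama/models:/root/build/models   # model storage, read-write
          - ./ollama/scripts:/root/build/scripts # user-defined scripts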

Annotations

Traceback (most recent call last):
llm-server-cpu-1  |   File "/root/build/.venv/bin/waitress-serve", line 10, in <module>
llm-server-cpu-1  |     sys.exit(run())
llm-server-cpu-1  |   File "/root/build/.venv/lib/python3.10/site-packages/waitress/runner.py", line 235, in run
llm-server-cpu-1  |     app = pkgutil.resolve_name(args[0])
llm-server-cpu-1  |   File "/usr/lib/python3.10/pkgutil.py", line 691, in resolve_name
llm-server-cpu-1  |     mod = importlib.import_module(gd['pkg'])
llm-server-cpu-1  |   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
llm-server-cpu-1  |     return _bootstrap._gcd_import(name[level:], package, level)
llm-server-cpu-1  |   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
llm-server-cpu-1  |   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
llm-server-cpu-1  |   File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
llm-server-cpu-1  |   File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
llm-server-cpu-1  |   File "<frozen importlib._bootstrap_external>", line 883, in exec_module
llm-server-cpu-1  |   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
llm-server-cpu-1  |   File "/root/build/src/route.py", line 48, in <module>
llm-server-cpu-1  |     tts = TTSWrapper(reference_speaker=get_resource('tts/references/speaker2.mp3'))
llm-server-cpu-1  |   File "/root/build/src/tts.py", line 86, in __init__
llm-server-cpu-1  |     self.tone_color_converter = ToneColorConverter(f'{self.converter}/config.json')
llm-server-cpu-1  |   File "/root/build/.venv/lib/python3.10/site-packages/openvoice/api.py", line 103, in __init__
llm-server-cpu-1  |     super().__init__(*args, **kwargs)
llm-server-cpu-1  |   File "/root/build/.venv/lib/python3.10/site-packages/openvoice/api.py", line 19, in __init__
llm-server-cpu-1  |     assert torch.cuda.is_available()
llm-server-cpu-1  | AssertionError