Getting Started
Installation
Get the latest LLM-Server version with your desired method:

- Clone the current repository or download it as a `.zip` file
- Install Docker (see Installing Docker below)
- Inside the repository, run `docker compose up --profile=prod -d`
- The first startup will take a while
- Stop the server by running `docker compose stop` in the same directory
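Putting these steps together, a typical first run looks like this (the repository URL and directory name are placeholders):

```sh
# clone the repository and enter it (URL and directory are placeholders)
git clone <repository-url>
cd <repository-directory>

# build and start the production profile; the first startup takes a while
docker compose up --profile=prod -d

# stop the server again when you are done
docker compose stop
```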
Installing Docker
If you haven't already, you're going to need to install Docker for this next step.
- On Windows: https://www.docker.com/products/docker-desktop/ or https://rancherdesktop.io/
- On Linux (Ubuntu): https://docs.docker.com/desktop/install/ubuntu/
Make sure that you have your host environment set up for container GPU passthrough (needed for CUDA operations).
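A quick way to confirm that GPU passthrough works, assuming an NVIDIA card and the NVIDIA Container Toolkit on the host, is to run `nvidia-smi` from inside a throwaway container:

```sh
# should print the same GPU table as running nvidia-smi directly on the host
docker run --rm --gpus all ubuntu nvidia-smi
```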
The Service List and its purpose
If you have cloned the repository, you'll likely be overwhelmed by the sheer number of services in the compose file. Here is what each of them is for:
| Service Name | Profile | Description |
| --- | --- | --- |
| tts-training | training | Used to train a TTS model. You can find more on it here. |
| llm-server | prod | The LLM-Server production build. Used for actually running the server. |
| llm-server | prod-cpu | The LLM-Server production build, running on the CPU only. This makes the STT, LLM and TTS very slow, as everything depends on the CPU. !!! DOES NOT WORK CURRENTLY !!! OpenVoice currently assumes CUDA is available.[^1] |
| llm-server-dev | nvidia | The LLM-Server development environment. Benefits from mounting the source code as volumes instead of copying it into the image. Used for developing the LLM-Server. |
| llm-server-ogpu | ogpu | Another LLM-Server development environment. Supports GPUs other than NVIDIA. !!! DOES NOT WORK CURRENTLY !!! OpenVoice currently assumes CUDA is available.[^1] |
| llm-server-cpu | cpu | Another LLM-Server development environment. Uses the CPU instead of a GPU. This makes the STT, LLM and TTS very slow! !!! DOES NOT WORK CURRENTLY !!! OpenVoice currently assumes CUDA is available.[^1] |
| llm-server-testing | testing | Internal testing environment that runs on the CPU. Will always fail 2 tests because CUDA is missing. |
| llm-server-testing-gpu | testing-gpu | Internal testing environment that requires an NVIDIA GPU. |
For using the LLM-Server itself, we are only interested in the `prod` profile.
Installing the Docker Container for the first time
If you have cloned the repository, please change into the repository's directory.
On the command line, in the directory where the `compose.yml` resides, run the startup command from the installation steps above.
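For the production build this boils down to the command already shown in the installation steps (add `--build` to force a rebuild after changing the build arguments):

```sh
docker compose up --profile=prod -d
```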
You can change the values in `args` (except for the `DOCKER_BUILDKIT` arg!) to your liking.
There are also other values that are further explained here.
Setting Ports
If you have cloned the repository, the compose file has two sections for this. Ideally you'll only need the `prod` profile, but you never know which GPU you have available. In any case, you can set the ports like this:
version: "3"
services:
# ...
llm-server:
# ...
args:
# ...
OLLAMA_PORT: 25565
HTTP_PORT: 5000
# ...
# ...
ports:
- 25565:25565 # Ollama
- 5000:5000 # HTTP
# ...
The `Dockerfile` has built-in arguments for the Ollama and HTTP ports, and we can set those arguments here at will. As a second step, we also need to add those ports to the compose file's `ports` section. You can read more on the Dockerfile variables/internals here.
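As a rough illustration of how such build arguments are typically consumed, a hypothetical Dockerfile excerpt could look like this (only the argument names and default ports are taken from this documentation; the actual Dockerfile may differ):

```dockerfile
# Hypothetical sketch: build arguments with the documented default ports
ARG OLLAMA_PORT=25565
ARG HTTP_PORT=5000

# Hypothetical: expose the ports the services listen on inside the container
EXPOSE ${OLLAMA_PORT} ${HTTP_PORT}
```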
Note
You can technically leave the ports as they are on the Docker image and remap them on the host machine instead. To do that, rewrite the first value in the `ports` section: the port value on the left side of the colon is the port on the host machine, while the port value on the right side of the colon is the port inside the Docker container.
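For example, to reach the HTTP server on host port 8080 while the container keeps listening on port 5000 (an illustrative remapping, not part of the shipped compose file), only the left-hand side changes:

```yaml
ports:
  - 8080:5000   # host port 8080 -> container port 5000 (HTTP)
  - 25565:25565 # Ollama mapping left unchanged
```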
Other GPU (+ CPU-only) support?
If you do not have an NVIDIA GPU, you can use either the `ogpu` profile (currently experimental) to use another GPU (such as AMD's Radeon cards, via ROCm) or the `cpu` profile for CPU-only inference. (Warning: this makes inference much slower; it is intended for testing purposes only.)
Simply replace any `prod` profile mentions with one of the other two, as shown below.
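For example, starting the CPU-only environment uses the same invocation as in the installation steps, just with the profile swapped (keep in mind the warnings in the service table above):

```sh
# CPU-only inference (slow, for testing purposes only)
docker compose up --profile=cpu -d

# or, for non-NVIDIA GPUs (experimental)
docker compose up --profile=ogpu -d
```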
Running your container
Starting the LLM server
You start an already initialized instance with Docker Compose's standard lifecycle commands, or by simply running the startup command from the installation steps again inside the installation directory (see the sketch below). This will automatically start an HTTP server at the desired port. To change the port, change the `HTTP_PORT` environment variable in the Dockerfile. By default, the HTTP server will be accessible at port 5000 and the Ollama server will run at port 25565.
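A minimal sketch, assuming the stock Docker Compose lifecycle commands (the counterparts of the documented `docker compose stop`):

```sh
# restart containers that were created earlier and then stopped
docker compose start

# or bring the stack up again exactly as during installation
docker compose up --profile=prod -d
```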
Nice to know
The `dev` container mounts the installation directory's `ollama/models` directory in read-write mode under the virtual path `/root/build/models`.
It also mounts user-defined scripts (under the installation directory: `ollama/scripts`) at the virtual path `/root/build/scripts`.
(Beware: the container currently writes as `root` (user id: 0), meaning that, as a normal user, you do not have the privileges to edit any files created by the container.)
Annotations
[^1]:
    ```text
    Traceback (most recent call last):
    llm-server-cpu-1 |   File "/root/build/.venv/bin/waitress-serve", line 10, in <module>
    llm-server-cpu-1 |     sys.exit(run())
    llm-server-cpu-1 |   File "/root/build/.venv/lib/python3.10/site-packages/waitress/runner.py", line 235, in run
    llm-server-cpu-1 |     app = pkgutil.resolve_name(args[0])
    llm-server-cpu-1 |   File "/usr/lib/python3.10/pkgutil.py", line 691, in resolve_name
    llm-server-cpu-1 |     mod = importlib.import_module(gd['pkg'])
    llm-server-cpu-1 |   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    llm-server-cpu-1 |     return _bootstrap._gcd_import(name[level:], package, level)
    llm-server-cpu-1 |   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
    llm-server-cpu-1 |   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
    llm-server-cpu-1 |   File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
    llm-server-cpu-1 |   File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
    llm-server-cpu-1 |   File "<frozen importlib._bootstrap_external>", line 883, in exec_module
    llm-server-cpu-1 |   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
    llm-server-cpu-1 |   File "/root/build/src/route.py", line 48, in <module>
    llm-server-cpu-1 |     tts = TTSWrapper(reference_speaker=get_resource('tts/references/speaker2.mp3'))
    llm-server-cpu-1 |   File "/root/build/src/tts.py", line 86, in __init__
    llm-server-cpu-1 |     self.tone_color_converter = ToneColorConverter(f'{self.converter}/config.json')
    llm-server-cpu-1 |   File "/root/build/.venv/lib/python3.10/site-packages/openvoice/api.py", line 103, in __init__
    llm-server-cpu-1 |     super().__init__(*args, **kwargs)
    llm-server-cpu-1 |   File "/root/build/.venv/lib/python3.10/site-packages/openvoice/api.py", line 19, in __init__
    llm-server-cpu-1 |     assert torch.cuda.is_available()
    llm-server-cpu-1 | AssertionError
    ```