Getting Started
OpenVoice
The TTS technology we currently use is OpenVoice by MyShell AI. It is particularly good at cloning voices and can also convey emotions, which greatly benefits our use case.
Checkpoints
OpenVoice comes with pre-trained speakers and a pre-trained Tone Color Converter, which allows us to use the base languages out of the box and to accurately clone a reference voice from a sample audio.
For our current purposes, we use OpenVoice v1, as its checkpoints include a speaker that can convey emotions. However, this currently limits us to English or Chinese speakers, and in practice to English, as the English speaker is the only pre-trained one that can convey a wider range of emotions.
After building, you can find the checkpoints in the tts/checkpoints/ folder in the workspace.
Training
OpenVoice also provides a way to train your own speakers: MeloTTS. In the compose.yml, we added another profile that cannot be started through the setup scripts.
The training profile serves only to train a model; therefore, none of the downloads done by the normal container happen here.
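As a minimal sketch, assuming the profile is simply named training in the compose.yml (the actual profile name may differ), the training profile can be started with:
docker compose --profile training up --build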
To train your own model, create a new folder for your speaker in /tts/training-data.
Inside it, add a folder named wavs, which will contain the audio data set that you want to train your speaker on.
Finally, create a metadata.list. Here's an example of what the metadata.list should look like:
data/de-neutral/wavs/000.wav|DE-neutral|EN|Schaut, was die Katze hierher gebracht hat!
data/de-neutral/wavs/001.wav|DE-neutral|EN|Die Toilette befindet sich im 1. Stock.
data/de-neutral/wavs/002.wav|DE-neutral|EN|Wieso sollte ich mich in eine Gewürzgurke verwandeln?
data/de-neutral/wavs/003.wav|DE-neutral|EN|Alles kommt eines Tages zu einem Ende. Das Tropfen hört endlich auf.
data/de-neutral/wavs/004.wav|DE-neutral|EN|HelloRIC ist ein Bachelorprojekt an der Universität Bremen. Hier zerbrechen sich 13 Studenten den Kopf darüber, wie sie einen Roboter zusammenbauen können.
data/de-neutral/wavs/005.wav|DE-neutral|EN|KI wird die Welt erobern!
The metadata.list consists of four pipe-separated columns:
- The audio file path.
- The speaker name to train on.
- The language of the speaker (currently supported: EN, FR, ZH, ES, JP, KR).
- The text that is spoken in the audio.
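Putting it together, the training-data folder for the de-neutral speaker from the example above would look roughly like this (the individual file names are illustrative):
tts/training-data/
└── de-neutral/
    ├── metadata.list
    └── wavs/
        ├── 000.wav
        ├── 001.wav
        └── ...
Note that the file paths inside metadata.list (data/de-neutral/...) are resolved inside the training container, so they do not have to match the host path.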
To start the training, first navigate to the MeloTTS repository. From there, run the following commands:
python preprocess_text.py --metadata <path/to/metadata.list>
bash train.sh <path/to/config.json> <num_of_gpus>
The training should commence.
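For the hypothetical de-neutral speaker from the example above, the invocation could look like this (paths are illustrative; the preprocessing step is expected to generate a config.json next to the metadata.list, which is then passed to the training script):
python preprocess_text.py --metadata data/de-neutral/metadata.list
bash train.sh data/de-neutral/config.json 1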
Warning
Note that the training runs for a very large number of iterations, so in practice it runs more or less indefinitely. You can always cancel the training by pressing CTRL + C.
To test the current training, run:
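A minimal sketch, assuming MeloTTS's infer.py script and a checkpoint taken from the logs folder (the exact flags and checkpoint path may differ for your MeloTTS version):
python infer.py --text "Hallo, das ist ein Test." -m logs/de-neutral/G_<iter>.pth -o outputs/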
Note that the model you trained will only be available inside MeloTTS. However, our project supports these types of models directly. For accessibility reasons, you can even mount the training output path to your host system through the compose file by adding ./tts_models:/root/training/MeloTTS/melo/logs as an entry to the volumes section. The location where MeloTTS stores its trained models will always remain the same (unless they make it configurable, of course).
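As a sketch, the corresponding entry in the compose.yml could look like this (the service and profile names here are assumptions, not necessarily the names used in the project):
  tts-training:
    profiles:
      - training
    volumes:
      - ./tts_models:/root/training/MeloTTS/melo/logs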