mirror of
https://github.com/jasonppy/VoiceCraft.git
synced 2025-06-05 21:49:11 +02:00
add Dockerfile
This commit is contained in:
Dockerfile (new file, 24 lines)
@@ -0,0 +1,24 @@
FROM jupyter/base-notebook:python-3.9.13

USER root

# Install OS dependencies
RUN apt-get update && apt-get install -y git-core ffmpeg espeak-ng && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Update Conda, create the voicecraft environment, and install dependencies
RUN conda update -y -n base -c conda-forge conda && \
    conda create -y -n voicecraft python=3.9.16 && \
    conda run -n voicecraft conda install -y -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068 && \
    conda run -n voicecraft pip install torch==2.0.1 \
        tensorboard==2.16.2 \
        phonemizer==3.2.1 \
        torchaudio==2.0.2 \
        datasets==2.16.0 \
        torchmetrics==0.11.1 && \
    conda run -n voicecraft pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft

# Install the Jupyter kernel
RUN conda install -n voicecraft ipykernel --update-deps --force-reinstall -y && \
    conda run -n voicecraft python -m ipykernel install --name=voicecraft
README.md (27 lines)
@@ -21,8 +21,8 @@ To clone or edit an unseen voice, VoiceCraft needs only a few seconds of referen
- [ ] HuggingFace Spaces demo
- [ ] Better guidance on training/finetuning

## How to run TTS inference
There are two ways:
1. with docker. see [quickstart](#quickstart)
2. without docker. see [environment setup](#environment-setup)

@@ -31,7 +31,7 @@ When you are inside the docker image or you have installed all dependencies, Che
If you want to do model development such as training/finetuning, I recommend following [environment setup](#environment-setup) and [training](#training).

## QuickStart
:star: To try out TTS inference with VoiceCraft, the best way is to use docker. Thanks to [@ubergarm](https://github.com/ubergarm) and [@jayc88](https://github.com/jay-c88) for making this happen.

Tested on Linux and Windows; it should work on any host with docker installed.
```bash
@@ -43,23 +43,26 @@ cd VoiceCraft
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.13.5/install-guide.html
# sudo apt-get install -y nvidia-container-toolkit-base || yay -Syu nvidia-container-toolkit || echo etc...

-# 3. Try to start an existing container otherwise create a new one passing in all GPUs
+# 3. First build the docker image
+docker build --tag "voicecraft" .
+
+# 4. Try to start an existing container otherwise create a new one passing in all GPUs
./start-jupyter.sh # linux
start-jupyter.bat # windows

-# 4. now open a webpage on the host box to the URL shown at the bottom of:
+# 5. now open a webpage on the host box to the URL shown at the bottom of:
docker logs jupyter

-# 5. optionally look inside from another terminal
+# 6. optionally look inside from another terminal
docker exec -it jupyter /bin/bash
export USER=(your_linux_username_used_above)
export HOME=/home/$USER
sudo apt-get update

-# 6. confirm video card(s) are visible inside container
+# 7. confirm video card(s) are visible inside container
nvidia-smi

-# 7. Now in browser, open inference_tts.ipynb and work through one cell at a time
+# 8. Now in browser, open inference_tts.ipynb and work through one cell at a time
echo GOOD LUCK
```

@@ -91,13 +94,13 @@ If you have encountered version issues when running things, checkout [environmen
Check out [`inference_speech_editing.ipynb`](./inference_speech_editing.ipynb) and [`inference_tts.ipynb`](./inference_tts.ipynb)

## Training
To train a VoiceCraft model, you need to prepare the following parts:
1. utterances and their transcripts
2. encode the utterances into codes using e.g. Encodec
3. convert transcripts into phoneme sequences, and a phoneme set (we named it vocab.txt)
4. manifest (i.e. metadata)

Steps 1, 2, and 3 are handled in [./data/phonemize_encodec_encode_hf.py](./data/phonemize_encodec_encode_hf.py), where
1. Gigaspeech is downloaded through HuggingFace. Note that you need to sign an agreement in order to download the dataset (it needs your auth token)
2. phoneme sequences and encodec codes are also extracted using the script.

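The data-prep steps above can be sketched in miniature. This snippet illustrates only step 3 (collecting a phoneme set from already-phonemized transcripts); the phoneme sequences and the one-entry-per-line layout for vocab.txt are made up for illustration, not the repo's actual format:

```python
# Illustrative sketch of step 3: build a phoneme set (vocab) from
# phonemized transcripts. The sequences and the "index symbol" line
# layout are hypothetical, not the repo's actual vocab.txt format.
phonemized = [
    "HH AH L OW",  # "hello"
    "W ER L D",    # "world"
]

# The vocab is the sorted set of distinct phoneme symbols.
vocab = sorted({sym for seq in phonemized for sym in seq.split()})

# One "index symbol" pair per line, vocab.txt-style.
vocab_lines = [f"{i} {sym}" for i, sym in enumerate(vocab)]
print("\n".join(vocab_lines))
```

Fixing the vocab once and reusing it (rather than rebuilding it per run) is what keeps phoneme-to-token matching stable across training and inference, which is why the README asks you to download the published vocab.txt when using the pretrained model.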
@@ -119,7 +122,7 @@ python phonemize_encodec_encode_hf.py \
where encodec_model_path is available [here](https://huggingface.co/pyp1/VoiceCraft). This model is trained on Gigaspeech XL; it has 56M parameters and 4 codebooks, each with 2048 codes. Details are described in our [paper](https://jasonppy.github.io/assets/pdfs/VoiceCraft.pdf). If you encounter OOM during extraction, try decreasing batch_size and/or max_len.
The extracted codes, phonemes, and vocab.txt will be stored at `path/to/store_extracted_codes_and_phonemes/${dataset_size}/{encodec_16khz_4codebooks,phonemes,vocab.txt}`.

As for the manifest, please download train.txt and validation.txt from [here](https://huggingface.co/datasets/pyp1/VoiceCraft_RealEdit/tree/main), and put them under `path/to/store_extracted_codes_and_phonemes/manifest/`. Please also download vocab.txt from [here](https://huggingface.co/datasets/pyp1/VoiceCraft_RealEdit/tree/main) if you want to use our pretrained VoiceCraft model (so that the phoneme-to-token matching is the same).

Now, you are good to start training!

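As a back-of-the-envelope check on the codec described above: with 4 codebooks of 2048 codes each, one frame of codes carries 4 × log2(2048) = 44 bits. A minimal sketch of that arithmetic (the codec's frame rate is not stated here, so bitrate is deliberately left out):

```python
import math

# Per-frame code budget for the Encodec model described above:
# 4 codebooks, each with 2048 codes.
codebooks = 4
codes_per_book = 2048

bits_per_code = math.log2(codes_per_book)   # 11 bits per code
bits_per_frame = codebooks * bits_per_code  # 44 bits per frame
print(bits_per_frame)
```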
@@ -138,7 +141,7 @@ first install it with `pip install g2p`
```python
from g2p import make_g2p
transducer = make_g2p('eng', 'eng-ipa')
transducer("hello").output_string
# it will output: 'hʌloʊ'
``` -->