revised env setup, random seed effective
parent 2f78e8d435 · commit 991b1fe3bb
README.md (14 lines changed)
@@ -19,7 +19,7 @@ To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference
 - [x] Model weights (both 330M and 830M, the former seems to be just as good)
 - [ ] Write colab notebooks for better hands-on experience
 - [ ] HuggingFace Spaces demo
-- [ ] Better guidance on training
+- [ ] Better guidance on training/finetuning

 ## How to run TTS inference
 There are two ways:
@@ -28,6 +28,8 @@ There are two ways:

 When you are inside the docker image or you have installed all dependencies, check out [`inference_tts.ipynb`](./inference_tts.ipynb).

+If you want to do model development such as training/finetuning, I recommend following [environment setup](#environment-setup) and [training](#training).
+
 ## QuickStart
 :star: To try out TTS inference with VoiceCraft, the best way is using docker. Thanks to [@ubergarm](https://github.com/ubergarm) and [@jayc88](https://github.com/jay-c88) for making this happen.
@@ -66,13 +68,13 @@ echo GOOD LUCK
 conda create -n voicecraft python=3.9.16
 conda activate voicecraft

-pip install torch==2.0.1 # this assumes your system is compatible with CUDA 11.7, otherwise checkout https://pytorch.org/get-started/previous-versions/#v201
-apt-get install ffmpeg # if you don't already have ffmpeg installed
 pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft
+pip install xformers==0.0.22
+pip install torchaudio==2.0.2 torch==2.0.1 # this assumes your system is compatible with CUDA 11.7, otherwise check out https://pytorch.org/get-started/previous-versions/#v201
+apt-get install ffmpeg # if you don't already have ffmpeg installed
 apt-get install espeak-ng # backend for the phonemizer installed below
 pip install tensorboard==2.16.2
 pip install phonemizer==3.2.1
-pip install torchaudio==2.0.2
 pip install datasets==2.16.0
 pip install torchmetrics==0.11.1
 # install MFA for getting forced-alignment, this could take a few minutes
@@ -80,7 +82,7 @@ conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=
 # conda install pocl # above gives a warning about installing pocl, not sure if it's really needed

 # to run ipynb
-conda install -n voicecraft ipykernel --update-deps --force-reinstall
+conda install -n voicecraft ipykernel --no-deps --force-reinstall
 ```

 If you encounter version issues when running things, check out [environment.yml](./environment.yml) for exact version matching.
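After the install, a quick sanity check from Python can confirm that the revised pins actually took effect. A minimal sketch, assuming the pinned versions above installed cleanly (the expected version strings follow those pins):

```python
# Sanity-check the environment against the pins in this commit.
import torch
import torchaudio
import xformers

print(torch.__version__)          # expect 2.0.1, e.g. "2.0.1+cu117" for CUDA 11.7
print(torchaudio.__version__)     # expect 2.0.2
print(xformers.__version__)       # expect 0.0.22
print(torch.cuda.is_available())  # True if the CUDA build matches your driver
```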
@@ -129,7 +131,7 @@ bash e830M.sh


 ## License
-The codebase is under CC BY-NC-SA 4.0 ([LICENSE-CODE](./LICENSE-CODE)), and the model weights are under Coqui Public Model License 1.0.0 ([LICENSE-MODEL](./LICENSE-MODEL)). Note that we use some of the code from other repository that are under different licenses: `./models/codebooks_patterns.py` is under MIT license; `./models/modules`, `./steps/optim.py`, `data/tokenizer.py` are under Apache License, Version 2.0; the phonemizer we used is under GNU 3.0 License. For drop-in replacement of the phonemizer (i.e. text to IPA phoneme mapping), try [g2p](https://github.com/roedoejet/g2p) (MIT License) or [OpenPhonemizer](https://github.com/NeuralVox/OpenPhonemizer) (BSD-3-Clause Clear), although these are not tested.
+The codebase is under CC BY-NC-SA 4.0 ([LICENSE-CODE](./LICENSE-CODE)), and the model weights are under Coqui Public Model License 1.0.0 ([LICENSE-MODEL](./LICENSE-MODEL)). Note that we use some code from other repositories that are under different licenses: `./models/codebooks_patterns.py` is under the MIT license; `./models/modules`, `./steps/optim.py`, and `data/tokenizer.py` are under the Apache License, Version 2.0; the phonemizer we used is under the GNU GPL 3.0 license.

 <!-- How to use g2p to convert English text into IPA phoneme sequence
 first install it with `pip install g2p`
@@ -30,13 +30,15 @@
"# import libs\n",
|
"# import libs\n",
|
||||||
"import torch\n",
|
"import torch\n",
|
||||||
"import torchaudio\n",
|
"import torchaudio\n",
|
||||||
|
"import numpy as np\n",
|
||||||
|
"import random\n",
|
||||||
"\n",
|
"\n",
|
||||||
"from data.tokenizer import (\n",
|
"from data.tokenizer import (\n",
|
||||||
" AudioTokenizer,\n",
|
" AudioTokenizer,\n",
|
||||||
" TextTokenizer,\n",
|
" TextTokenizer,\n",
|
||||||
")\n",
|
")\n",
|
||||||
"\n",
|
"\n",
|
||||||
"from models import voicecraft\n"
|
"from models import voicecraft"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -72,6 +74,15 @@
 "silence_tokens = [1388,1898,131] # if there are long silences in the generated audio, reduce stop_repetition to 3, 2 or even 1\n",
 "stop_repetition = -1 # -1 means do not adjust prob of silence tokens. if there are long silences or unnaturally stretched words, increase sample_batch_size to 2, 3 or even 4\n",
 "# what this will do to the model is that the model will run sample_batch_size examples of the same audio, and pick the one that's the shortest\n",
+"def seed_everything(seed):\n",
+"    os.environ['PYTHONHASHSEED'] = str(seed)\n",
+"    random.seed(seed)\n",
+"    np.random.seed(seed)\n",
+"    torch.manual_seed(seed)\n",
+"    torch.cuda.manual_seed(seed)\n",
+"    torch.backends.cudnn.benchmark = False\n",
+"    torch.backends.cudnn.deterministic = True\n",
+"seed_everything(seed)\n",
 "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
 "\n",
 "# point to the original file or record the file\n",
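Unescaped from the notebook JSON, the added cell amounts to the following helper. A plain-Python rendering for readers skimming the diff (it assumes `os` is imported earlier in the notebook, alongside the `numpy` and `random` imports added above):

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed):
    # Seed every RNG in play so repeated runs generate the same audio.
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # Trade some speed for run-to-run determinism in cuDNN kernels.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

seed_everything(1)  # the notebooks set seed = 1; change it if you dislike a result
```

Note that `torch.cuda.manual_seed` seeds only the current GPU; `torch.cuda.manual_seed_all` is the multi-GPU variant.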
@@ -122,11 +122,13 @@
"\n",
|
"\n",
|
||||||
"import torch\n",
|
"import torch\n",
|
||||||
"import torchaudio\n",
|
"import torchaudio\n",
|
||||||
|
"import numpy as np\n",
|
||||||
|
"import random\n",
|
||||||
"\n",
|
"\n",
|
||||||
"from data.tokenizer import (\n",
|
"from data.tokenizer import (\n",
|
||||||
" AudioTokenizer,\n",
|
" AudioTokenizer,\n",
|
||||||
" TextTokenizer,\n",
|
" TextTokenizer,\n",
|
||||||
")"
|
")\n"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -241,6 +243,16 @@
 "sample_batch_size = 4 # NOTE: if there are long silences or unnaturally stretched words, increase sample_batch_size to 5 or higher. What this will do to the model is that the model will run sample_batch_size examples of the same audio, and pick the one that's the shortest. So if the speech rate of the generated audio is too fast, change it to a smaller number.\n",
 "seed = 1 # change seed if you are still unhappy with the result\n",
 "\n",
+"def seed_everything(seed):\n",
+"    os.environ['PYTHONHASHSEED'] = str(seed)\n",
+"    random.seed(seed)\n",
+"    np.random.seed(seed)\n",
+"    torch.manual_seed(seed)\n",
+"    torch.cuda.manual_seed(seed)\n",
+"    torch.backends.cudnn.benchmark = False\n",
+"    torch.backends.cudnn.deterministic = True\n",
+"seed_everything(seed)\n",
+"\n",
 "decode_config = {'top_k': top_k, 'top_p': top_p, 'temperature': temperature, 'stop_repetition': stop_repetition, 'kvcache': kvcache, \"codec_audio_sr\": codec_audio_sr, \"codec_sr\": codec_sr, \"silence_tokens\": silence_tokens, \"sample_batch_size\": sample_batch_size}\n",
 "from inference_tts_scale import inference_one_sample\n",
 "concated_audio, gen_audio = inference_one_sample(model, ckpt[\"config\"], phn2num, text_tokenizer, audio_tokenizer, audio_fn, target_transcript, device, decode_config, prompt_end_frame)\n",
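The `sample_batch_size` comment above describes a generate-several-and-keep-the-shortest selection rule. As an illustration only (a hypothetical helper, not the repo's actual code path inside `inference_one_sample`), the rule comes down to:

```python
import torch

def pick_shortest(candidates: list[torch.Tensor]) -> torch.Tensor:
    # candidates: 1-D audio tensors generated from the same prompt.
    # Overly long outputs typically carry stretched words or extra silence,
    # so keeping the shortest of sample_batch_size generations tends to
    # select the most natural-sounding one.
    return min(candidates, key=lambda audio: audio.shape[-1])
```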
@@ -280,7 +292,7 @@
 "kernelspec": {
 "display_name": "voicecraft",
 "language": "python",
-"name": "voicecraft"
+"name": "python3"
 },
 "language_info": {
 "codemirror_mode": {
@@ -292,7 +304,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.19"
+"version": "3.9.18"
 }
 },
 "nbformat": 4,