VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
TL;DR: VoiceCraft is a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data, including audiobooks, internet videos, and podcasts.
To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference.
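Token infilling means the model predicts a masked span of codec tokens conditioned on the surrounding context. A minimal, purely illustrative sketch of the idea, rearranging the sequence so a causal language model sees the context first and predicts the masked span last; the sentinel tokens and the exact rearrangement here are assumptions for illustration, not the actual VoiceCraft implementation:

```python
# Illustrative sketch of span infilling as sequence rearrangement:
# mask a span of tokens and move it to the end, so a causal LM can
# predict it conditioned on the unmasked context. The MASK/EOS
# sentinels and layout are hypothetical, not VoiceCraft's exact scheme.

MASK, EOS = "<M>", "<E>"

def rearrange_for_infilling(tokens, span_start, span_end):
    """Replace tokens[span_start:span_end] with a mask sentinel and
    append the masked span at the end as the prediction target."""
    prefix = tokens[:span_start]
    target = tokens[span_start:span_end]
    suffix = tokens[span_end:]
    return prefix + [MASK] + suffix + [MASK] + target + [EOS]

# masking "b c" out of "a b c d e": context first, masked span last
seq = rearrange_for_infilling(["a", "b", "c", "d", "e"], 1, 3)
```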
TODO
The remaining TODOs will be completed by the end of March 2024.
- Codebase upload
- Environment setup
- Inference demo for speech editing and TTS
- Upload model weights
- Training guidance
- Upload the RealEdit dataset
Environment setup
```bash
conda create -n voicecraft python=3.9.16
conda activate voicecraft

pip install torch==2.0.1 torchaudio==2.0.2 # this assumes your system is compatible with CUDA 11.7; otherwise check out https://pytorch.org/get-started/previous-versions/#v201
apt-get install ffmpeg # if you don't already have ffmpeg installed
pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft
apt-get install espeak-ng # backend for the phonemizer installed below
pip install phonemizer==3.2.1
pip install tensorboard
pip install datasets==2.12.0
# install MFA for forced alignment; this could take a few minutes
conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068
# conda install pocl # the line above emits a warning about pocl; unclear whether it is actually needed
# to run the ipynb notebooks
conda install -n voicecraft ipykernel --update-deps --force-reinstall
```
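MFA writes its forced alignments as Praat TextGrid files, which map each word to a start and end time. A minimal stdlib-only sketch of pulling word-level intervals out of one; the tier layout below is a simplified assumption (real TextGrid files carry header fields and often multiple tiers), and a dedicated package such as `textgrid` or `praatio` is more robust in practice:

```python
import re

# A simplified excerpt of a TextGrid "words" tier, for illustration.
SAMPLE_TEXTGRID = """\
item [1]:
    class = "IntervalTier"
    name = "words"
    intervals [1]:
        xmin = 0.00
        xmax = 0.48
        text = "hello"
    intervals [2]:
        xmin = 0.48
        xmax = 1.02
        text = "world"
"""

def word_intervals(textgrid_str):
    """Extract (start, end, word) triples; skips empty (silence) labels."""
    pattern = re.compile(
        r'xmin = ([\d.]+)\s*\n\s*xmax = ([\d.]+)\s*\n\s*text = "([^"]*)"')
    return [(float(a), float(b), w)
            for a, b, w in pattern.findall(textgrid_str) if w]

print(word_intervals(SAMPLE_TEXTGRID))
```

These word timestamps are what tell the model which stretch of audio corresponds to the text being edited.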
Inference Examples
Check out inference_speech_editing.ipynb and inference_tts.ipynb.
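Speech editing first needs to locate which words of the transcript changed, so the corresponding audio span can be masked and regenerated (a role presumably played by `edit_utils.py` in this repo). A hedged, stdlib-only sketch of that step using `difflib`; the function name and return shape are illustrative assumptions, not the repo's actual API:

```python
import difflib

def find_edit_span(orig_words, new_words):
    """Return (orig_start, orig_end, replacement_words) for the first
    region where the two transcripts differ, via SequenceMatcher opcodes.
    Returns None if the transcripts are identical."""
    sm = difflib.SequenceMatcher(a=orig_words, b=new_words)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            return i1, i2, new_words[j1:j2]
    return None

orig = "I like cold brew coffee".split()
new = "I like iced tea".split()
span = find_edit_span(orig, new)  # words 2..5 of the original were replaced
```

The word indices returned here would then be mapped to time spans through the forced alignment, identifying exactly which codec tokens to infill.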
License
The codebase is under CC BY-NC-SA 4.0 (LICENSE-CODE), and the model weights are under the Coqui Public Model License 1.0.0 (LICENSE-MODEL). Note that we use some code from other repositories that is under different licenses: ./models/codebooks_patterns.py is under the MIT License; ./models/modules, ./steps/optim.py, and data/tokenizer.py are under the Apache License, Version 2.0; the phonemizer we use is under the GNU GPL v3. As a drop-in replacement for the phonemizer (i.e. text-to-IPA-phoneme mapping), try g2p (MIT License) or OpenPhonemizer (BSD-3-Clause-Clear), although these are not tested.
Acknowledgement
We thank Feiteng for his VALL-E reproduction, and the audiocraft team for open-sourcing EnCodec.
Citation
@article{peng2024voicecraft,
author = {Peng, Puyuan and Huang, Po-Yao and Li, Daniel and Mohamed, Abdelrahman and Harwath, David},
title = {VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild},
journal = {arXiv},
year = {2024},
}
Disclaimer
Any organization or individual is prohibited from using any technology mentioned in this paper to generate or edit someone's speech without their consent; this includes, but is not limited to, government leaders, political figures, and celebrities. Failure to comply may place you in violation of copyright law.