
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

Paper HuggingFace Colab Replicate YouTube demo Demo page


VoiceCraft is a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data, including audiobooks, internet videos, and podcasts.

To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference.

How to run inference

There are four ways (besides running Gradio in Colab):

  1. More flexible inference beyond the Gradio UI in Google Colab. See quickstart colab.
  2. With Docker. See quickstart docker.
  3. Without Docker. See environment setup. You can also run Gradio locally if you choose this option.
  4. As a standalone script that you can easily integrate into other projects. See quickstart command line.

Once you are inside the Docker image or have installed all dependencies, check out inference_tts.ipynb.

If you want to do model development such as training/finetuning, I recommend following environment setup and training.


04/22/2024: 330M/830M TTS Enhanced Models are up here, load them through inference_tts.ipynb! Replicate demo is up, major thanks to @chenxwh!

04/11/2024: VoiceCraft Gradio is now available on HuggingFace Spaces here! Major thanks to @zuev-stepan, @Sewlell, @pgsoar @Ph0rk0z.

04/05/2024: I finetuned giga330M with the TTS objective on Gigaspeech and 1/5 of Librilight. Weights are here. Make sure the total prompt + generation length is <= 16 seconds (due to our limited compute, we had to drop utterances longer than 16s from the training data). Even stronger models are forthcoming, stay tuned!
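Before running inference, it can help to sanity-check the 16-second budget against your own prompt. A minimal stdlib sketch, assuming a WAV prompt readable by Python's wave module (the helper name is ours, not part of the repo):

```python
import wave

def total_duration_ok(prompt_wav_path, planned_generation_seconds, limit_seconds=16.0):
    """Check that prompt duration plus planned generation fits the 16 s budget."""
    with wave.open(prompt_wav_path, "rb") as w:
        prompt_seconds = w.getnframes() / float(w.getframerate())
    return prompt_seconds + planned_generation_seconds <= limit_seconds
```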

03/28/2024: Model weights for giga330M and giga830M are up on HuggingFace🤗 here!


  • Codebase upload
  • Environment setup
  • Inference demo for speech editing and TTS
  • Training guidance
  • RealEdit dataset and training manifest
  • Model weights
  • Better guidance on training/finetuning
  • Colab notebooks
  • HuggingFace Spaces demo
  • Command line
  • Improve efficiency

QuickStart Colab

To try out speech editing or TTS inference with VoiceCraft, the simplest way is to use Google Colab. Instructions are on the Colab notebooks themselves.

  1. To try Speech Editing
  2. To try TTS Inference

QuickStart Command Line

To use it as a standalone script, first set up your environment, then check out the inference scripts. Without arguments, they run the standard demo used as an example elsewhere in this repository. You can use command line arguments to specify your own input audio, target transcript, and inference hyperparameters. Run the help command for more information: python3 -h

QuickStart Docker

To try out TTS inference with VoiceCraft, you can also use Docker. Thanks to @ubergarm and @jayc88 for making this happen.

Tested on Linux and Windows and should work with any host with docker installed.

# 1. clone the repo into a directory on a drive with plenty of free space
git clone
cd VoiceCraft

# 2. assumes you have docker installed with the NVIDIA Container Toolkit (windows has this built into the driver)
# sudo apt-get install -y nvidia-container-toolkit-base || yay -Syu nvidia-container-toolkit || echo etc...

# 3. First build the docker image
docker build --tag "voicecraft" .

# 4. Try to start an existing container otherwise create a new one passing in all GPUs
./  # linux
start-jupyter.bat   # windows

# 5. now open a webpage on the host box to the URL shown at the bottom of:
docker logs jupyter

# 6. optionally look inside from another terminal
docker exec -it jupyter /bin/bash
export USER=(your_linux_username_used_above)
export HOME=/home/$USER
sudo apt-get update

# 7. confirm video card(s) are visible inside container
nvidia-smi

# 8. Now in browser, open inference_tts.ipynb and work through one cell at a time

Environment setup

conda create -n voicecraft python=3.9.16
conda activate voicecraft

pip install -e git+
pip install xformers==0.0.22
pip install torchaudio==2.0.2 torch==2.0.1 # this assumes your system is compatible with CUDA 11.7, otherwise checkout
apt-get install ffmpeg # if you don't already have ffmpeg installed
apt-get install espeak-ng # backend for the phonemizer installed below
pip install tensorboard==2.16.2
pip install phonemizer==3.2.1
pip install datasets==2.16.0
pip install torchmetrics==0.11.1
pip install huggingface_hub==0.22.2
# install MFA for getting forced-alignment, this could take a few minutes
conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068
# install MFA english dictionary and model
mfa model download dictionary english_us_arpa
mfa model download acoustic english_us_arpa
# pip install huggingface_hub
# conda install pocl # above gives a warning about installing pocl, not sure if it's really needed

# to run ipynb
conda install -n voicecraft ipykernel --no-deps --force-reinstall

If you encounter version issues when running things, check environment.yml for exact version matches.

Inference Examples

Checkout inference_speech_editing.ipynb and inference_tts.ipynb


Run in colab

Open in Colab

Run locally

After environment setup, install additional dependencies:

apt-get install -y espeak espeak-data libespeak1 libespeak-dev
apt-get install -y festival*
apt-get install -y build-essential
apt-get install -y flac libasound2-dev libsndfile1-dev vorbis-tools
apt-get install -y libxml2-dev libxslt-dev zlib1g-dev
pip install -r gradio_requirements.txt

Run the gradio server from a terminal or via gradio_app.ipynb:


It is ready to use at the default URL (http://127.0.0.1:7860).

How to use it

  1. (optionally) Select models
  2. Load models
  3. Transcribe
  4. (optionally) Tweak some parameters
  5. Run
  6. (optionally) Rerun part-by-part in Long TTS mode

Some features

Smart transcript: write only what you want to generate

TTS mode: Zero-shot TTS

Edit mode: Speech editing

Long TTS mode: Easy TTS on long texts


To train a VoiceCraft model, you need to prepare the following parts:

  1. utterances and their transcripts
  2. the utterances encoded into codes using, e.g., Encodec
  3. the transcripts converted into phoneme sequences, plus a phoneme set (we name it vocab.txt)
  4. a manifest (i.e. metadata)

Steps 1, 2, and 3 are handled in ./data/, where:

  1. Gigaspeech is downloaded through HuggingFace. Note that you need to sign an agreement in order to download the dataset (it needs your auth token).
  2. Phoneme sequences and Encodec codes are also extracted using the script.
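Step 3's phoneme-set construction can be sketched as follows. This is a stdlib-only illustration with toy ARPAbet-style sequences; the actual script's vocab format may differ:

```python
def build_vocab(phoneme_sequences):
    """Collect the unique phonemes across all utterances into an indexed vocab."""
    phones = sorted({p for seq in phoneme_sequences for p in seq})
    return {phone: idx for idx, phone in enumerate(phones)}

# Two utterances already converted to phoneme sequences.
seqs = [["HH", "AH", "L", "OW"], ["W", "ER", "L", "D"]]
vocab = build_vocab(seqs)
# vocab.txt would then list one phoneme per line, in index order.
vocab_lines = [p for p, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
```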

An example run:

conda activate voicecraft
cd ./data
python \
--dataset_size xs \
--download_to path/to/store_huggingface_downloads \
--save_dir path/to/store_extracted_codes_and_phonemes \
--encodec_model_path path/to/encodec_model \
--mega_batch_size 120 \
--batch_size 32 \
--max_len 30000

where encodec_model_path is available here. This model is trained on Gigaspeech XL; it has 56M parameters and 4 codebooks, each with 2048 codes. Details are described in our paper. If you encounter OOM during extraction, try decreasing batch_size and/or max_len. The extracted codes, phonemes, and vocab.txt will be stored at path/to/store_extracted_codes_and_phonemes/${dataset_size}/{encodec_16khz_4codebooks,phonemes,vocab.txt}.
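For intuition on the --mega_batch_size and --batch_size flags above, here is a hypothetical stdlib sketch of how they could interact (not the extraction script's actual logic):

```python
def iter_batches(items, mega_batch_size, batch_size):
    """Split items into mega-batches (loaded/processed together), then
    yield fixed-size batches from within each mega-batch."""
    for i in range(0, len(items), mega_batch_size):
        mega_batch = items[i:i + mega_batch_size]
        for j in range(0, len(mega_batch), batch_size):
            yield mega_batch[j:j + batch_size]

# 10 utterances, mega-batches of 6, batches of 4:
batches = list(iter_batches(list(range(10)), mega_batch_size=6, batch_size=4))
# [[0, 1, 2, 3], [4, 5], [6, 7, 8, 9]]
```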

As for the manifest, please download train.txt and validation.txt from here, and put them under path/to/store_extracted_codes_and_phonemes/manifest/. Please also download vocab.txt from here if you want to use our pretrained VoiceCraft model (so that the phoneme-to-token matching is the same).

Now, you are good to start training!

conda activate voicecraft
cd ./z_scripts

The procedure for preparing your own custom dataset is the same.


You also need to do steps 1-4 as in Training, and I recommend using AdamW for optimization when finetuning a pretrained model, for better stability. Check out the script in ./z_scripts/

If your dataset introduces new phonemes (which is very likely) that don't exist in the giga checkpoint, make sure you combine the original phonemes with the phonemes from your data when constructing the vocab. You also need to adjust --text_vocab_size and --text_pad_token so that the former is greater than or equal to your vocab size, and the latter has the same value as --text_vocab_size (i.e. --text_pad_token is always the last token). Also, since the text embeddings are now of a different size, make sure you modify the weights-loading part so that it won't crash (you could skip loading text_embedding, or load only the existing part and randomly initialize the new entries).
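The vocab-merging step can be sketched as follows, a minimal stdlib-only illustration; the phoneme lists are toy examples, and the exact vocab.txt layout may differ:

```python
def merge_vocabs(base_phones, new_phones):
    """Append phonemes missing from the base vocab, preserving the base
    ordering so the pretrained checkpoint's phoneme-to-token mapping
    stays intact."""
    merged = list(base_phones)
    seen = set(base_phones)
    for p in new_phones:
        if p not in seen:
            merged.append(p)
            seen.add(p)
    return merged

base = ["AH", "B", "CH"]       # from the giga checkpoint's vocab.txt (toy example)
new = ["B", "ZH", "AH", "OY"]  # phonemes observed in your dataset
merged = merge_vocabs(base, new)   # ["AH", "B", "CH", "ZH", "OY"]

# --text_vocab_size must be >= len(merged); --text_pad_token equals --text_vocab_size.
text_vocab_size = len(merged)
text_pad_token = text_vocab_size
```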


The codebase is under CC BY-NC-SA 4.0 (LICENSE-CODE), and the model weights are under the Coqui Public Model License 1.0.0 (LICENSE-MODEL). Note that we use some code from other repositories under different licenses: ./models/ is under the MIT license; ./models/modules, ./steps/, and data/ are under the Apache License, Version 2.0; the phonemizer we use is under the GNU GPL 3.0 license.


We thank Feiteng for his VALL-E reproduction, and we thank the audiocraft team for open-sourcing Encodec.


@article{peng2024voicecraft,
  author    = {Peng, Puyuan and Huang, Po-Yao and Mohamed, Abdelrahman and Harwath, David},
  title     = {VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild},
  journal   = {arXiv},
  year      = {2024},
}

Any organization or individual is prohibited from using any technology mentioned in this paper to generate or edit someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.