Compare commits

...

14 Commits

Author SHA1 Message Date
Jay e06eb3aca6
Merge 19593c5ce0 into 013a21c70d 2024-05-07 00:43:03 -07:00
pyp_l40 013a21c70d Merge branch 'standalone' 2024-05-04 12:25:48 -05:00
pyp_l40 ef9d65433c improve automatic cutoff finding, delete editing script 2024-05-04 12:25:37 -05:00
Pranay Gosar 1a896d21fe adjust cut off sec and target transcript 2024-05-03 22:16:06 -05:00
pgosar 9fb6d948d0 add simple running instructions 2024-04-23 19:07:24 -05:00
Pranay Gosar 1850da9210 add short form commands 2024-04-23 18:55:34 -05:00
Pranay Gosar 59877c085e add speech editing 2024-04-23 18:38:09 -05:00
Pranay Gosar b8bb2ab592 add beam size cmd args 2024-04-23 15:25:43 -05:00
Pranay Gosar 63736f7269 add TTS 2024-04-23 13:01:44 -05:00
Pranay Gosar fc4de13071
Merge branch 'jasonppy:master' into standalone 2024-04-23 12:07:43 -05:00
pgosar 1e0eaeba2b add files 2024-04-17 16:32:50 -05:00
JSTayco 19593c5ce0 Changed cutoff timer to default to 3 seconds, overridable by user. 2024-04-08 21:07:30 -07:00
JSTayco 552d0bcd0d Many fixes and better PEP 8 formatting
Still seeing errors with audiocraft in data/tokenizer.py on macOS. Also, seeing an error with MFA corpus file not being found.
2024-04-07 19:46:12 -07:00
Jay 66049a2526 Added quick Python demo for users that may not have Jupyter. 2024-03-30 23:03:13 +00:00
4 changed files with 788 additions and 5 deletions

View File

@@ -13,6 +13,8 @@ There are three ways (besides running Gradio in Colab):
1. More flexible inference beyond Gradio UI in Google Colab. see [quickstart colab](#quickstart-colab)
2. with docker. see [quickstart docker](#quickstart-docker)
3. without docker. see [environment setup](#environment-setup). You can also run gradio locally if you choose this option
4. As a standalone script that you can easily integrate into other projects.
see [quickstart command line](#quickstart-command-line).
When you are inside the docker image or you have installed all dependencies, check out [`inference_tts.ipynb`](./inference_tts.ipynb).
@@ -21,7 +23,7 @@ If you want to do model development such as training/finetuning, I recommend fol
## News
:star: 04/22/2024: 330M/830M TTS Enhanced Models are up [here](https://huggingface.co/pyp1), load them through [`gradio_app.py`](./gradio_app.py) or [`inference_tts.ipynb`](./inference_tts.ipynb)! Replicate demo is up, major thanks to [@chenxwh](https://github.com/chenxwh)!
:star: 04/11/2024: VoiceCraft Gradio is now available on HuggingFace Spaces [here](https://huggingface.co/spaces/pyp1/VoiceCraft_gradio)! Major thanks to [@zuev-stepan](https://github.com/zuev-stepan), [@Sewlell](https://github.com/Sewlell), [@pgsoar](https://github.com/pgosar) [@Ph0rk0z](https://github.com/Ph0rk0z).
:star: 04/05/2024: I finetuned giga330M with the TTS objective on gigaspeech and 1/5 of librilight. Weights are [here](https://huggingface.co/pyp1/VoiceCraft/tree/main). Make sure maximal prompt + generation length <= 16 seconds (due to our limited compute, we had to drop utterances longer than 16s in training data). Even stronger models forthcoming, stay tuned!
@@ -37,11 +39,9 @@ If you want to do model development such as training/finetuning, I recommend fol
- [x] Better guidance on training/finetuning
- [x] Colab notebooks
- [x] HuggingFace Spaces demo
- [ ] Command line
- [x] Command line
- [ ] Improve efficiency
## QuickStart Colab
:star: To try out speech editing or TTS inference with VoiceCraft, the simplest way is to use Google Colab.
@@ -50,6 +50,15 @@ Instructions to run are on the Colab itself.
1. To try [Speech Editing](https://colab.research.google.com/drive/1FV7EC36dl8UioePY1xXijXTMl7X47kR_?usp=sharing)
2. To try [TTS Inference](https://colab.research.google.com/drive/1lch_6it5-JpXgAQlUTRRI2z2_rk5K67Z?usp=sharing)
## QuickStart Command Line
:star: To use VoiceCraft as a standalone script, check out tts_demo.py and speech_editing_demo.py.
Be sure to first [set up your environment](#environment-setup).
Run either script without arguments to reproduce the standard demo used as an example elsewhere
in this repository, or use the command-line arguments to specify your own input audio,
target transcripts, and inference hyperparameters, as in the example below. Run the help command for more information:
`python3 tts_demo.py -h`
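For example, a TTS run that spells out the reference audio, its transcript, and the text to synthesize (the values shown are simply the demo defaults built into `tts_demo.py`) might look like:
`python3 tts_demo.py -oa ./demo/5895_34622_000026_000002.wav -ot "Gwynplaine had, besides, for his work and for his feats of strength, round his neck and over his shoulders, an esclavine of leather." -tt "I cannot believe that the same model can also do text to speech synthesis too!" -co 3.6`
Generated audio is written to `./generated_tts` by default (override with `--output_dir`).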
## QuickStart Docker
:star: To try out TTS inference with VoiceCraft, you can also use docker. Thanks to [@ubergarm](https://github.com/ubergarm) and [@jayc88](https://github.com/jay-c88) for making this happen.
@@ -197,7 +206,7 @@ cd ./z_scripts
bash e830M.sh
```
It's the same procedure to prepare your own custom dataset. Make sure that if
## Finetuning
You also need to do steps 1-4 as in Training, and I recommend using AdamW for optimization if you finetune a pretrained model, for better stability. Check out the script `./z_scripts/e830M_ft.sh`.
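For intuition only, switching a plain PyTorch training setup to AdamW looks roughly like the sketch below; the model, learning rate, and weight decay here are placeholders, and the real configuration lives in the finetuning script above.

```python
import torch

model = torch.nn.Linear(4, 4)  # placeholder standing in for the pretrained model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,            # placeholder value; take the real one from ./z_scripts/e830M_ft.sh
    weight_decay=1e-2,  # AdamW's decoupled weight decay tends to keep finetuning stable
)
```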

View File

@@ -0,0 +1,106 @@
Begin,End,Label,Type,Speaker
0.04,0.58,gwynplaine,words,temp
0.58,0.94,had,words,temp
0.94,1.45,besides,words,temp
1.45,1.62,for,words,temp
1.62,1.86,his,words,temp
1.86,2.16,work,words,temp
2.16,2.31,and,words,temp
2.31,2.49,for,words,temp
2.49,2.71,his,words,temp
2.71,3.03,feats,words,temp
3.03,3.12,of,words,temp
3.12,3.61,strength,words,temp
3.95,4.25,round,words,temp
4.25,4.45,his,words,temp
4.45,4.7,neck,words,temp
4.7,4.81,and,words,temp
4.81,5.04,over,words,temp
5.04,5.22,his,words,temp
5.22,5.83,shoulders,words,temp
6.16,6.31,an,words,temp
6.41,7.15,esclavine,words,temp
7.15,7.29,of,words,temp
7.29,7.7,leather,words,temp
0.04,0.1,G,phones,temp
0.1,0.13,W,phones,temp
0.13,0.22,IH1,phones,temp
0.22,0.3,N,phones,temp
0.3,0.38,P,phones,temp
0.38,0.42,L,phones,temp
0.42,0.53,EY1,phones,temp
0.53,0.58,N,phones,temp
0.58,0.71,HH,phones,temp
0.71,0.86,AE1,phones,temp
0.86,0.94,D,phones,temp
0.94,0.97,B,phones,temp
0.97,1.01,IH0,phones,temp
1.01,1.14,S,phones,temp
1.14,1.34,AY1,phones,temp
1.34,1.4,D,phones,temp
1.4,1.45,Z,phones,temp
1.45,1.52,F,phones,temp
1.52,1.55,AO1,phones,temp
1.55,1.62,R,phones,temp
1.62,1.69,HH,phones,temp
1.69,1.76,IH1,phones,temp
1.76,1.86,Z,phones,temp
1.86,1.95,W,phones,temp
1.95,2.07,ER1,phones,temp
2.07,2.16,K,phones,temp
2.16,2.23,AH0,phones,temp
2.23,2.26,N,phones,temp
2.26,2.31,D,phones,temp
2.31,2.38,F,phones,temp
2.38,2.41,AO1,phones,temp
2.41,2.49,R,phones,temp
2.49,2.55,HH,phones,temp
2.55,2.62,IH1,phones,temp
2.62,2.71,Z,phones,temp
2.71,2.8,F,phones,temp
2.8,2.9,IY1,phones,temp
2.9,2.98,T,phones,temp
2.98,3.03,S,phones,temp
3.03,3.07,AH0,phones,temp
3.07,3.12,V,phones,temp
3.12,3.2,S,phones,temp
3.2,3.26,T,phones,temp
3.26,3.32,R,phones,temp
3.32,3.39,EH1,phones,temp
3.39,3.48,NG,phones,temp
3.48,3.53,K,phones,temp
3.53,3.61,TH,phones,temp
3.95,4.03,R,phones,temp
4.03,4.16,AW1,phones,temp
4.16,4.21,N,phones,temp
4.21,4.25,D,phones,temp
4.25,4.29,HH,phones,temp
4.29,4.36,IH1,phones,temp
4.36,4.45,Z,phones,temp
4.45,4.53,N,phones,temp
4.53,4.62,EH1,phones,temp
4.62,4.7,K,phones,temp
4.7,4.74,AH0,phones,temp
4.74,4.77,N,phones,temp
4.77,4.81,D,phones,temp
4.81,4.92,OW1,phones,temp
4.92,4.97,V,phones,temp
4.97,5.04,ER0,phones,temp
5.04,5.11,HH,phones,temp
5.11,5.18,IH1,phones,temp
5.18,5.22,Z,phones,temp
5.22,5.34,SH,phones,temp
5.34,5.47,OW1,phones,temp
5.47,5.51,L,phones,temp
5.51,5.58,D,phones,temp
5.58,5.71,ER0,phones,temp
5.71,5.83,Z,phones,temp
6.16,6.23,AE1,phones,temp
6.23,6.31,N,phones,temp
6.41,7.15,spn,phones,temp
7.15,7.21,AH0,phones,temp
7.21,7.29,V,phones,temp
7.29,7.36,L,phones,temp
7.36,7.44,EH1,phones,temp
7.44,7.49,DH,phones,temp
7.49,7.7,ER0,phones,temp
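The file above is a Montreal Forced Aligner output in CSV form: one row per word or phone with its begin and end times in seconds, the label, the tier (`words` or `phones`), and the speaker. As a minimal sketch (a simplified version of the `find_closest_word_boundary` helper in `tts_demo.py` further down in this diff), such a file can be used to snap a requested prompt cutoff to a word boundary:

```python
import csv

def first_word_end_after(alignment_csv: str, cut_off_sec: float) -> float:
    """Return the end time of the first word that ends at or after cut_off_sec."""
    with open(alignment_csv, newline="") as f:
        for row in csv.DictReader(f):
            # rows are sorted by time within each tier; only the word tier matters here
            if row["Type"] == "words" and float(row["End"]) >= cut_off_sec:
                return float(row["End"])
    raise ValueError("no word ends after the requested cutoff")

# With the alignment above, a requested cutoff of 3.5 s snaps to 3.61 s (end of "strength").
```

The real helper additionally prefers a boundary followed by a small silence margin before the next word, and falls back to the closest boundary found within a tolerance.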

452
inference_tts.py Normal file
View File

@@ -0,0 +1,452 @@
#!/usr/bin/env python3
import os
import shutil
import subprocess
import sys
import argparse
import importlib
from data.tokenizer import TextTokenizer, AudioTokenizer
# The following requirements are for VoiceCraft inside inference_tts_scale.py
try:
import torch
import torchaudio
import torchmetrics
import numpy
import tqdm
import phonemizer
import audiocraft
except ImportError:
print(
"Pre-reqs not found. Installing numpy, torch, and audio dependencies.")
subprocess.run(
["pip", "install", "numpy", "torch==2.0.1", "torchaudio",
"torchmetrics", "tqdm", "phonemizer"])
subprocess.run(["pip", "install", "-e",
"git+https://github.com/facebookresearch/audiocraft.git"
"@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft"])
from inference_tts_scale import inference_one_sample
from models import voicecraft
description = """
VoiceCraft Inference Text-to-Speech Demo
This script demonstrates how to use the VoiceCraft model for text-to-speech synthesis.
Prerequisites:
- Python 3.9.16
- Conda (https://docs.conda.io/en/latest/miniconda.html)
- FFmpeg
- eSpeak NG
Usage:
1. Prepare an audio file and its corresponding transcript.
2. Run the script with the required command-line arguments:
python inference_tts.py --audio <path_to_audio_file> --transcript <path_to_transcript_file>
3. The generated audio files will be saved in the `./demo/generated_tts` directory.
Notes:
- The script will download the required models automatically if they are not found in the `./pretrained_models` directory.
- You can adjust the hyperparameters using command-line arguments to fine-tune the text-to-speech synthesis.
"""
def is_tool(name):
"""Check whether `name` is on PATH and marked as executable."""
return shutil.which(name) is not None
def run_command(command, error_message):
if command[0] == "source":
# Handle the 'source' command separately using os.system()
status = os.system(" ".join(command))
if status != 0:
print(error_message)
sys.exit(1)
else:
try:
subprocess.run(command, check=True)
except subprocess.CalledProcessError as e:
print(f"Error: {e}")
print(error_message)
sys.exit(1)
def install_linux_dependencies():
if is_tool("apt-get"):
# Debian, Ubuntu, and derivatives
run_command(["sudo", "apt-get", "update"],
"Failed to update package lists.")
run_command(["sudo", "apt-get", "install", "-y", "git-core", "ffmpeg",
"espeak-ng"],
"Failed to install Linux dependencies.")
elif is_tool("pacman"):
# Arch Linux and derivatives
run_command(["sudo", "pacman", "-Syu", "--noconfirm", "git", "ffmpeg",
"espeak-ng"],
"Failed to install Linux dependencies.")
elif is_tool("dnf"):
# Fedora and derivatives
run_command(
["sudo", "dnf", "install", "-y", "git", "ffmpeg", "espeak-ng"],
"Failed to install Linux dependencies.")
elif is_tool("yum"):
# CentOS and derivatives
run_command(
["sudo", "yum", "install", "-y", "git", "ffmpeg", "espeak-ng"],
"Failed to install Linux dependencies.")
else:
print(
"Error: Unsupported Linux distribution. Please install the dependencies manually.")
sys.exit(1)
def install_macos_dependencies():
if is_tool("brew"):
packages = ["git", "ffmpeg", "espeak", "anaconda"]
missing_packages = [package for package in packages if
not is_tool(package)]
if missing_packages:
run_command(["brew", "install"] + missing_packages,
"Failed to install missing macOS dependencies.")
else:
print("All required packages are already installed.")
# Add Anaconda bin directory to PATH
anaconda_bin_path = "/opt/homebrew/anaconda3/bin"
os.environ["PATH"] = f"{anaconda_bin_path}:{os.environ['PATH']}"
# Update the shell configuration file (e.g., .bash_profile or .zshrc)
shell_config_file = os.path.expanduser(
"~/.bash_profile") # or "~/.zshrc" for zsh
with open(shell_config_file, "a") as file:
file.write(f'\nexport PATH="{anaconda_bin_path}:$PATH"\n')
else:
print(
"Error: Homebrew not found. Please install Homebrew and try again.")
sys.exit(1)
def install_dependencies():
if sys.platform == "win32":
print(description)
print("Please install the required dependencies manually on Windows.")
sys.exit(1)
elif sys.platform == "darwin":
install_macos_dependencies()
elif sys.platform.startswith("linux"):
install_linux_dependencies()
else:
print(f"Unsupported platform: {sys.platform}")
sys.exit(1)
def install_conda_dependencies():
conda_packages = [
"montreal-forced-aligner=2.2.17",
"openfst=1.8.2",
"kaldi=5.5.1068"
]
run_command(
["conda", "install", "-y", "-c", "conda-forge", "--solver",
"classic"] + conda_packages,
"Failed to install Conda packages.")
def create_conda_environment():
run_command(["conda", "create", "-y", "-n", "voicecraft", "python=3.9.16",
"--solver", "classic"],
"Failed to create Conda environment.")
# Initialize Conda for the current shell session
conda_init_command = 'eval "$(conda shell.bash hook)"'
os.system(conda_init_command)
bashrc_path = os.path.expanduser("~/.bashrc")
if os.path.exists(bashrc_path):
run_command(["source", bashrc_path],
"Failed to source .bashrc.")
else:
print("Warning: ~/.bashrc not found. Skipping sourcing.")
# Activate the Conda environment
activate_command = f"conda activate voicecraft"
os.system(activate_command)
# Install any required dependencies in Conda env
install_conda_dependencies()
def install_python_dependencies():
pip_packages = [
"torch==2.0.1",
"tensorboard==2.16.2",
"phonemizer==3.2.1",
"torchaudio==2.0.2",
"datasets==2.16.0",
"torchmetrics==0.11.1"
]
run_command(["pip", "install"] + pip_packages,
"Failed to install Python packages.")
run_command(["pip", "install", "-e",
"git+https://github.com/facebookresearch/audiocraft.git"
"@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft"],
"Failed to install audiocraft package.")
def download_models(ckpt_fn, encodec_fn):
if not os.path.exists(ckpt_fn):
run_command(["wget",
f"https://huggingface.co/pyp1/VoiceCraft/resolve/main/{os.path.basename(ckpt_fn)}?download=true"],
f"Failed to download {ckpt_fn}.")
run_command(
["mv", f"{os.path.basename(ckpt_fn)}?download=true", ckpt_fn],
f"Failed to move {ckpt_fn}.")
if not os.path.exists(encodec_fn):
run_command(["wget",
"https://huggingface.co/pyp1/VoiceCraft/resolve/main/encodec_4cb2048_giga.th"],
f"Failed to download {encodec_fn}.")
run_command(["mv", "encodec_4cb2048_giga.th", encodec_fn],
f"Failed to move {encodec_fn}.")
def check_python_dependencies():
dependencies = [
"torch",
"torchaudio",
"data.tokenizer",
"models.voicecraft",
"inference_tts_scale",
"audiocraft",
"phonemizer",
"tensorboard"
]
missing_dependencies = []
for dependency in dependencies:
try:
importlib.import_module(dependency)
except ImportError:
missing_dependencies.append(dependency)
if missing_dependencies:
print("Missing Python dependencies:", missing_dependencies)
install_python_dependencies()
def parse_arguments():
parser = argparse.ArgumentParser(description=description,
formatter_class=argparse.RawTextHelpFormatter)
parser.add_argument("-a", "--audio", required=True,
help="Path to the input audio file used as a "
"reference for the voice.")
parser.add_argument("-t", "--transcript", required=True,
help="Path to the text file containing the transcript "
"to be synthesized.")
parser.add_argument("--skip-install", "-s", action="store_true",
help="Skip the installation of prerequisites.")
parser.add_argument("--output_dir", default="./demo/generated_tts",
help="Output directory where the generated audio "
"files will be saved. Default: "
"'./demo/generated_tts'")
parser.add_argument("--cut-off-sec", type=float, default=3.0,
help="Cut-off time in seconds for the audio prompt ("
"hundredths of a second are acceptable). "
"Default: 3.0")
parser.add_argument("--left_margin", type=float, default=0.08,
help="Left margin of the audio segment used for "
"speech editing. This is not used for "
"text-to-speech synthesis. Default: 0.08")
parser.add_argument("--right_margin", type=float, default=0.08,
help="Right margin of the audio segment used for "
"speech editing. This is not used for "
"text-to-speech synthesis. Default: 0.08")
parser.add_argument("--codec_audio_sr", type=int, default=16000,
help="Sample rate of the audio codec used for "
"encoding and decoding. Default: 16000")
parser.add_argument("--codec_sr", type=int, default=50,
help="Sample rate of the codec used for encoding and "
"decoding. Default: 50")
parser.add_argument("--top_k", type=int, default=0,
help="Top-k sampling parameter. It limits the number "
"of highest probability tokens to consider "
"during generation. A higher value (e.g., "
"50) will result in more diverse but potentially "
"less coherent speech, while a lower value ("
"e.g., 1) will result in more conservative and "
"repetitive speech. Setting it to 0 disables "
"top-k sampling. Default: 0")
parser.add_argument("--top_p", type=float, default=0.8,
help="Top-p sampling parameter. It controls the "
"diversity of the generated audio by truncating "
"the least likely tokens whose cumulative "
"probability exceeds 'p'. Lower values (e.g., "
"0.5) will result in more conservative and "
"repetitive speech, while higher values (e.g., "
"0.9) will result in more diverse speech. "
"Default: 0.8")
parser.add_argument("--temperature", type=float, default=1.0,
help="Sampling temperature. It controls the "
"randomness of the generated speech. Higher "
"values (e.g., 1.5) will result in more "
"expressive and varied speech, while lower "
"values (e.g., 0.5) will result in more "
"monotonous and conservative speech. Default: 1.0")
parser.add_argument("--kvcache", type=int, default=1,
help="Key-value cache size used for caching "
"intermediate results. A larger cache size may "
"improve performance but consume more memory. "
"Default: 1")
parser.add_argument("--seed", type=int, default=1,
help="Random seed for reproducibility. Use the same "
"seed value to generate the same output for a "
"given input. Default: 1")
parser.add_argument("--stop_repetition", type=int, default=3,
help="Stop repetition threshold. It controls the "
"number of consecutive repetitions allowed in "
"the generated speech. Lower values (e.g., "
"1 or 2) will result in less repetitive speech "
"but may also lead to abrupt stopping. Higher "
"values (e.g., 4 or 5) will allow more "
"repetitions. Default: 3")
parser.add_argument("--sample_batch_size", type=int, default=4,
help="Number of audio samples generated in parallel. "
"Increasing this value may improve the quality "
"of the generated speech by reducing long "
"silences or unnaturally stretched words, "
"but it will also increase memory usage. "
"Default: 4")
return parser.parse_args()
def main():
args = parse_arguments()
if not args.skip_install:
install_dependencies()
create_conda_environment()
check_python_dependencies()
orig_audio = args.audio
orig_transcript = args.transcript
output_dir = args.output_dir
# Create the output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)
# Hyperparameters for inference
left_margin = args.left_margin
right_margin = args.right_margin
codec_audio_sr = args.codec_audio_sr
codec_sr = args.codec_sr
top_k = args.top_k
top_p = args.top_p
temperature = args.temperature
kvcache = args.kvcache
silence_tokens = [1388, 1898, 131]
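# codec token IDs treated as silence by the model (tts_demo.py exposes these via --silence_tokens)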
seed = args.seed
stop_repetition = args.stop_repetition
sample_batch_size = args.sample_batch_size
# Set the device based on available hardware
if torch.cuda.is_available():
device = "cuda"
elif sys.platform == "darwin" and torch.backends.mps.is_available():
device = "mps"
else:
device = "cpu"
# Move audio and transcript to temp folder
temp_folder = "./demo/temp"
os.makedirs(temp_folder, exist_ok=True)
subprocess.run(["cp", orig_audio, temp_folder])
filename = os.path.splitext(os.path.basename(orig_audio))[0]
with open(f"{temp_folder}/{filename}.txt", "w") as f:
f.write(orig_transcript)
# Run MFA to get the alignment
align_temp = f"{temp_folder}/mfa_alignments"
os.makedirs(align_temp, exist_ok=True)
subprocess.run(
["mfa", "model", "download", "dictionary", "english_us_arpa"])
subprocess.run(["mfa", "model", "download", "acoustic", "english_us_arpa"])
subprocess.run(
["mfa", "align", "-v", "--clean", "-j", "1", "--output_format", "csv",
temp_folder, "english_us_arpa", "english_us_arpa", align_temp])
audio_fn = f"{temp_folder}/{os.path.basename(orig_audio)}"
transcript_fn = f"{temp_folder}/{filename}.txt"
align_fn = f"{align_temp}/{filename}.csv"
# Decide which part of the audio to use as prompt based on forced alignment
cut_off_sec = args.cut_off_sec
info = torchaudio.info(audio_fn)
audio_dur = info.num_frames / info.sample_rate
assert cut_off_sec < audio_dur, f"cut_off_sec {cut_off_sec} is larger than the audio duration {audio_dur}"
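# the prompt length is measured in waveform samples: cut_off_sec seconds at the file's sample rate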
prompt_end_frame = int(cut_off_sec * info.sample_rate)
# Load model, tokenizer, and other necessary files
voicecraft_name = "giga830M.pth"
ckpt_fn = f"./pretrained_models/{voicecraft_name}"
encodec_fn = "./pretrained_models/encodec_4cb2048_giga.th"
if not os.path.exists(ckpt_fn):
subprocess.run(["wget",
f"https://huggingface.co/pyp1/VoiceCraft/resolve/main/{voicecraft_name}?download=true"])
subprocess.run(["mv", f"{voicecraft_name}?download=true",
f"./pretrained_models/{voicecraft_name}"])
if not os.path.exists(encodec_fn):
subprocess.run(["wget",
"https://huggingface.co/pyp1/VoiceCraft/resolve/main/encodec_4cb2048_giga.th"])
subprocess.run(["mv", "encodec_4cb2048_giga.th",
"./pretrained_models/encodec_4cb2048_giga.th"])
ckpt = torch.load(ckpt_fn, map_location="cpu")
model = voicecraft.VoiceCraft(ckpt["config"])
model.load_state_dict(ckpt["model"])
model.to(device)
model.eval()
phn2num = ckpt['phn2num']
text_tokenizer = TextTokenizer(backend="espeak")
audio_tokenizer = AudioTokenizer(
signature=encodec_fn) # will also put the neural codec model on gpu
# Run the model to get the output
decode_config = {
'top_k': top_k, 'top_p': top_p, 'temperature': temperature,
'stop_repetition': stop_repetition, 'kvcache': kvcache,
"codec_audio_sr": codec_audio_sr, "codec_sr": codec_sr,
"silence_tokens": silence_tokens, "sample_batch_size": sample_batch_size
}
concated_audio, gen_audio = inference_one_sample(
model, ckpt["config"], phn2num, text_tokenizer, audio_tokenizer,
audio_fn, transcript_fn, device, decode_config, prompt_end_frame
)
# Save segments for comparison
concated_audio, gen_audio = concated_audio[0].cpu(), gen_audio[0].cpu()
# Save the audio
seg_save_fn_gen = os.path.join(output_dir,
f"{os.path.basename(orig_audio)[:-4]}_gen_seed{seed}.wav")
seg_save_fn_concat = os.path.join(output_dir,
f"{os.path.basename(orig_audio)[:-4]}_concat_seed{seed}.wav")
torchaudio.save(seg_save_fn_gen, gen_audio, codec_audio_sr)
torchaudio.save(seg_save_fn_concat, concated_audio, codec_audio_sr)
if __name__ == "__main__":
main()

216
tts_demo.py Normal file
View File

@@ -0,0 +1,216 @@
"""
This script will allow you to run TTS inference with Voicecraft
Before getting started, be sure to follow the environment setup.
"""
from inference_tts_scale import inference_one_sample
from models import voicecraft
from data.tokenizer import (
AudioTokenizer,
TextTokenizer,
)
import argparse
import random
import numpy as np
import torchaudio
import torch
import os
os.environ["USER"] = "me" # TODO change this to your username
device = "cuda" if torch.cuda.is_available() else "cpu"
def parse_arguments():
parser = argparse.ArgumentParser(
description="VoiceCraft TTS Inference: see the script for more information on the options")
parser.add_argument("-m", "--model_name", type=str, default="giga830M", choices=[
"giga330M", "giga830M", "giga330M_TTSEnhanced", "giga830M_TTSEnhanced"],
help="VoiceCraft model to use")
parser.add_argument("-st", "--silence_tokens", type=int, nargs="*",
default=[1388, 1898, 131], help="Silence token IDs")
parser.add_argument("-casr", "--codec_audio_sr", type=int,
default=16000, help="Codec audio sample rate.")
parser.add_argument("-csr", "--codec_sr", type=int, default=50,
help="Codec sample rate.")
parser.add_argument("-k", "--top_k", type=float,
default=0, help="Top k value.")
parser.add_argument("-p", "--top_p", type=float,
default=0.8, help="Top p value.")
parser.add_argument("-t", "--temperature", type=float,
default=1, help="Temperature value.")
parser.add_argument("-kv", "--kvcache", type=float, choices=[0, 1],
default=0, help="Kvcache value.")
parser.add_argument("-sr", "--stop_repetition", type=int,
default=-1, help="Stop repetition for generation")
parser.add_argument("--sample_batch_size", type=int,
default=3, help="Batch size for sampling")
parser.add_argument("-s", "--seed", type=int,
default=1, help="Seed value.")
parser.add_argument("-bs", "--beam_size", type=int, default=50,
help="beam size for MFA alignment")
parser.add_argument("-rbs", "--retry_beam_size", type=int, default=200,
help="retry beam size for MFA alignment")
parser.add_argument("--output_dir", type=str, default="./generated_tts",
help="directory to save generated audio")
parser.add_argument("-oa", "--original_audio", type=str,
default="./demo/5895_34622_000026_000002.wav", help="location of audio file")
parser.add_argument("-ot", "--original_transcript", type=str,
default="Gwynplaine had, besides, for his work and for his feats of strength, round his neck and over his shoulders, an esclavine of leather.",
help="original transcript")
parser.add_argument("-tt", "--target_transcript", type=str,
default="I cannot believe that the same model can also do text to speech synthesis too!",
help="target transcript")
parser.add_argument("-co", "--cut_off_sec", type=float, default=3.6,
help="cut off point in seconds for input prompt")
parser.add_argument("-ma", "--margin", type=float, default=0.04,
help="margin in seconds between the end of the cutoff words and the start of the next word. If the next word is not immediately following the cutoff word, the algorithm is more tolerant to word alignment errors")
parser.add_argument("-cuttol", "--cutoff_tolerance", type=float, default=1, help="tolerance in seconds for the cutoff time, if given cut_off_sec plus the tolerance, we still are not able to find the next word, we will use the best cutoff time found, i.e. likely no margin or very small margin between the end of the cutoff word and the start of the next word")
args = parser.parse_args()
return args
args = parse_arguments()
voicecraft_name = args.model_name
# hyperparameters for inference
codec_audio_sr = args.codec_audio_sr
codec_sr = args.codec_sr
top_k = args.top_k
top_p = args.top_p # defaults to 0.8, can also try 0.9
temperature = args.temperature
silence_tokens = args.silence_tokens
kvcache = args.kvcache # NOTE if OOM, change this to 0, or try the 330M model
# NOTE adjust the three arguments below if the generation quality is not good enough
# NOTE if the model generates long silences, reduce stop_repetition to 3, 2 or even 1
stop_repetition = args.stop_repetition
# NOTE: if there are long silences or unnaturally stretched words, increase
# sample_batch_size to 4 or higher. The model will then generate sample_batch_size
# samples for the same input and keep the one that is shortest.
# So if the speech rate of the generated audio is too fast, change it to a smaller number.
sample_batch_size = args.sample_batch_size
seed = args.seed # change seed if you are still unhappy with the result
# load the model
if voicecraft_name == "330M":
voicecraft_name = "giga330M"
elif voicecraft_name == "830M":
voicecraft_name = "giga830M"
elif voicecraft_name == "330M_TTSEnhanced":
voicecraft_name = "330M_TTSEnhanced"
elif voicecraft_name == "830M_TTSEnhanced":
voicecraft_name = "830M_TTSEnhanced"
model = voicecraft.VoiceCraft.from_pretrained(
f"pyp1/VoiceCraft_{voicecraft_name.replace('.pth', '')}")
phn2num = model.args.phn2num
config = vars(model.args)
model.to(device)
encodec_fn = "./pretrained_models/encodec_4cb2048_giga.th"
if not os.path.exists(encodec_fn):
os.system(
f"wget https://huggingface.co/pyp1/VoiceCraft/resolve/main/encodec_4cb2048_giga.th -O ./pretrained_models/encodec_4cb2048_giga.th")
# will also put the neural codec model on gpu
audio_tokenizer = AudioTokenizer(signature=encodec_fn, device=device)
text_tokenizer = TextTokenizer(backend="espeak")
# Prepare your audio
# point to the original audio whose speech you want to clone
# write down the transcript for the file, or run whisper to get the transcript (and you can modify it if it's not accurate), save it as a .txt file
orig_audio = args.original_audio
orig_transcript = args.original_transcript
# move the audio and transcript to temp folder
temp_folder = "./demo/temp"
os.makedirs(temp_folder, exist_ok=True)
os.system(f"cp {orig_audio} {temp_folder}")
filename = os.path.splitext(orig_audio.split("/")[-1])[0]
with open(f"{temp_folder}/{filename}.txt", "w") as f:
f.write(orig_transcript)
# run MFA to get the alignment
align_temp = f"{temp_folder}/mfa_alignments"
beam_size = args.beam_size
retry_beam_size = args.retry_beam_size
alignments = f"{temp_folder}/mfa_alignments/{filename}.csv"
if not os.path.isfile(alignments):
os.system(f"mfa align -v --clean -j 1 --output_format csv {temp_folder} \
english_us_arpa english_us_arpa {align_temp} --beam {beam_size} --retry_beam {retry_beam_size}")
# if the above fails, it could be because the audio is too hard for the alignment model,
# increasing the beam_size and retry_beam_size usually solves the issue
def find_closest_word_boundary(alignments, cut_off_sec, margin, cutoff_tolerance = 1):
with open(alignments, 'r') as file:
# skip header
next(file)
cutoff_time = None
cutoff_index = None
cutoff_time_best = None
cutoff_index_best = None
lines = [l for l in file.readlines()]
for i, line in enumerate(lines):
end = float(line.strip().split(',')[1])
if end >= cut_off_sec and cutoff_time == None:
cutoff_time = end
cutoff_index = i
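# within cutoff_tolerance of the requested time, prefer a word followed by at least
# `margin` seconds of silence; place the cutoff two-thirds of the margin past the word end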
if end >= cut_off_sec and end < cut_off_sec + cutoff_tolerance and float(lines[i+1].strip().split(',')[0]) - end >= margin:
cutoff_time_best = end + margin * 2 / 3
cutoff_index_best = i
break
if cutoff_time_best != None:
cutoff_time = cutoff_time_best
cutoff_index = cutoff_index_best
return cutoff_time, cutoff_index
# take a look at demo/temp/mfa_alignments, decide which part of the audio to use as the prompt
# NOTE: according to the forced-alignment file demo/temp/mfa_alignments/5895_34622_000026_000002.csv, the word "strength" stops at 3.561 sec, so we use the first 3.6 sec as the prompt. This will be different for different audio.
cut_off_sec = args.cut_off_sec
margin = args.margin
audio_fn = f"{temp_folder}/{filename}.wav"
cut_off_sec, cut_off_word_idx = find_closest_word_boundary(alignments, cut_off_sec, margin, args.cutoff_tolerance)
target_transcript = " ".join(orig_transcript.split(" ")[:cut_off_word_idx+1]) + " " + args.target_transcript
# NOTE: 3 sec of reference is generally enough for high quality voice cloning, but longer is generally better, try e.g. 3~6 sec.
info = torchaudio.info(audio_fn)
audio_dur = info.num_frames / info.sample_rate
assert cut_off_sec < audio_dur, f"cut_off_sec {cut_off_sec} is larger than the audio duration {audio_dur}"
prompt_end_frame = int(cut_off_sec * info.sample_rate)
def seed_everything(seed):
os.environ['PYTHONHASHSEED'] = str(seed)
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
seed_everything(seed)
# inference
decode_config = {'top_k': top_k, 'top_p': top_p, 'temperature': temperature, 'stop_repetition': stop_repetition, 'kvcache': kvcache,
"codec_audio_sr": codec_audio_sr, "codec_sr": codec_sr, "silence_tokens": silence_tokens, "sample_batch_size": sample_batch_size}
concated_audio, gen_audio = inference_one_sample(model, argparse.Namespace(
**config), phn2num, text_tokenizer, audio_tokenizer, audio_fn, target_transcript, device, decode_config, prompt_end_frame)
# save segments for comparison
concated_audio, gen_audio = concated_audio[0].cpu(), gen_audio[0].cpu()
# logging.info(f"length of the resynthesize orig audio: {orig_audio.shape}")
# save the audio
# output_dir
output_dir = args.output_dir
os.makedirs(output_dir, exist_ok=True)
seg_save_fn_gen = f"{output_dir}/{os.path.basename(audio_fn)[:-4]}_gen_seed{seed}.wav"
seg_save_fn_concat = f"{output_dir}/{os.path.basename(audio_fn)[:-4]}_concat_seed{seed}.wav"
torchaudio.save(seg_save_fn_gen, gen_audio, codec_audio_sr)
torchaudio.save(seg_save_fn_concat, concated_audio, codec_audio_sr)
# you might get warnings like WARNING:phonemizer:words count mismatch on 300.0% of the lines (3/1), this can be safely ignored