a little massage

Commit: ad6c2cd836 (parent: b818145ad9)
Author: jason-on-salt-a40
Date: 2024-04-11 07:17:28 -07:00
4 changed files with 26 additions and 22 deletions

@@ -1,6 +1,5 @@
 # VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
-[Demo](https://jasonppy.github.io/VoiceCraft_web) [Paper](https://jasonppy.github.io/assets/pdfs/VoiceCraft.pdf)
+[![Paper](https://img.shields.io/badge/arXiv-2301.12503-brightgreen.svg?style=flat-square)](https://jasonppy.github.io/assets/pdfs/VoiceCraft.pdf) [![githubio](https://img.shields.io/badge/GitHub.io-Audio_Samples-blue?logo=Github&style=flat-square)](https://jasonppy.github.io/VoiceCraft_web/) [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/pyp1/VoiceCraft_gradio) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1IOjpglQyMTO2C3Y94LD9FY0Ocn-RJRg6?usp=sharing)
 ### TL;DR
 VoiceCraft is a token infilling neural codec language model, that achieves state-of-the-art performance on both **speech editing** and **zero-shot text-to-speech (TTS)** on in-the-wild data including audiobooks, internet videos, and podcasts.
@@ -8,20 +7,22 @@ VoiceCraft is a token infilling neural codec language model, that achieves state
 To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference.
 ## How to run inference
-There are three ways:
-1. with Google Colab. see [quickstart colab](#quickstart-colab)
+There are three ways (besides running Gradio in Colab):
+1. More flexible inference beyond the Gradio UI in Google Colab. See [quickstart colab](#quickstart-colab)
 2. with docker. see [quickstart docker](#quickstart-docker)
-3. without docker. see [environment setup](#environment-setup)
+3. without docker. see [environment setup](#environment-setup). You can also run Gradio locally if you choose this option
 When you are inside the docker image or you have installed all dependencies, check out [`inference_tts.ipynb`](./inference_tts.ipynb).
 If you want to do model development such as training/finetuning, I recommend following [environment setup](#environment-setup) and [training](#training).
 ## News
-:star: 03/28/2024: Model weights for giga330M and giga830M are up on HuggingFace🤗 [here](https://huggingface.co/pyp1/VoiceCraft/tree/main)!
-:star: 04/05/2024: I finetuned giga330M with the TTS objective on gigaspeech and 1/5 of librilight, the model outperforms giga830M on TTS. Weights are [here](https://huggingface.co/pyp1/VoiceCraft/tree/main). Make sure maximal prompt + generation length <= 16 seconds (due to our limited compute, we had to drop utterances longer than 16s in training data)
+:star: 04/11/2024: VoiceCraft Gradio is now available on HuggingFace Spaces [here](https://huggingface.co/spaces/pyp1/VoiceCraft_gradio)! Major thanks to [@zuev-stepan](https://github.com/zuev-stepan), [@Sewlell](https://github.com/Sewlell), [@pgsoar](https://github.com/pgosar), [@Ph0rk0z](https://github.com/Ph0rk0z).
+:star: 04/05/2024: I finetuned giga330M with the TTS objective on gigaspeech and 1/5 of librilight. Weights are [here](https://huggingface.co/pyp1/VoiceCraft/tree/main). Make sure maximal prompt + generation length <= 16 seconds (due to our limited compute, we had to drop utterances longer than 16s in training data). Even stronger models forthcoming, stay tuned!
+:star: 03/28/2024: Model weights for giga330M and giga830M are up on HuggingFace🤗 [here](https://huggingface.co/pyp1/VoiceCraft/tree/main)!
 ## TODO
 - [x] Codebase upload
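The 16-second cap mentioned in the 04/05 news entry above (prompt + generation must fit inside 16 s for the TTS-enhanced model) can be checked before running inference. A minimal sketch using only the standard library; the helper name and structure are ours, not from the repo:

```python
import wave

# The finetuned model dropped training utterances longer than 16 s,
# so prompt duration plus requested generation should stay under this.
MAX_TOTAL_SECONDS = 16.0

def check_length(prompt_wav: str, target_generation_seconds: float) -> bool:
    """Return True if prompt duration + requested generation fits the 16 s cap."""
    with wave.open(prompt_wav, "rb") as w:
        prompt_seconds = w.getnframes() / w.getframerate()
    return prompt_seconds + target_generation_seconds <= MAX_TOTAL_SECONDS
```

With a 3-second prompt, for example, this leaves at most 13 seconds of generated speech before quality becomes unpredictable.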
@@ -30,9 +31,12 @@ If you want to do model development such as training/finetuning, I recommend fol
 - [x] Training guidance
 - [x] RealEdit dataset and training manifest
 - [x] Model weights (giga330M.pth, giga830M.pth, and gigaHalfLibri330M_TTSEnhanced_max16s.pth)
-- [x] Write colab notebooks for better hands-on experience
-- [ ] HuggingFace Spaces demo
-- [ ] Better guidance on training/finetuning
+- [x] Better guidance on training/finetuning
+- [x] Colab notebooks
+- [x] HuggingFace Spaces demo
+- [ ] Command line
+- [ ] Improve efficiency

 ## QuickStart Colab
@@ -109,7 +113,7 @@ Checkout [`inference_speech_editing.ipynb`](./inference_speech_editing.ipynb) an
 ## Gradio
 ### Run in colab
-[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zuev-stepan/VoiceCraft-gradio/blob/feature/colab-notebook/voicecraft-gradio-colab.ipynb)
+[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1IOjpglQyMTO2C3Y94LD9FY0Ocn-RJRg6?usp=sharing)
 ### Run locally
 After environment setup, install additional dependencies:

BIN demo/pam.wav (new file; binary content not shown)

@@ -13,7 +13,7 @@ import random
 import uuid
-DEMO_PATH = os.getenv("DEMO_PATH", ".demo")
+DEMO_PATH = os.getenv("DEMO_PATH", "./demo")
 TMP_PATH = os.getenv("TMP_PATH", "./demo/temp")
 MODELS_PATH = os.getenv("MODELS_PATH", "./pretrained_models")
 device = "cuda" if torch.cuda.is_available() else "cpu"
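The path constants above resolve through `os.getenv`, so an environment variable always wins over the corrected `./`-relative fallback. A small standalone sketch of that pattern; the `/mnt/weights` override value is an example, not a repo default:

```python
import os

# Example: an override set before the lookup takes precedence over the fallback.
os.environ["MODELS_PATH"] = "/mnt/weights"

# Same pattern as gradio_app.py: env var first, corrected "./" default second.
DEMO_PATH = os.getenv("DEMO_PATH", "./demo")
TMP_PATH = os.getenv("TMP_PATH", "./demo/temp")
MODELS_PATH = os.getenv("MODELS_PATH", "./pretrained_models")

print(MODELS_PATH)  # the override, not "./pretrained_models"
```

The fix from `.demo` to `./demo` matters because `.demo` names a hidden directory literally called `.demo`, while `./demo` points at the repo's actual `demo/` folder.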
@@ -371,20 +371,20 @@ demo_original_transcript = " But when I had approached so near to them, the comm
 demo_text = {
     "TTS": {
-        "smart": "I cannot believe that the same model can also do text to speech synthesis as well!",
+        "smart": "I cannot believe that the same model can also do text to speech synthesis too!",
-        "regular": "But when I had approached so near to them, the common I cannot believe that the same model can also do text to speech synthesis as well!"
+        "regular": "But when I had approached so near to them, the common I cannot believe that the same model can also do text to speech synthesis too!"
     },
     "Edit": {
         "smart": "saw the mirage of the lake in the distance,",
         "regular": "But when I saw the mirage of the lake in the distance, which the sense deceives, Lost not by distance any of its marks,"
     },
     "Long TTS": {
-        "smart": "You can run TTS on a big text!\n"
+        "smart": "You can run the model on a big text!\n"
             "Just write it line-by-line. Or sentence-by-sentence.\n"
-            "If some sentences sound odd, just rerun TTS on them, no need to generate the whole text again!",
+            "If some sentences sound odd, just rerun the model on them, no need to generate the whole text again!",
-        "regular": "But when I had approached so near to them, the common You can run TTS on a big text!\n"
+        "regular": "But when I had approached so near to them, the common You can run the model on a big text!\n"
             "But when I had approached so near to them, the common Just write it line-by-line. Or sentence-by-sentence.\n"
-            "But when I had approached so near to them, the common If some sentences sound odd, just rerun TTS on them, no need to generate the whole text again!"
+            "But when I had approached so near to them, the common If some sentences sound odd, just rerun the model on them, no need to generate the whole text again!"
     }
 }
@@ -602,9 +602,9 @@ if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="VoiceCraft gradio app.")
-    parser.add_argument("--demo-path", default=".demo", help="Path to demo directory")
-    parser.add_argument("--tmp-path", default=".demo/temp", help="Path to tmp directory")
-    parser.add_argument("--models-path", default=".pretrained_models", help="Path to voicecraft models directory")
+    parser.add_argument("--demo-path", default="./demo", help="Path to demo directory")
+    parser.add_argument("--tmp-path", default="./demo/temp", help="Path to tmp directory")
+    parser.add_argument("--models-path", default="./pretrained_models", help="Path to voicecraft models directory")
     parser.add_argument("--port", default=7860, type=int, help="App port")
     parser.add_argument("--share", action="store_true", help="Launch with public url")
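The corrected CLI defaults above can be exercised without launching the app. A standalone sketch of just the argument parsing, mirroring the flags in the diff:

```python
import argparse

# Same flags and (corrected) defaults as the gradio app's entry point.
parser = argparse.ArgumentParser(description="VoiceCraft gradio app.")
parser.add_argument("--demo-path", default="./demo", help="Path to demo directory")
parser.add_argument("--tmp-path", default="./demo/temp", help="Path to tmp directory")
parser.add_argument("--models-path", default="./pretrained_models", help="Path to voicecraft models directory")
parser.add_argument("--port", default=7860, type=int, help="App port")
parser.add_argument("--share", action="store_true", help="Launch with public url")

# No CLI args given: every option falls back to its default.
args = parser.parse_args([])
```

Note that argparse converts `--demo-path` to the attribute `args.demo_path`, so the corrected `./demo` default is what the app reads when no flag is passed.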

@@ -18,7 +18,7 @@
     },
     "outputs": [],
     "source": [
-        "!git clone https://github.com/zuev-stepan/VoiceCraft-gradio"
+        "!git clone https://github.com/jasonppy/VoiceCraft"
     ]
 },
 {