Merge pull request #2 from Sewlell/master

merge colab notebook
This commit is contained in:
zuev-stepan 2024-04-11 15:17:27 +05:00 committed by GitHub
commit 9344110b0c
GPG Key ID: B5690EEEBB952194
3 changed files with 138 additions and 3 deletions

View File

@@ -1,3 +1,9 @@
+# VoiceCraft Gradio Colab
+[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Sewlell/VoiceCraft-gradio-colab/blob/master/voicecraft.ipynb)
+Made for those who lack a dedicated GPU and those who want [the friendly GUI by @zuev-stepan](https://github.com/zuev-stepan/VoiceCraft-gradio). Potato programmer brain here, so all code credit goes to @jasonppy, @zuev-stepan, and the others who contributed.
 # VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
 [Demo](https://jasonppy.github.io/VoiceCraft_web) [Paper](https://jasonppy.github.io/assets/pdfs/VoiceCraft.pdf)

View File

@@ -1,6 +1,4 @@
 import os
-# os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
-# os.environ["CUDA_VISIBLE_DEVICES"] = "0" # for local use
 import gradio as gr
 import torch
 import torchaudio
@@ -14,6 +12,10 @@ import numpy as np
 import random
 import uuid
+os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
+os.environ["CUDA_VISIBLE_DEVICES"] = "0"
+os.chdir("/content/VoiceCraft-gradio-colab")
+os.environ['USER'] = 'aaa'
 TMP_PATH = os.getenv("TMP_PATH", "./demo/temp")
 device = "cuda" if torch.cuda.is_available() else "cpu"
@@ -596,4 +598,4 @@ with gr.Blocks() as app:
 if __name__ == "__main__":
-    app.launch()
+    app.launch(share=True)
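The Colab-specific additions above pin the CUDA device order, provide a `$USER` fallback, and change into the cloned repo before launching. A minimal standalone sketch of that setup pattern follows; the repo path and `USER` value come from the diff, but the claim about which dependencies read `$USER` is an assumption on my part:

```python
import os

# Order GPUs by PCI bus ID so "device 0" is deterministic, then
# expose only the first GPU to CUDA-aware libraries. These must be
# set before torch (or anything CUDA-aware) initializes.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# A bare Colab runtime may not define $USER; some audio/phonemizer
# tooling reads it, so give it a harmless fallback (assumption).
os.environ.setdefault("USER", "aaa")

# Run from the cloned repo so relative paths like ./demo/temp resolve;
# skip quietly if the clone lives elsewhere.
repo_dir = "/content/VoiceCraft-gradio-colab"
if os.path.isdir(repo_dir):
    os.chdir(repo_dir)
```

The matching `app.launch(share=True)` change asks Gradio for a temporary public `*.gradio.live` tunnel, which is what makes the UI reachable from outside the Colab VM.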

voicecraft.ipynb (new file, 127 lines)
View File

@@ -0,0 +1,127 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"gpuType": "T4",
"authorship_tag": "ABX9TyPsqFhtOeQ18CXHnRkWAQSk",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/Sewlell/VoiceCraft-gradio-colab/blob/master/voicecraft.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Y87ixxsUVIhM"
},
"outputs": [],
"source": [
"!git clone https://github.com/Sewlell/VoiceCraft-gradio-colab"
]
},
{
"cell_type": "code",
"source": [
"!pip install tensorboard\n",
"!pip install phonemizer\n",
"!pip install datasets\n",
"!pip install torchmetrics\n",
"\n",
"!apt-get install -y espeak espeak-data libespeak1 libespeak-dev\n",
"!apt-get install -y festival*\n",
"!apt-get install -y build-essential\n",
"!apt-get install -y flac libasound2-dev libsndfile1-dev vorbis-tools\n",
"!apt-get install -y libxml2-dev libxslt-dev zlib1g-dev\n",
"\n",
"!pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft\n",
"\n",
"!pip install -r \"/content/VoiceCraft-gradio-colab/gradio_requirements.txt\""
],
"metadata": {
"id": "-w3USR91XdxY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Let the session restart; the restart will not abort your installation."
],
"metadata": {
"id": "jNuzjrtmv2n1"
}
},
{
"cell_type": "markdown",
"source": [
"# Note before launching `gradio_app.py`\n",
"\n",
"***You will get a JSON warning if you change anything besides `sample_batch_size`, `stop_repetition`, and `seed`.*** In other words, most of the advanced settings, such as `kvcache` and `temperature`, cannot be set to different values.\n",
"\n",
"For some reason, the output audio downloads as a `.file` file. You will need to **convert it from .snd to .wav/.mp3 manually**, or, if your system shows file extensions in names, rename it to \"xxx.wav\" or \"xxx.mp3\". (Know a better solution? Open a pull request on my repository.)\n",
"\n",
"The frequent VRAM spikes no longer occur as of the April 5 update.\n",
"* Never mind: I have observed some odd readings on Colab's GPU memory monitor. Usage can spike up to 13.5 GB of VRAM even in WhisperX mode. (April 11)"
],
"metadata": {
"id": "AnqGEwZ4NxtJ"
}
},
{
"cell_type": "markdown",
"source": [
"Don't make your `prompt end time` too long; 6-9 s is fine. Otherwise it will **either raise a JSON issue or cut off your generated audio**. This comes from how VoiceCraft works (so it is probably unfixable): the text you want synthesized is appended to the end of the input audio transcript, and combined with the original transcript that becomes too many words for the application to handle. So please keep it short.\n",
"\n",
"Your total audio length (`prompt end time` plus generated audio) must not exceed 16-17 s."
],
"metadata": {
"id": "dE0W76cMN3Si"
}
},
{
"cell_type": "markdown",
"source": [
"For voice cloning, I suggest feeding it a fairly monotone input. You can always try input with lots of tonal variety, but as of the April 11 update I find a monotone voice much easier to replicate than audio containing laughing, screaming, or crying.\n",
"\n",
"Inference speed is fairly stable: with the sample text, a T4 (free-tier Colab GPU) takes 6-15 s at a `prompt end time` of 6-8 s.\n",
"\n",
"I haven't tested the Edit mode yet, as it is not my focus, but you can try it."
],
"metadata": {
"id": "nnu2cY4t8P6X"
}
},
{
"cell_type": "code",
"source": [
"!python \"/content/VoiceCraft-gradio-colab/gradio_app.py\""
],
"metadata": {
"id": "NDt4r4DiXAwG"
},
"execution_count": null,
"outputs": []
}
]
}
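One of the notebook's markdown cells notes that the downloaded output audio arrives with a generic `.file` extension and must be renamed by hand. A small stdlib sketch of that manual fix; the function name and the example filenames are hypothetical, and it only renames the file, so the extension you pick must match the codec the app actually produced:

```python
from pathlib import Path

def rename_output(path: str, target_ext: str = ".wav") -> Path:
    """Give a downloaded '.file' audio blob a usable extension.

    Renaming does not transcode anything; it just lets audio
    players recognize the container the bytes already use.
    """
    src = Path(path)
    dst = src.with_suffix(target_ext)  # output.file -> output.wav
    src.rename(dst)
    return dst
```

For example, `rename_output("output.file", ".mp3")` produces `output.mp3` in the same directory.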