{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "\"Open" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Y87ixxsUVIhM" }, "outputs": [], "source": [ "!git clone https://github.com/jasonppy/VoiceCraft" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-w3USR91XdxY" }, "outputs": [], "source": [ "!pip install tensorboard\n", "!pip install phonemizer\n", "!pip install datasets\n", "!pip install torchmetrics\n", "\n", "!apt-get install -y espeak espeak-data libespeak1 libespeak-dev\n", "!apt-get install -y festival*\n", "!apt-get install -y build-essential\n", "!apt-get install -y flac libasound2-dev libsndfile1-dev vorbis-tools\n", "!apt-get install -y libxml2-dev libxslt-dev zlib1g-dev\n", "\n", "!pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft\n", "\n", "!pip install -r \"/content/VoiceCraft/gradio_requirements.txt\"" ] }, { "cell_type": "markdown", "metadata": { "id": "jNuzjrtmv2n1" }, "source": [ "# Let it restarted, it won't let your entire installation be aborted." ] }, { "cell_type": "markdown", "metadata": { "id": "AnqGEwZ4NxtJ" }, "source": [ "# Note before launching the `gradio_app.py`\n", "\n", "***You will get JSON warning if you move anything beside `sample_batch_size`, `stop_repetition` and `seed`.*** Which for most advanced setting, `kvache` and `temperature` unable to set in different value.\n", "\n", "You will download a .file File when you download the output audio for some reason. You will need to **convert the file from .snd to .wav/.mp3 manually**. Or if you enable showing file type in the name in Windows or wherever you are, change the file name to \"xxx.wav\" or \"xxx.mp3\". (know the solution? pull request my repository)\n", "\n", "Frequency of VRAM spikes no longer exist as well in April 5 Update.\n", "* Nevermind, I have observed some weird usage on Colab's GPU Memory Monitor. It can spike up to 13.5GB VRAM even in WhisperX mode. (April 11)" ] }, { "cell_type": "markdown", "metadata": { "id": "dE0W76cMN3Si" }, "source": [ "Don't make your `prompt end time` too long, 6-9s is fine. Or else it will **either raise up JSON issue or cut off your generated audio**. This one is due to how VoiceCraft worked (so probably unfixable). It will add those text you want to get audio from at the end of the input audio transcript. It was way too much word for application or code to handle as it added up with original transcript. So please keep it short.\n", "\n", "Your total audio length (`prompt end time` + add-up audio) must not exceed 16 or 17s." ] }, { "cell_type": "markdown", "metadata": { "id": "nnu2cY4t8P6X" }, "source": [ "For voice cloning, I suggest you to probably have a monotone input to feed the voice cloning. Of course you can always try input that have tons of tone variety, but I find that as per April 11 Update, it's much more easy to replicate in monotone rather than audio that have laugh, scream, crying inside.\n", "\n", "The inference speed is much stable. With sample text, T4 (Free Tier Colab GPU) can do 6-15s on 6s-8s of `prompt end time`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NDt4r4DiXAwG" }, "outputs": [], "source": [ "!python /content/VoiceCraft/gradio_app.py --demo-path=/content/VoiceCraft/demo --tmp-path=/content/VoiceCraft/demo/temp --models-path=/content/VoiceCraft/pretrained_models --share" ] } ], "metadata": { "accelerator": "GPU", "colab": { "authorship_tag": "ABX9TyPsqFhtOeQ18CXHnRkWAQSk", "gpuType": "T4", "include_colab_link": true, "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }