mirror of
https://github.com/jasonppy/VoiceCraft.git
synced 2025-06-05 21:49:11 +02:00
weights, notebook working
2
.gitignore
vendored
@@ -15,6 +15,8 @@ thumbs.db
 *.png
 *.wav
 *.mp3
+*.pth
+*.th
 
 *durip*
 *rtx*
@@ -1,11 +1,15 @@
 # VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
 [Demo](https://jasonppy.github.io/VoiceCraft_web) [Paper](https://jasonppy.github.io/assets/pdfs/VoiceCraft.pdf)
 
 
 ### TL;DR
 VoiceCraft is a token infilling neural codec language model, that achieves state-of-the-art performance on both **speech editing** and **zero-shot text-to-speech (TTS)** on in-the-wild data including audiobooks, internet videos, and podcasts.
 
 To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference.
 
+## News
+:star: 03/28/2024: Model weights are up on HuggingFace🤗 [here](https://huggingface.co/pyp1/VoiceCraft/tree/main)!
+
+
 ## TODO
 The TODOs left will be completed by the end of March 2024.
@@ -13,8 +17,9 @@ The TODOs left will be completed by the end of March 2024.
 - [x] Environment setup
 - [x] Inference demo for speech editing and TTS
 - [x] Training guidance
-- [x] Upload the RealEdit dataset and training manifest
-- [ ] Upload model weights (encodec weights are up)
+- [x] RealEdit dataset and training manifest
+- [x] Model weights (both 330M and 830M, the former seems to be just as good but way faster)
+- [ ] More
 
 
 ## Environment setup
@@ -1,12 +1,12 @@
 Begin,End,Label,Type,Speaker
 0.03,0.18,but,words,temp
 0.18,0.32,when,words,temp
-0.32,0.49,i,words,temp
-0.49,0.64,had,words,temp
+0.32,0.48,i,words,temp
+0.48,0.64,had,words,temp
 0.64,1.19,approached,words,temp
 1.22,1.58,so,words,temp
-1.58,1.9,near,words,temp
-1.9,2.07,to,words,temp
+1.58,1.91,near,words,temp
+1.91,2.07,to,words,temp
 2.07,2.42,them,words,temp
 2.53,2.61,the,words,temp
 2.61,3.01,common,words,temp
@@ -19,8 +19,8 @@ Begin,End,Label,Type,Speaker
 5.54,6.0,not,words,temp
 6.0,6.14,by,words,temp
 6.14,6.67,distance,words,temp
-6.79,7.06,any,words,temp
-7.06,7.18,of,words,temp
+6.79,7.05,any,words,temp
+7.05,7.18,of,words,temp
 7.18,7.34,its,words,temp
 7.34,7.87,marks,words,temp
 0.03,0.06,B,phones,temp
@@ -29,22 +29,22 @@ Begin,End,Label,Type,Speaker
 0.18,0.23,W,phones,temp
 0.23,0.27,EH1,phones,temp
 0.27,0.32,N,phones,temp
-0.32,0.49,AY1,phones,temp
-0.49,0.5,HH,phones,temp
-0.5,0.6,AE1,phones,temp
+0.32,0.48,AY1,phones,temp
+0.48,0.49,HH,phones,temp
+0.49,0.6,AE1,phones,temp
 0.6,0.64,D,phones,temp
 0.64,0.7,AH0,phones,temp
 0.7,0.83,P,phones,temp
-0.83,0.87,R,phones,temp
-0.87,0.99,OW1,phones,temp
+0.83,0.88,R,phones,temp
+0.88,0.99,OW1,phones,temp
 0.99,1.12,CH,phones,temp
 1.12,1.19,T,phones,temp
 1.22,1.4,S,phones,temp
 1.4,1.58,OW1,phones,temp
 1.58,1.7,N,phones,temp
 1.7,1.84,IH1,phones,temp
-1.84,1.9,R,phones,temp
-1.9,2.01,T,phones,temp
+1.84,1.91,R,phones,temp
+1.91,2.01,T,phones,temp
 2.01,2.07,AH0,phones,temp
 2.07,2.13,DH,phones,temp
 2.13,2.3,EH1,phones,temp
@@ -75,8 +75,8 @@ Begin,End,Label,Type,Speaker
 4.34,4.42,D,phones,temp
 4.42,4.45,IH0,phones,temp
 4.45,4.59,S,phones,temp
-4.59,4.8,IY1,phones,temp
-4.8,4.87,V,phones,temp
+4.59,4.79,IY1,phones,temp
+4.79,4.87,V,phones,temp
 4.87,4.97,Z,phones,temp
 5.04,5.12,L,phones,temp
 5.12,5.33,AO1,phones,temp
@@ -96,14 +96,14 @@ Begin,End,Label,Type,Speaker
 6.57,6.67,S,phones,temp
 6.79,6.89,EH1,phones,temp
 6.89,6.95,N,phones,temp
-6.95,7.06,IY0,phones,temp
-7.06,7.13,AH0,phones,temp
+6.95,7.05,IY0,phones,temp
+7.05,7.13,AH0,phones,temp
 7.13,7.18,V,phones,temp
 7.18,7.22,IH0,phones,temp
 7.22,7.29,T,phones,temp
 7.29,7.34,S,phones,temp
 7.34,7.39,M,phones,temp
-7.39,7.49,AA1,phones,temp
-7.49,7.58,R,phones,temp
-7.58,7.69,K,phones,temp
-7.69,7.87,S,phones,temp
+7.39,7.5,AA1,phones,temp
+7.5,7.58,R,phones,temp
+7.58,7.7,K,phones,temp
+7.7,7.87,S,phones,temp
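Note on the alignment CSV hunks above: the table looks like forced-alignment output, with one row per word or phone interval (times presumably in seconds) and a `Type` column separating the two tiers. The sketch below is one way to read such a table and look up the phones that fall inside a given word. It is not code from this repository; the helper names `load_tiers` and `phones_for_word` are hypothetical, and the sample holds only a few rows copied from the new side of the diff.

```python
import csv
import io

# A few rows copied from the new side of the diff; the real file is longer.
SAMPLE = """Begin,End,Label,Type,Speaker
0.03,0.18,but,words,temp
0.18,0.32,when,words,temp
0.32,0.48,i,words,temp
0.48,0.64,had,words,temp
0.18,0.23,W,phones,temp
0.23,0.27,EH1,phones,temp
0.27,0.32,N,phones,temp
0.32,0.48,AY1,phones,temp
"""


def load_tiers(text):
    """Group alignment rows by their Type column ("words" or "phones")."""
    tiers = {}
    for row in csv.DictReader(io.StringIO(text)):
        interval = (float(row["Begin"]), float(row["End"]), row["Label"])
        tiers.setdefault(row["Type"], []).append(interval)
    return tiers


def phones_for_word(tiers, word_index, tol=1e-6):
    """Return labels of phones whose interval lies inside the chosen word."""
    begin, end, _ = tiers["words"][word_index]
    return [
        label
        for b, e, label in tiers["phones"]
        if b >= begin - tol and e <= end + tol
    ]


tiers = load_tiers(SAMPLE)
print(phones_for_word(tiers, 1))  # phones inside "when"
print(phones_for_word(tiers, 2))  # phones inside "i"
```

A lookup like this only works when word and phone boundaries agree, which is the property the edited timestamps preserve (for example, `i` and `AY1` both end at 0.48 on the new side).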
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
0
pretrained_models/.gitkeep
Normal file