weights, notebook working

jason-on-salt-a40 2024-03-28 16:21:30 -07:00
parent a129883910
commit ac73066eb7
6 changed files with 389 additions and 84 deletions

.gitignore

@@ -15,6 +15,8 @@ thumbs.db
 *.png
 *.wav
 *.mp3
+*.pth
+*.th
 *durip*
 *rtx*


@@ -1,11 +1,15 @@
# VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
[Demo](https://jasonppy.github.io/VoiceCraft_web) [Paper](https://jasonppy.github.io/assets/pdfs/VoiceCraft.pdf)
### TL;DR
VoiceCraft is a token infilling neural codec language model, that achieves state-of-the-art performance on both **speech editing** and **zero-shot text-to-speech (TTS)** on in-the-wild data including audiobooks, internet videos, and podcasts.
To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference.
## News
:star: 03/28/2024: Model weights are up on HuggingFace🤗 [here](https://huggingface.co/pyp1/VoiceCraft/tree/main)!
## TODO
The TODOs left will be completed by the end of March 2024.
@@ -13,8 +17,9 @@ The TODOs left will be completed by the end of March 2024.
 - [x] Environment setup
 - [x] Inference demo for speech editing and TTS
 - [x] Training guidance
-- [x] Upload the RealEdit dataset and training manifest
-- [ ] Upload model weights (encodec weights are up)
+- [x] RealEdit dataset and training manifest
+- [x] Model weights (both 330M and 830M, the former seems to be just as good but way faster)
+- [ ] More
 ## Environment setup


@@ -1,12 +1,12 @@
 Begin,End,Label,Type,Speaker
 0.03,0.18,but,words,temp
 0.18,0.32,when,words,temp
-0.32,0.49,i,words,temp
-0.49,0.64,had,words,temp
+0.32,0.48,i,words,temp
+0.48,0.64,had,words,temp
 0.64,1.19,approached,words,temp
 1.22,1.58,so,words,temp
-1.58,1.9,near,words,temp
-1.9,2.07,to,words,temp
+1.58,1.91,near,words,temp
+1.91,2.07,to,words,temp
 2.07,2.42,them,words,temp
 2.53,2.61,the,words,temp
 2.61,3.01,common,words,temp
@@ -19,8 +19,8 @@ Begin,End,Label,Type,Speaker
 5.54,6.0,not,words,temp
 6.0,6.14,by,words,temp
 6.14,6.67,distance,words,temp
-6.79,7.06,any,words,temp
-7.06,7.18,of,words,temp
+6.79,7.05,any,words,temp
+7.05,7.18,of,words,temp
 7.18,7.34,its,words,temp
 7.34,7.87,marks,words,temp
 0.03,0.06,B,phones,temp
@@ -29,22 +29,22 @@ Begin,End,Label,Type,Speaker
 0.18,0.23,W,phones,temp
 0.23,0.27,EH1,phones,temp
 0.27,0.32,N,phones,temp
-0.32,0.49,AY1,phones,temp
-0.49,0.5,HH,phones,temp
-0.5,0.6,AE1,phones,temp
+0.32,0.48,AY1,phones,temp
+0.48,0.49,HH,phones,temp
+0.49,0.6,AE1,phones,temp
 0.6,0.64,D,phones,temp
 0.64,0.7,AH0,phones,temp
 0.7,0.83,P,phones,temp
-0.83,0.87,R,phones,temp
-0.87,0.99,OW1,phones,temp
+0.83,0.88,R,phones,temp
+0.88,0.99,OW1,phones,temp
 0.99,1.12,CH,phones,temp
 1.12,1.19,T,phones,temp
 1.22,1.4,S,phones,temp
 1.4,1.58,OW1,phones,temp
 1.58,1.7,N,phones,temp
 1.7,1.84,IH1,phones,temp
-1.84,1.9,R,phones,temp
-1.9,2.01,T,phones,temp
+1.84,1.91,R,phones,temp
+1.91,2.01,T,phones,temp
 2.01,2.07,AH0,phones,temp
 2.07,2.13,DH,phones,temp
 2.13,2.3,EH1,phones,temp
@@ -75,8 +75,8 @@ Begin,End,Label,Type,Speaker
 4.34,4.42,D,phones,temp
 4.42,4.45,IH0,phones,temp
 4.45,4.59,S,phones,temp
-4.59,4.8,IY1,phones,temp
-4.8,4.87,V,phones,temp
+4.59,4.79,IY1,phones,temp
+4.79,4.87,V,phones,temp
 4.87,4.97,Z,phones,temp
 5.04,5.12,L,phones,temp
 5.12,5.33,AO1,phones,temp
@@ -96,14 +96,14 @@ Begin,End,Label,Type,Speaker
 6.57,6.67,S,phones,temp
 6.79,6.89,EH1,phones,temp
 6.89,6.95,N,phones,temp
-6.95,7.06,IY0,phones,temp
-7.06,7.13,AH0,phones,temp
+6.95,7.05,IY0,phones,temp
+7.05,7.13,AH0,phones,temp
 7.13,7.18,V,phones,temp
 7.18,7.22,IH0,phones,temp
 7.22,7.29,T,phones,temp
 7.29,7.34,S,phones,temp
 7.34,7.39,M,phones,temp
-7.39,7.49,AA1,phones,temp
-7.49,7.58,R,phones,temp
-7.58,7.69,K,phones,temp
-7.69,7.87,S,phones,temp
+7.39,7.5,AA1,phones,temp
+7.5,7.58,R,phones,temp
+7.58,7.7,K,phones,temp
+7.7,7.87,S,phones,temp
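
The CSV above stores word and phone intervals in one table, distinguished by the `Type` column. As a rough sketch of how such a file might be consumed (not code from this repository), the snippet below parses a few of the updated rows and groups phones under the word whose interval contains them. The helper names `load_alignment` and `phones_in_word` are hypothetical, and the times are assumed to be in seconds.

```python
import csv
import io

# A few rows copied from the updated alignment file shown in the diff.
SAMPLE = """Begin,End,Label,Type,Speaker
0.03,0.18,but,words,temp
0.18,0.32,when,words,temp
0.32,0.48,i,words,temp
0.48,0.64,had,words,temp
0.32,0.48,AY1,phones,temp
0.48,0.49,HH,phones,temp
0.49,0.6,AE1,phones,temp
0.6,0.64,D,phones,temp
"""


def load_alignment(text):
    """Split alignment rows into (begin, end, label) word and phone spans."""
    words, phones = [], []
    for row in csv.DictReader(io.StringIO(text)):
        span = (float(row["Begin"]), float(row["End"]), row["Label"])
        (words if row["Type"] == "words" else phones).append(span)
    return words, phones


def phones_in_word(word, phones):
    """Return labels of phones whose interval lies inside the word's interval."""
    begin, end, _ = word
    return [label for b, e, label in phones if b >= begin and e <= end]


words, phones = load_alignment(SAMPLE)
had_phones = phones_in_word(words[3], phones)  # phones under the word "had"
i_phones = phones_in_word(words[2], phones)    # phones under the word "i"
```

Grouping by containment only works when word and phone boundaries agree exactly, which is what the adjusted timestamps in this commit preserve (for example, `i` and `AY1` both end at 0.48).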


File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long
