finetune 830M

jason-on-salt-a40 2024-04-08 15:12:51 -07:00
parent a31be7023f
commit 778db3443d
2 changed files with 4 additions and 4 deletions


@@ -100,7 +100,7 @@ conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=
# install MFA english dictionary and model
mfa model download dictionary english_us_arpa
mfa model download acoustic english_us_arpa
-pip install huggingface_hub
+# pip install huggingface_hub
# conda install pocl # the above gives a warning about installing pocl; not sure if this is really needed
# to run ipynb
@@ -154,7 +154,7 @@ bash e830M.sh
It's the same procedure to prepare your own custom dataset. Make sure that if
## Finetuning
-You also need to do steps 1-4 as in Training, and I recommend using AdamW for better stability when finetuning a pretrained model. Check out the script `/home/pyp/VoiceCraft/z_scripts/e830M_ft.sh`.
+You also need to do steps 1-4 as in Training, and I recommend using AdamW for better stability when finetuning a pretrained model. Check out the script `./z_scripts/e830M_ft.sh`.
If your dataset introduces new phonemes (which is very likely) that don't exist in the giga checkpoint, make sure you combine the original phonemes with the phonemes from your data when constructing the vocab. You also need to adjust `--text_vocab_size` and `--text_pad_token` so that the former is greater than or equal to your vocab size, and the latter has the same value as `--text_vocab_size` (i.e. `--text_pad_token` is always the last token). Also, since the text embedding is now of a different size, make sure you modify the weight-loading code so that it won't crash (you could skip loading `text_embedding`, or load only the existing part and randomly initialize the new entries).
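
For the weight-loading change described above, here is a minimal sketch of a shape-tolerant partial load. The nesting of weights under a `model` key and the row-wise copy for the embedding are assumptions about the checkpoint layout, not taken from the repo; adjust to the actual model.

```python
import torch
import torch.nn as nn

def load_pretrained_partial(model: nn.Module, ckpt_path: str) -> None:
    """Copy pretrained weights into `model`, tolerating a resized text embedding."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    pretrained = ckpt.get("model", ckpt)  # assumption: weights may be nested under "model"
    state = model.state_dict()
    for name, tensor in pretrained.items():
        if name not in state:
            continue  # parameter renamed or removed in the new model
        if state[name].shape == tensor.shape:
            state[name] = tensor
        else:
            # e.g. the text embedding grew to fit the new phonemes: copy the
            # rows for the original vocab, keep random init for the new entries
            rows = min(state[name].shape[0], tensor.shape[0])
            state[name][:rows] = tensor[:rows]
    model.load_state_dict(state)
```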


@@ -11,7 +11,7 @@ exp_root="path/to/store/exp_results"
exp_name=e830M_ft
dataset_dir="path/to/stored_extracted_codes_and_phonemes/xl" # use xs if you only extracted xs in the previous step
encodec_codes_folder_name="encodec_16khz_4codebooks"
load_model_from="/home/pyp/VoiceCraft/pretrained_models/giga830M.pth"
load_model_from="./pretrained_models/giga830M.pth"
# export CUDA_LAUNCH_BLOCKING=1 # for debugging
@@ -34,7 +34,7 @@ torchrun --nnodes=1 --rdzv-backend=c10d --rdzv-endpoint=localhost:41977 --nproc_
--nhead 16 \
--num_decoder_layers 16 \
--max_num_tokens 20000 \
---gradient_accumulation_steps 20 \
+--gradient_accumulation_steps 12 \
--val_max_num_tokens 6000 \
--num_buckets 6 \
--audio_max_length 20 \
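
As a sanity check on the `--gradient_accumulation_steps` change (20 → 12): with token-based batching, the effective number of tokens per optimizer update is roughly the per-GPU token cap times the accumulation steps times the GPU count. A back-of-the-envelope sketch follows; the 4-GPU figure is an assumption and depends on `--nproc_per_node`.

```python
# Rough effective-batch arithmetic for the flags above.
max_num_tokens = 20_000  # per-GPU token cap per forward pass (--max_num_tokens)
grad_accum_steps = 12    # reduced from 20 in this commit
num_gpus = 4             # assumption; set by --nproc_per_node

tokens_per_update = max_num_tokens * grad_accum_steps * num_gpus
print(f"~{tokens_per_update:,} tokens per optimizer update")  # ~960,000
```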