extraction,training,data,weights

This commit is contained in:
jason-on-salt-a40 2024-03-24 19:43:37 -07:00
parent d754e9109a
commit a129883910
7 changed files with 686 additions and 176 deletions


@ -1,7 +1,7 @@
# VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
[Demo](https://jasonppy.github.io/VoiceCraft_web) [Paper](https://jasonppy.github.io/assets/pdfs/VoiceCraft.pdf)
TL;DR:
### TL;DR
VoiceCraft is a token-infilling neural codec language model that achieves state-of-the-art performance on both **speech editing** and **zero-shot text-to-speech (TTS)** on in-the-wild data, including audiobooks, internet videos, and podcasts.
To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference audio.
@ -12,22 +12,25 @@ The TODOs left will be completed by the end of March 2024.
- [x] Codebase upload
- [x] Environment setup
- [x] Inference demo for speech editing and TTS
- [ ] Upload model weights
- [ ] Training guidance
- [ ] Upload the RealEdit dataset
- [x] Training guidance
- [x] Upload the RealEdit dataset and training manifest
- [ ] Upload model weights (encodec weights are up)
## Environment setup
```bash
conda create -n voicecraft python=3.9.16
conda activate voicecraft
pip install torch==2.0.1 torchaudio==2.0.2 # this assumes your system is compatible with CUDA 11.7, otherwise checkout https://pytorch.org/get-started/previous-versions/#v201
pip install torch==2.0.1 # this assumes your system is compatible with CUDA 11.7; otherwise check out https://pytorch.org/get-started/previous-versions/#v201
apt-get install ffmpeg # if you don't already have ffmpeg installed
pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft
apt-get install espeak-ng # backend for the phonemizer installed below
pip install tensorboard==2.16.2
pip install phonemizer==3.2.1
pip install tensorboard
pip install datasets==2.12.0
pip install torchaudio==2.0.2
pip install datasets==2.16.0
pip install torchmetrics==0.11.1
# install MFA for forced alignment; this could take a few minutes
conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068
# conda install pocl # the above gives a warning about installing pocl; not sure if it is really needed
@ -36,9 +39,51 @@ conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=
conda install -n voicecraft ipykernel --update-deps --force-reinstall
```
If you run into version issues when running things, check out [environment.yml](./environment.yml) for exact version matching.
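Alternatively, a minimal sketch of recreating the pinned environment directly from that file (assuming conda is available and you run this from the repo root):
```bash
conda env create -f environment.yml  # creates the `voicecraft` env with the exact pinned versions
conda activate voicecraft
```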
## Inference Examples
Check out [`inference_speech_editing.ipynb`](./inference_speech_editing.ipynb) and [`inference_tts.ipynb`](./inference_tts.ipynb)
## Training
To train a VoiceCraft model, you need to prepare the following parts:
1. utterances and their transcripts
2. the utterances encoded into codes using, e.g., Encodec
3. the transcripts converted into phoneme sequences, plus a phoneme set (which we name vocab.txt)
4. a manifest (i.e. metadata)
Steps 1, 2, and 3 are handled in [./data/phonemize_encodec_encode_hf.py](./data/phonemize_encodec_encode_hf.py), where
1. Gigaspeech is downloaded through HuggingFace (note that you need to sign an agreement to download the dataset, and the script needs your auth token), and
2. phoneme sequences and encodec codes are extracted by the script.
An example run:
```bash
conda activate voicecraft
export CUDA_VISIBLE_DEVICES=0
cd ./data
python phonemize_encodec_encode_hf.py \
--dataset_size xs \
--download_to path/to/store_huggingface_downloads \
--save_dir path/to/store_extracted_codes_and_phonemes \
--encodec_model_path path/to/encodec_model \
--mega_batch_size 120 \
--batch_size 32 \
--max_len 30000
```
where `encodec_model_path` is available [here](https://huggingface.co/pyp1/VoiceCraft). This model was trained on Gigaspeech XL; it has 56M parameters and 4 codebooks, each with 2048 codes. Details are described in our [paper](https://jasonppy.github.io/assets/pdfs/VoiceCraft.pdf). If you encounter OOM during extraction, try decreasing `batch_size` and/or `max_len`.
The extracted codes, phonemes, and vocab.txt will be stored at `path/to/store_extracted_codes_and_phonemes/${dataset_size}/{encodec_16khz_4codebooks,phonemes,vocab.txt}`.
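A quick way to sanity-check the extraction output (a sketch; `xs` stands in for whatever `dataset_size` you used):
```bash
ls path/to/store_extracted_codes_and_phonemes/xs
# expected: encodec_16khz_4codebooks/  phonemes/  vocab.txt
# both folders contain one <segment_id>.txt file per utterance, written by the extraction script
```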
For the manifest, please download `train.txt` and `validation.txt` from [here](https://huggingface.co/datasets/pyp1/VoiceCraft_RealEdit/tree/main) and put them under `path/to/store_extracted_codes_and_phonemes/manifest/`. Please also download `vocab.txt` from the same place if you want to use our pretrained VoiceCraft model (so that the phoneme-to-token matching is the same).
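A sketch of fetching the manifest files, assuming the standard HuggingFace `resolve/main` download URLs for that dataset repo (downloading them through the web UI works just as well):
```bash
manifest_dir=path/to/store_extracted_codes_and_phonemes/manifest
mkdir -p $manifest_dir
wget -P $manifest_dir https://huggingface.co/datasets/pyp1/VoiceCraft_RealEdit/resolve/main/train.txt
wget -P $manifest_dir https://huggingface.co/datasets/pyp1/VoiceCraft_RealEdit/resolve/main/validation.txt
# optional: the vocab.txt that matches our pretrained model
wget https://huggingface.co/datasets/pyp1/VoiceCraft_RealEdit/resolve/main/vocab.txt
```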
Now, you are good to start training!
```bash
conda activate voicecraft
cd ./z_scripts
bash e830M.sh
```
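Before launching, edit the paths near the top of the training script so they point at your experiment folder and extracted data; the relevant variables, as they appear in the diff of `e830M.sh` at the bottom of this commit, look like:
```bash
exp_root="path/to/store/exp_results"
dataset_dir="path/to/store_extracted_codes_and_phonemes/xl"  # use .../xs if you only extracted xs above
encodec_codes_folder_name="encodec_16khz_4codebooks"
```
The script also passes `--phn_folder_name "phonemes"` and `--manifest_name "manifest"`, which should match the folder names used above.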
## License
The codebase is under CC BY-NC-SA 4.0 ([LICENSE-CODE](./LICENSE-CODE)), and the model weights are under the Coqui Public Model License 1.0.0 ([LICENSE-MODEL](./LICENSE-MODEL)). Note that we use some code from other repositories that is under different licenses: `./models/codebooks_patterns.py` is under the MIT license; `./models/modules`, `./steps/optim.py`, and `data/tokenizer.py` are under the Apache License, Version 2.0; the phonemizer we use is under the GNU GPL 3.0 license. For a drop-in replacement of the phonemizer (i.e. text-to-IPA phoneme mapping), try [g2p](https://github.com/roedoejet/g2p) (MIT License) or [OpenPhonemizer](https://github.com/NeuralVox/OpenPhonemizer) (BSD-3-Clause Clear), although these are not tested.


@ -1,160 +0,0 @@
import argparse
def parse_args():
parser = argparse.ArgumentParser(description="encode the librilight dataset using encodec model")
parser.add_argument("--manifest_root", type=str, default="/home/pyp/audiocraft/egs/gigaspeech", help="this the dir of the audiocraft manifest!")
parser.add_argument('--audio_dir', type=str, default="/data/scratch/pyp/datasets/gigaspeech_flac", help="Path dirs of the flac audio files")
parser.add_argument('--save_dir', type=str, default="/data/scratch/pyp/datasets/gigaspeech_phn_enc_manifest/xl", help="path to the manifest, phonemes, and encodec codes dirs")
parser.add_argument('--encodec_model_path', type=str, default="/data/scratch/pyp/exp_pyp/audiocraft/encodec/xps/6f79c6a8/checkpoint.th")
parser.add_argument('--n_workers', type=int, default=32, help="Number of parallel worker processes")
parser.add_argument('--batch_size', type=int, default=64, help="batch size for encodec encoding, decrease it if OOM. This is the sum of batch size *over each gpu*, so increase it if you are using more gpus")
parser.add_argument('--model_sr', type=int, default=16000, help='encodec input audio sample rate')
parser.add_argument('--downsample_rate', type=int, default=320, help='encodec downsample rate')
parser.add_argument('--model_code_sr', type=int, default=50, help='encodec model code sample rate')
parser.add_argument('--len_cap', type=float, default=35.0, help='will drop audios that are longer than this number')
return parser.parse_args()
if __name__ == "__main__":
import logging
formatter = (
"%(asctime)s [%(levelname)s] %(filename)s:%(lineno)d || %(message)s"
)
logging.basicConfig(format=formatter, level=logging.INFO)
import os
import numpy as np
import torch
import torchaudio
import tqdm
import time
args = parse_args()
manifest_dir = args.manifest_root # this dir is scp-ed
audio_dir = args.audio_dir # this is scp-ed flac dir
encodec_signature = args.encodec_model_path.split("/")[-2]
save_codes_dir = os.path.join(args.save_dir, f"encodec_16khz_{encodec_signature}")
os.makedirs(save_codes_dir, exist_ok=True)
# model_sr = 16000
# downsample_rate = 320
# model_code_sr = 50
def sort_by_audio_len(lens):
inds = np.argsort(lens).tolist()
logging.info(f"longest: {lens[inds[-1]]/args.downsample_rate} encodec codes, {lens[inds[-1]]/args.model_sr:.2f} sec.")
logging.info(f"shortest: {lens[inds[0]]/args.downsample_rate} encodec codes, {lens[inds[0]]/args.model_sr:.2f} sec.")
logging.info(f"median: {lens[inds[len(inds)//2]]/args.downsample_rate} encodec codes, {lens[inds[len(inds)//2]]/args.model_sr:.2f} sec.")
logging.info(f"95 percentile longest: {lens[inds[int(len(inds)*0.95)]]/args.downsample_rate} encodec codes, {lens[inds[int(len(inds)*0.95)]]/args.model_sr:.2f} sec.")
return inds[::-1]
def write_array_to_txt_file(array, filename):
with open(filename, 'w') as f:
for a in array[:-1]:
f.write(' '.join(map(str, a))+'\n')
f.write(' '.join(map(str, array[-1])))
class mydataset(torch.utils.data.Dataset):
def __init__(self, split):
super().__init__()
# self.data = gs[split]
self.split = split
self.audio_root = audio_dir
manifest_fn = os.path.join(manifest_dir, split+".txt")
with open(manifest_fn, "r") as rf:
self.data = [l.strip().split("\t") for l in rf.readlines()]
def __len__(self):
return len(self.data)
def __getitem__(self, ind):
try:
afn = self.data[ind][0]
fn = os.path.join(self.audio_root, afn)
audio, sr = torchaudio.load(fn)
assert sr == args.model_sr, sr
except Exception as e:
logging.info(f"{e}")
return None, None, None
assert audio.ndim==2 and audio.shape[0] == 1, audio.shape
return audio.type(torch.float32).squeeze(0), audio.shape[-1], os.path.basename(afn).split(".")[0]
def collate(self, batch):
lens, audios, segment_ids = [], [], []
for item in batch:
if item[0] != None:
audios.append(item[0])
lens.append(item[1])
segment_ids.append(item[2])
return audios, lens, segment_ids
# load the encodec model
from audiocraft.solvers import CompressionSolver
model = CompressionSolver.model_from_checkpoint(args.encodec_model_path)
model = model.cuda()
model = model.eval()
model = torch.nn.DataParallel(model)
# setup dataloader
mega_batch_size = 2100
batch_size = args.batch_size
train_dataset = mydataset('train')
train_loader = torch.torch.utils.data.DataLoader(train_dataset, batch_size=mega_batch_size, shuffle=False, drop_last=False, num_workers=args.n_workers, collate_fn=train_dataset.collate)
validation_dataset = mydataset('validation')
validation_loader = torch.torch.utils.data.DataLoader(validation_dataset, batch_size=mega_batch_size, shuffle=False, drop_last=False, num_workers=args.n_workers, collate_fn=validation_dataset.collate)
test_dataset = mydataset('test')
test_loader = torch.torch.utils.data.DataLoader(test_dataset, batch_size=mega_batch_size, shuffle=False, drop_last=False, num_workers=args.n_workers, collate_fn=test_dataset.collate)
splits = ['validation', 'test', 'train']
loaders = [validation_loader, test_loader, train_loader]
# splits = ['validation'] # NOTE this is for debug, for example, see if the
# loaders = [validation_loader]
for split, loader in zip(splits, loaders):
skip = 0
logging.info(f"now processing split {split}...")
mega_n_steps = int(np.ceil(len(loader.dataset) / mega_batch_size))
# mega_n_steps = int(np.ceil(len(gs) / mega_batch_size))
logging.info(f"partition the split {split} into {mega_n_steps} parts, each has {mega_batch_size} samples")
# with open(mani_fn, "a") as mani_wf: # resume from where we failed
for m, mega_batch in enumerate(loader):
logging.info(f"====================================")
logging.info(f"====================================")
logging.info(f"now processing mega step {m+1}/{mega_n_steps}")
lengths = np.array(mega_batch[1])
sorted_inds = sort_by_audio_len(lengths)
for j in range(len(sorted_inds))[::-1]:
if lengths[sorted_inds[j]] < args.model_sr*0.2 or lengths[sorted_inds[j]] > args.model_sr*args.len_cap: # skip samples that are too short (shorter than 0.2s), or too big (bigger than 80s)
skip += 1
del sorted_inds[j]
n_steps = int(np.ceil(len(sorted_inds) / batch_size))
for n in tqdm.tqdm(range(n_steps), disable=True):
inds_used = sorted_inds[n*batch_size:(n+1)*batch_size]
wav_batch = [mega_batch[0][id] for id in inds_used]
all_lens = [mega_batch[1][id] for id in inds_used]
segment_id_batch = [mega_batch[2][id] for id in inds_used]
# print(segment_id_batch)
padded_wav = torch.nn.utils.rnn.pad_sequence(wav_batch, batch_first=True).unsqueeze(1) # [B, T] -> [B, 1, T]
with torch.no_grad():
if max(all_lens) > 300000 and len(all_lens) > 1: # NOTE decrease this (300000) if OOM, or chunk it into more than 2 forward passes
codes = []
inwav = padded_wav.cuda()
codes.append(model(inwav[:len(inwav)//2], encode=True)[0].cpu())
codes.append(model(inwav[len(inwav)//2:], encode=True)[0].cpu())
codes = torch.cat(codes, dim=0)
else:
encoded_frames = model(padded_wav.cuda(), encode=True) # wav needs to have shape [B, C, T], C is model.channels, which is 1 for the 24kHz encodec model
# logging.info(f"encoded_frames: {encoded_frames[0].shape}")
codes = encoded_frames[0].cpu()
for i, length in enumerate(all_lens):
save_fn = os.path.join(save_codes_dir, segment_id_batch[i]+".txt")
actual_len = round(length / args.downsample_rate) # 320 is downsample rate for this model
cur_code = codes[i].tolist() if type(codes) == list else codes[i, :, :actual_len].tolist()
write_array_to_txt_file(cur_code, save_fn)
# mani_wf.write(f"0\t{segment_id_batch[i]}\t{len(cur_code[0])}\n") # write to manifest file
# if i == 10:
# raise
# break
# logging.info(f"split {split} has {len(gs[split])} samples in total, skipped {skip} due to forbiden words")
logging.info(f"split {split} has {len(loader.dataset)} samples in total, skipped {skip} due to utterance being too long or too short")
# break


@ -54,8 +54,6 @@ class dataset(torch.utils.data.Dataset):
y = [[int(n)+self.args.n_special for n in l] for l in encos]
else:
y = [[int(n) for n in l] for l in encos]
if self.args.training_stage == 1 and not self.args.valle and not (self.args.musicgen or self.args.valle_orig):
y = y[:1]
except Exception as e:
logging.info(f"loading failed for {pf} and {ef}, maybe files don't exist or are corrupted")
logging.info(f"error message: {e}")
@ -141,15 +139,15 @@ class dataset(torch.utils.data.Dataset):
if self.args.pad_x:
res["x"] = torch.stack(out["x"], dim=0)
else:
res["x"] = torch.nn.utils.rnn.pad_sequence(out["x"], batch_first=True, padding_value=0 if self.args.sep_special_token else self.args.text_pad_token)
res["x"] = torch.nn.utils.rnn.pad_sequence(out["x"], batch_first=True, padding_value=self.args.text_pad_token)
res["x_lens"] = torch.LongTensor(out["x_len"])
if self.args.dynamic_batching:
if out['y'][0].ndim==2:
res['y'] = torch.nn.utils.rnn.pad_sequence([item.transpose(1,0) for item in out['y']],padding_value=0 if self.args.sep_special_token else self.args.audio_pad_token)
res['y'] = torch.nn.utils.rnn.pad_sequence([item.transpose(1,0) for item in out['y']],padding_value=self.args.audio_pad_token)
res['y'] = res['y'].permute(1,2,0) # T B K -> B K T
else:
assert out['y'][0].ndim==1, out['y'][0].shape
res['y'] = torch.nn.utils.rnn.pad_sequence(out['y'], batch_first=True, padding_value=0 if self.args.sep_special_token else self.args.audio_pad_token)
res['y'] = torch.nn.utils.rnn.pad_sequence(out['y'], batch_first=True, padding_value=self.args.audio_pad_token)
else:
res['y'] = torch.stack(out['y'], dim=0)
res["y_lens"] = torch.LongTensor(out["y_len"])


@ -0,0 +1,206 @@
import argparse
def parse_args():
parser = argparse.ArgumentParser(description="phonemize and encode the gigaspeech dataset using the encodec model")
parser.add_argument("--dataset_size", type=str, default='xs', help='sizes of gigaspeech, xs, s, m, l, xl. we use xl for VoiceCraft training, xs is good for debugging')
parser.add_argument('--download_to', type=str, default="/data/scratch/pyp/datasets/gigaspeech_debug", help="dir where you want the huggingface gigaspeech dataset to be downloaded to")
parser.add_argument('--save_dir', type=str, default="/data/scratch/pyp/datasets/gigaspeech_phn_enc_manifest_debug", help="path to the manifest, phonemes, and encodec codes dirs")
parser.add_argument('--encodec_model_path', type=str, default="/data/scratch/pyp/exp_pyp/audiocraft/encodec/xps/6f79c6a8/checkpoint.th")
parser.add_argument('--n_workers', type=int, default=4, help="Number of parallel worker processes")
parser.add_argument('--mega_batch_size', type=int, default=100, help="Number of samples in each mega batch for multiprocess dataloading")
parser.add_argument('--batch_size', type=int, default=4, help="batch size for encodec encoding, decrease it if OOM. This is the total batch size summed over all GPUs, so increase it if you are using more GPUs")
parser.add_argument('--model_sr', type=int, default=16000, help='encodec input audio sample rate')
parser.add_argument('--downsample_rate', type=int, default=320, help='encodec downsample rate')
parser.add_argument('--model_code_sr', type=int, default=50, help='encodec model code sample rate')
parser.add_argument('--len_cap', type=float, default=35.0, help='will drop audios that are longer than this number of seconds')
parser.add_argument('--max_len', type=int, default=30000, help='max length of audio in samples, if exceed, will cut a batch into half to process, decrease this number if OOM on your machine')
return parser.parse_args()
if __name__ == "__main__":
import logging
formatter = (
"%(asctime)s [%(levelname)s] %(filename)s:%(lineno)d || %(message)s"
)
logging.basicConfig(format=formatter, level=logging.INFO)
args = parse_args()
import os
import numpy as np
import torch
import tqdm
import time
from datasets import load_dataset, DownloadConfig
from tokenizer import TextTokenizer, tokenize_text
# get the path
phn_save_root = os.path.join(args.save_dir, args.dataset_size, "phonemes")
codes_save_root = os.path.join(args.save_dir, args.dataset_size, "encodec_16khz_4codebooks")
vocab_fn = os.path.join(args.save_dir, args.dataset_size, "vocab.txt")
os.makedirs(phn_save_root, exist_ok=True)
os.makedirs(codes_save_root, exist_ok=True)
def sort_by_audio_len(lens):
inds = np.argsort(lens).tolist()
logging.info(f"longest: {lens[inds[-1]]*args.model_code_sr} encodec codes, {lens[inds[-1]]:.2f} sec.")
logging.info(f"shortest: {lens[inds[0]]*args.model_code_sr} encodec codes, {lens[inds[0]]:.2f} sec.")
logging.info(f"median: {lens[inds[len(inds)//2]]*args.model_code_sr} encodec codes, {lens[inds[len(inds)//2]]:.2f} sec.")
logging.info(f"95 percentile longest: {lens[inds[int(len(inds)*0.95)]]*args.model_code_sr} encodec codes, {lens[inds[int(len(inds)*0.95)]]:.2f} sec.")
return inds[::-1]
def write_array_to_txt_file(array, filename):
with open(filename, 'w') as f:
for a in array[:-1]:
f.write(' '.join(map(str, a))+'\n')
f.write(' '.join(map(str, array[-1])))
### phonemization
# load tokenizer
# load the encodec model
from audiocraft.solvers import CompressionSolver
model = CompressionSolver.model_from_checkpoint(args.encodec_model_path)
model = model.cuda()
model = model.eval()
text_tokenizer = TextTokenizer()
# https://github.com/SpeechColab/GigaSpeech
# there are only four different punctuations
# need to check whether there are other strings that start with "<"
punc2sym = {" <COMMA>": ",", " <PERIOD>": ".", " <QUESTIONMARK>": "?", " <EXCLAMATIONPOINT>": "!"} # note the space in front of each punc name
gar2sym = {"<SIL>": "#%#", "<MUSIC>": "##%", "<NOISE>": "%%#", "<OTHER>":"%#%"} # so that they are safely kept as the original symbols when using tokenize_text
punc2sym.update(gar2sym)
word2sym = { "h æ ʃ h ɐ ʃ p ɚ s ɛ n t": "<MUSIC>", "h æ ʃ p ɚ s ɛ n t h æ ʃ": "<SIL>", "p ɚ s ɛ n t h ɐ ʃ p ɚ s ɛ n t": "<OTHER>", "p ɚ s ɛ n t p ɚ s ɛ n t h æ ʃ": "<NOISE>"}
forbidden_words = set(['#%#', '##%', '%%#', '%#%'])
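# to summarize the mappings above: punc2sym turns GigaSpeech punctuation tags into real punctuation,
# gar2sym turns garbage tags (<SIL>, <MUSIC>, ...) into rare '#'/'%' strings that survive phonemization
# (phonemized as "hash"/"percent"), and word2sym maps those phonemized forms back to the original tags;
# utterances whose text already contains the raw symbol strings (forbidden_words) are skipped below,
# presumably to avoid collisions with these placeholders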
dc = DownloadConfig(cache_dir=args.download_to)
stime = time.time()
logging.info("loading the dataset...")
gs = load_dataset("speechcolab/gigaspeech", args.dataset_size, use_auth_token=True, cache_dir = args.download_to, download_config=dc)
logging.info(f"time spend on loading the dataset: {time.time() - stime:.2f} seconds")
splits = ['validation', 'test', 'train']
logging.info(f"gigaspeech dataset {args.dataset_size} info: {gs}")
logging.info(f"phonemizing...")
phn_vocab = set()
all_lens = []
# you will see a ton of [WARNING] words_mismatch.py:88......, it's not an issue
for split in tqdm.tqdm(splits):
skip = 0
logging.info(f"now processing split {split}...")
for item in tqdm.tqdm(gs[split]):
save_fn = os.path.join(phn_save_root, item['segment_id']+".txt")
text = item['text']
if sum(word in forbidden_words for word in text.split(" ")):
logging.info(f"skip {item['segment_id']}, because it contains forbiden words. It's transcript: {text}")
skip += 1
continue
for k, v in punc2sym.items():
text = text.replace(k, v)
phn = tokenize_text(text_tokenizer, text)
phn_seq = " ".join(phn)
for k, v in word2sym.items():
phn_seq = phn_seq.replace(k, v)
phn_vocab.update(phn_seq.split(" "))
all_lens.append(len(phn_seq.split(" ")))
with open(save_fn, "w") as f:
f.write(phn_seq)
logging.info(f"split {split} has {len(gs[split])} samples in total, skipped {skip} due to forbiden words")
print(f"phn vocab size: {len(list(phn_vocab))}")
print("phn sequence stats: ")
print(f"longest: {max(all_lens)}")
print(f"shortest: {min(all_lens)}")
print(f"median: {np.quantile(all_lens, 0.5)}")
print(f"95 percentile longest: {np.quantile(all_lens, 0.95)}")
print("write vocabulary to ", vocab_fn)
with open(vocab_fn, "w") as f:
for i, phn in enumerate(list(phn_vocab)):
if i < len(list(phn_vocab)) - 1:
f.write(f"{str(i)} {phn}\n")
else:
f.write(f"{str(i)} {phn}")
class mydataset(torch.utils.data.Dataset):
def __init__(self, split):
super().__init__()
self.data = gs[split]
def __len__(self):
return len(self.data)
def __getitem__(self, ind):
try:
segment_id, audio, sr, text, begin_time, end_time = self.data[ind]['segment_id'], torch.from_numpy(self.data[ind]['audio']['array']).float(), self.data[ind]['audio']['sampling_rate'], self.data[ind]['text'], self.data[ind]['begin_time'], self.data[ind]['end_time']
except:
return None, None, None, None, None, None
return segment_id, audio, sr, text, begin_time, end_time
def collate(self, batch):
res = {'segment_id': [], "audio": [], "sr": [], "text": [], "begin_time": [], "end_time": []}
for item in batch:
if item[0] != None:
res['segment_id'].append(item[0])
res['audio'].append(item[1])
res['sr'].append(item[2])
res['text'].append(item[3])
res['begin_time'].append(item[4])
res['end_time'].append(item[5])
return res
## encodec codes extraction
logging.info("encodec encoding...")
train_dataset = mydataset('train')
train_loader = torch.torch.utils.data.DataLoader(train_dataset, batch_size=args.mega_batch_size, shuffle=False, drop_last=False, num_workers=args.n_workers, collate_fn=train_dataset.collate)
validation_dataset = mydataset('validation')
validation_loader = torch.torch.utils.data.DataLoader(validation_dataset, batch_size=args.mega_batch_size, shuffle=False, drop_last=False, num_workers=args.n_workers, collate_fn=validation_dataset.collate)
test_dataset = mydataset('test')
test_loader = torch.torch.utils.data.DataLoader(test_dataset, batch_size=args.mega_batch_size, shuffle=False, drop_last=False, num_workers=args.n_workers, collate_fn=test_dataset.collate)
splits = ['validation', 'test', 'train']
loaders = [validation_loader, test_loader, train_loader]
# splits = ['validation'] # for debug
# loaders = [validation_loader]
for split, loader in zip(splits, loaders):
skip = 0
logging.info(f"now processing split {split}...")
mega_n_steps = int(np.ceil(len(gs[split]) / args.mega_batch_size))
logging.info(f"partition the split {split} into {mega_n_steps} parts, each has {args.mega_batch_size} samples")
for m, mega_batch in enumerate(loader):
logging.info(f"====================================")
logging.info(f"====================================")
logging.info(f"now processing mega step {m+1}/{mega_n_steps}")
lengths = np.array(mega_batch['end_time']) - np.array(mega_batch['begin_time'])
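# note: sorting the mega batch by duration (descending) groups clips of similar length into the
# encodec batches formed below, which keeps padding small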
sorted_inds = sort_by_audio_len(lengths)
for j in range(len(sorted_inds))[::-1]:
if lengths[sorted_inds[j]] < 0.2 or lengths[sorted_inds[j]] > args.len_cap: # skip samples that are too short (shorter than 0.2s) or too long (longer than args.len_cap seconds)
skip += 1
del sorted_inds[j]
n_steps = int(np.ceil(len(sorted_inds) / args.batch_size))
for n in tqdm.tqdm(range(n_steps), disable=True):
inds_used = sorted_inds[n*args.batch_size:(n+1)*args.batch_size]
audio_batch = [mega_batch['audio'][id] for id in inds_used]
sr_batch = [mega_batch['sr'][id] for id in inds_used]
segment_id_batch = [mega_batch['segment_id'][id] for id in inds_used]
text_batch = [mega_batch['text'][id] for id in inds_used]
padded_wav = torch.nn.utils.rnn.pad_sequence(audio_batch, batch_first=True).unsqueeze(1) # [B, T] -> [B, 1, T]
all_lens = [lengths[id] for id in inds_used]
with torch.no_grad():
if max(all_lens) > args.max_len and len(all_lens) > 1: # NOTE decrease args.max_len if OOM, or chunk it into more than 2 forward passes
codes = []
inwav = padded_wav.cuda()
codes.append(model.encode(inwav[:len(inwav)//2])[0].cpu())
codes.append(model.encode(inwav[len(inwav)//2:])[0].cpu())
codes = torch.cat(codes, dim=0)
else:
encoded_frames = model.encode(padded_wav.cuda())
# logging.info(f"encoded_frames: {encoded_frames[0].shape}")
codes = encoded_frames[0].cpu()
for i, length in enumerate(all_lens):
save_fn = os.path.join(codes_save_root, segment_id_batch[i]+".txt")
actual_len = round(length * args.model_code_sr) # model_code_sr is 50 codes per second for this model
cur_code = codes[i].tolist() if type(codes) == list else codes[i, :, :actual_len].tolist()
write_array_to_txt_file(cur_code, save_fn)

environment.yml (new file)

@ -0,0 +1,417 @@
name: voicecraft
channels:
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_gnu
- aom=3.8.2=h59595ed_0
- asttokens=2.4.1=pyhd8ed1ab_0
- atk-1.0=2.38.0=hd4edc92_1
- audioread=3.0.1=py39hf3d152e_1
- backcall=0.2.0=pyh9f0ad1d_0
- baumwelch=0.3.7=h00ab1b0_5
- biopython=1.79=py39hb9d737c_3
- brotli=1.1.0=hd590300_1
- brotli-bin=1.1.0=hd590300_1
- brotli-python=1.1.0=py39h3d6467e_1
- bzip2=1.0.8=hd590300_5
- ca-certificates=2024.2.2=hbcca054_0
- cairo=1.18.0=h3faef2a_0
- certifi=2024.2.2=pyhd8ed1ab_0
- cffi=1.16.0=py39h7a31438_0
- charset-normalizer=3.3.2=pyhd8ed1ab_0
- click=8.1.7=unix_pyh707e725_0
- colorama=0.4.6=pyhd8ed1ab_0
- comm=0.2.2=pyhd8ed1ab_0
- contourpy=1.2.0=py39h7633fee_0
- cycler=0.12.1=pyhd8ed1ab_0
- dataclassy=1.0.1=pyhd8ed1ab_0
- dav1d=1.2.1=hd590300_0
- debugpy=1.8.1=py39h3d6467e_0
- decorator=5.1.1=pyhd8ed1ab_0
- executing=2.0.1=pyhd8ed1ab_0
- expat=2.6.2=h59595ed_0
- ffmpeg=6.1.1=gpl_h38e077a_106
- font-ttf-dejavu-sans-mono=2.37=hab24e00_0
- font-ttf-inconsolata=3.000=h77eed37_0
- font-ttf-source-code-pro=2.038=h77eed37_0
- font-ttf-ubuntu=0.83=h77eed37_1
- fontconfig=2.14.2=h14ed4e7_0
- fonts-conda-ecosystem=1=0
- fonts-conda-forge=1=0
- fonttools=4.49.0=py39hd1e30aa_0
- freetype=2.12.1=h267a509_2
- fribidi=1.0.10=h36c2ea0_0
- gdk-pixbuf=2.42.10=h829c605_5
- gettext=0.21.1=h27087fc_0
- giflib=5.2.1=h0b41bf4_3
- gmp=6.3.0=h59595ed_1
- gnutls=3.7.9=hb077bed_0
- graphite2=1.3.13=h58526e2_1001
- graphviz=9.0.0=h78e8752_1
- greenlet=3.0.3=py39h3d6467e_0
- gtk2=2.24.33=h280cfa0_4
- gts=0.7.6=h977cf35_4
- harfbuzz=8.3.0=h3d44ed6_0
- hdbscan=0.8.33=py39h44dd56e_4
- icu=73.2=h59595ed_0
- idna=3.6=pyhd8ed1ab_0
- importlib-metadata=7.0.2=pyha770c72_0
- importlib-resources=6.3.0=pyhd8ed1ab_0
- importlib_metadata=7.0.2=hd8ed1ab_0
- importlib_resources=6.3.0=pyhd8ed1ab_0
- ipykernel=6.29.3=pyhd33586a_0
- jedi=0.19.1=pyhd8ed1ab_0
- joblib=1.3.2=pyhd8ed1ab_0
- jupyter_client=8.6.1=pyhd8ed1ab_0
- jupyter_core=5.7.2=py39hf3d152e_0
- kaldi=5.5.1068=cpu_h31769b2_2
- keyutils=1.6.1=h166bdaf_0
- kiwisolver=1.4.5=py39h7633fee_1
- kneed=0.8.5=pyhd8ed1ab_0
- krb5=1.21.2=h659d440_0
- lame=3.100=h166bdaf_1003
- lazy_loader=0.3=pyhd8ed1ab_0
- lcms2=2.16=hb7c19ff_0
- ld_impl_linux-64=2.40=h41732ed_0
- lerc=4.0.0=h27087fc_0
- libabseil=20240116.1=cxx17_h59595ed_2
- libass=0.17.1=h8fe9dca_1
- libblas=3.9.0=21_linux64_openblas
- libbrotlicommon=1.1.0=hd590300_1
- libbrotlidec=1.1.0=hd590300_1
- libbrotlienc=1.1.0=hd590300_1
- libcblas=3.9.0=21_linux64_openblas
- libclang-cpp15=15.0.7=default_hb11cfb5_4
- libdeflate=1.19=hd590300_0
- libdrm=2.4.120=hd590300_0
- libedit=3.1.20191231=he28a2e2_2
- libexpat=2.6.2=h59595ed_0
- libffi=3.4.2=h7f98852_5
- libflac=1.4.3=h59595ed_0
- libgcc-ng=13.2.0=h807b86a_5
- libgd=2.3.3=h119a65a_9
- libgfortran-ng=13.2.0=h69a702a_5
- libgfortran5=13.2.0=ha4646dd_5
- libglib=2.80.0=hf2295e7_0
- libgomp=13.2.0=h807b86a_5
- libhwloc=2.9.3=default_h554bfaf_1009
- libiconv=1.17=hd590300_2
- libidn2=2.3.7=hd590300_0
- libjpeg-turbo=3.0.0=hd590300_1
- liblapack=3.9.0=21_linux64_openblas
- liblapacke=3.9.0=21_linux64_openblas
- libllvm14=14.0.6=hcd5def8_4
- libllvm15=15.0.7=hb3ce162_4
- libllvmspirv15=15.0.0=h0cdce71_1
- libnsl=2.0.1=hd590300_0
- libogg=1.3.4=h7f98852_1
- libopenblas=0.3.26=pthreads_h413a1c8_0
- libopenvino=2024.0.0=h2e90f83_1
- libopenvino-auto-batch-plugin=2024.0.0=hd5fc58b_1
- libopenvino-auto-plugin=2024.0.0=hd5fc58b_1
- libopenvino-hetero-plugin=2024.0.0=h3ecfda7_1
- libopenvino-intel-cpu-plugin=2024.0.0=h2e90f83_1
- libopenvino-intel-gpu-plugin=2024.0.0=h2e90f83_1
- libopenvino-ir-frontend=2024.0.0=h3ecfda7_1
- libopenvino-onnx-frontend=2024.0.0=h757c851_1
- libopenvino-paddle-frontend=2024.0.0=h757c851_1
- libopenvino-pytorch-frontend=2024.0.0=h59595ed_1
- libopenvino-tensorflow-frontend=2024.0.0=hca94c1a_1
- libopenvino-tensorflow-lite-frontend=2024.0.0=h59595ed_1
- libopus=1.3.1=h7f98852_1
- libpciaccess=0.18=hd590300_0
- libpng=1.6.43=h2797004_0
- libpq=16.2=h33b98f1_0
- libprotobuf=4.25.3=h08a7969_0
- librosa=0.10.1=pyhd8ed1ab_0
- librsvg=2.56.3=he3f83f7_1
- libsndfile=1.2.2=hc60ed4a_1
- libsodium=1.0.18=h36c2ea0_1
- libsqlite=3.45.2=h2797004_0
- libstdcxx-ng=13.2.0=h7e041cc_5
- libtasn1=4.19.0=h166bdaf_0
- libtiff=4.6.0=ha9c0a0a_2
- libunistring=0.9.10=h7f98852_0
- libuuid=2.38.1=h0b41bf4_0
- libva=2.21.0=hd590300_0
- libvorbis=1.3.7=h9c3ff4c_0
- libvpx=1.14.0=h59595ed_0
- libwebp=1.3.2=h658648e_1
- libwebp-base=1.3.2=hd590300_0
- libxcb=1.15=h0b41bf4_0
- libxcrypt=4.4.36=hd590300_1
- libxml2=2.12.5=h232c23b_0
- libzlib=1.2.13=hd590300_5
- llvm-spirv-15=15.0.0=h0cdce71_1
- mad=0.15.1b=h9c3ff4c_1
- markdown-it-py=3.0.0=pyhd8ed1ab_0
- matplotlib-base=3.8.3=py39he9076e7_0
- matplotlib-inline=0.1.6=pyhd8ed1ab_0
- mdurl=0.1.2=pyhd8ed1ab_0
- montreal-forced-aligner=2.2.17=pyhd8ed1ab_0
- mpg123=1.32.4=h59595ed_0
- msgpack-python=1.0.7=py39h7633fee_0
- munkres=1.1.4=pyh9f0ad1d_0
- ncurses=6.4=h59595ed_2
- nest-asyncio=1.6.0=pyhd8ed1ab_0
- nettle=3.9.1=h7ab15ed_0
- ngram=1.3.14=h924138e_2
- numba=0.59.0=py39h615d6bd_1
- numpy=1.26.4=py39h474f0d3_0
- ocl-icd=2.3.2=hd590300_0
- openfst=1.8.2=h924138e_2
- openh264=2.4.1=h59595ed_0
- openjpeg=2.5.2=h488ebb8_0
- openssl=3.2.1=hd590300_0
- p11-kit=0.24.1=hc5aa10d_0
- packaging=24.0=pyhd8ed1ab_0
- pandas=2.2.1=py39hddac248_0
- pango=1.52.1=ha41ecd1_0
- parso=0.8.3=pyhd8ed1ab_0
- patsy=0.5.6=pyhd8ed1ab_0
- pcre2=10.43=hcad00b1_0
- pexpect=4.9.0=pyhd8ed1ab_0
- pgvector-python=0.2.5=pyhe093146_0
- pickleshare=0.7.5=py_1003
- pillow=10.2.0=py39had0adad_0
- pip=24.0=pyhd8ed1ab_0
- pixman=0.43.2=h59595ed_0
- platformdirs=4.2.0=pyhd8ed1ab_0
- pocl=5.0=h03a6ac1_2
- pocl-core=5.0=hdaecddf_2
- pocl-cpu=5.0=he901f76_2
- pocl-cpu-minimal=5.0=h5ccd973_2
- pocl-cuda=5.0=hdaecddf_2
- pocl-remote=5.0=h5ccd973_2
- pooch=1.8.1=pyhd8ed1ab_0
- postgresql=16.2=h7387d8b_0
- prompt-toolkit=3.0.42=pyha770c72_0
- prompt_toolkit=3.0.42=hd8ed1ab_0
- psutil=5.9.8=py39hd1e30aa_0
- psycopg2=2.9.9=py39h89197e3_0
- pthread-stubs=0.4=h36c2ea0_1001
- ptyprocess=0.7.0=pyhd3deb0d_0
- pugixml=1.14=h59595ed_0
- pure_eval=0.2.2=pyhd8ed1ab_0
- pycparser=2.21=pyhd8ed1ab_0
- pygments=2.17.2=pyhd8ed1ab_0
- pyparsing=3.1.2=pyhd8ed1ab_0
- pysocks=1.7.1=pyha2e5f31_6
- pysoundfile=0.12.1=pypyhd8ed1ab_1
- python=3.9.18=h0755675_1_cpython
- python-tzdata=2024.1=pyhd8ed1ab_0
- python_abi=3.9=4_cp39
- pytz=2024.1=pyhd8ed1ab_0
- pyyaml=6.0.1=py39hd1e30aa_1
- pyzmq=25.1.2=py39h8c080ef_0
- readline=8.2=h8228510_1
- requests=2.31.0=pyhd8ed1ab_0
- rich=13.7.1=pyhd8ed1ab_0
- rich-click=1.7.4=pyhd8ed1ab_0
- scikit-learn=1.2.2=py39hc236052_2
- scipy=1.12.0=py39h474f0d3_2
- seaborn=0.13.2=hd8ed1ab_0
- seaborn-base=0.13.2=pyhd8ed1ab_0
- setuptools=69.2.0=pyhd8ed1ab_0
- six=1.16.0=pyh6c4a22f_0
- snappy=1.1.10=h9fff704_0
- sox=14.4.2=ha5cc309_1018
- soxr=0.1.3=h0b41bf4_3
- soxr-python=0.3.7=py39h44dd56e_0
- sqlalchemy=2.0.28=py39hd1e30aa_0
- sqlite=3.45.2=h2c6b66d_0
- stack_data=0.6.2=pyhd8ed1ab_0
- statsmodels=0.14.1=py39h44dd56e_0
- svt-av1=1.8.0=h59595ed_0
- tbb=2021.11.0=h00ab1b0_1
- threadpoolctl=3.3.0=pyhc1e730c_0
- tk=8.6.13=noxft_h4845f30_101
- tornado=6.4=py39hd1e30aa_0
- tqdm=4.66.2=pyhd8ed1ab_0
- traitlets=5.14.2=pyhd8ed1ab_0
- typing-extensions=4.10.0=hd8ed1ab_0
- typing_extensions=4.10.0=pyha770c72_0
- tzcode=2024a=h3f72095_0
- tzdata=2024a=h0c530f3_0
- unicodedata2=15.1.0=py39hd1e30aa_0
- urllib3=2.2.1=pyhd8ed1ab_0
- wcwidth=0.2.13=pyhd8ed1ab_0
- wheel=0.42.0=pyhd8ed1ab_0
- x264=1!164.3095=h166bdaf_2
- x265=3.5=h924138e_3
- xorg-fixesproto=5.0=h7f98852_1002
- xorg-kbproto=1.0.7=h7f98852_1002
- xorg-libice=1.1.1=hd590300_0
- xorg-libsm=1.2.4=h7391055_0
- xorg-libx11=1.8.7=h8ee46fc_0
- xorg-libxau=1.0.11=hd590300_0
- xorg-libxdmcp=1.1.3=h7f98852_0
- xorg-libxext=1.3.4=h0b41bf4_2
- xorg-libxfixes=5.0.3=h7f98852_1004
- xorg-libxrender=0.9.11=hd590300_0
- xorg-renderproto=0.11.1=h7f98852_1002
- xorg-xextproto=7.3.0=h0b41bf4_1003
- xorg-xproto=7.0.31=h7f98852_1007
- xz=5.2.6=h166bdaf_0
- yaml=0.2.5=h7f98852_2
- zeromq=4.3.5=h59595ed_1
- zipp=3.17.0=pyhd8ed1ab_0
- zlib=1.2.13=hd590300_5
- zstd=1.5.5=hfc55251_0
- pip:
- absl-py==2.1.0
- aiofiles==23.2.1
- aiohttp==3.9.3
- aiosignal==1.3.1
- altair==5.2.0
- antlr4-python3-runtime==4.9.3
- anyio==4.3.0
- async-timeout==4.0.3
- attrs==23.2.0
- av==11.0.0
- babel==2.14.0
- beautifulsoup4==4.12.3
- bibtexparser==2.0.0b7
- bleach==6.1.0
- blis==0.7.11
- catalogue==2.0.10
- clldutils==3.22.2
- cloudpickle==3.0.0
- cmake==3.28.3
- colorlog==6.8.2
- confection==0.1.4
- csvw==3.3.0
- cymem==2.0.8
- cython==0.29.37
- datasets==2.16.0
- defusedxml==0.7.1
- demucs==4.0.1
- dill==0.3.6
- dlinfo==1.2.1
- docopt==0.6.2
- dora-search==0.1.12
- einops==0.7.0
- encodec==0.1.1
- exceptiongroup==1.2.0
- fastapi==0.110.0
- fastjsonschema==2.19.1
- ffmpy==0.3.2
- filelock==3.13.1
- flashy==0.0.2
- frozenlist==1.4.1
- fsspec==2023.10.0
- gradio==3.50.2
- gradio-client==0.6.1
- grpcio==1.62.1
- h11==0.14.0
- httpcore==1.0.4
- httpx==0.27.0
- huggingface-hub==0.21.4
- hydra-colorlog==1.2.0
- hydra-core==1.3.2
- ipython==8.12.3
- isodate==0.6.1
- jinja2==3.1.3
- jsonschema==4.21.1
- jsonschema-specifications==2023.12.1
- julius==0.2.7
- jupyterlab-pygments==0.3.0
- lameenc==1.7.0
- langcodes==3.3.0
- language-tags==1.2.0
- lit==18.1.1
- llvmlite==0.42.0
- lxml==5.1.0
- markdown==3.5.2
- markupsafe==2.1.5
- mistune==3.0.2
- mpmath==1.3.0
- msgpack==1.0.8
- multidict==6.0.5
- multiprocess==0.70.14
- murmurhash==1.0.10
- nbclient==0.10.0
- nbconvert==7.16.3
- nbformat==5.10.3
- networkx==3.2.1
- num2words==0.5.13
- nvidia-cublas-cu11==11.10.3.66
- nvidia-cuda-cupti-cu11==11.7.101
- nvidia-cuda-nvrtc-cu11==11.7.99
- nvidia-cuda-runtime-cu11==11.7.99
- nvidia-cudnn-cu11==8.5.0.96
- nvidia-cufft-cu11==10.9.0.58
- nvidia-curand-cu11==10.2.10.91
- nvidia-cusolver-cu11==11.4.0.1
- nvidia-cusparse-cu11==11.7.4.91
- nvidia-nccl-cu11==2.14.3
- nvidia-nvtx-cu11==11.7.91
- omegaconf==2.3.0
- openunmix==1.2.1
- orjson==3.9.15
- pandocfilters==1.5.1
- pathlib-abc==0.1.1
- pathy==0.11.0
- pgvector==0.2.2
- phonemizer==3.2.1
- pipreqs==0.5.0
- praatio==6.2.0
- preshed==3.0.9
- protobuf==4.25.3
- pyarrow==15.0.2
- pyarrow-hotfix==0.6
- pydantic==1.10.14
- pydub==0.25.1
- pylatexenc==2.10
- pynini==2.1.6
- pypinyin==0.48.0
- python-dateutil==2.9.0.post0
- python-multipart==0.0.9
- rdflib==7.0.0
- referencing==0.33.0
- regex==2023.12.25
- responses==0.18.0
- retrying==1.3.4
- rfc3986==1.5.0
- rpds-py==0.18.0
- safetensors==0.4.2
- segments==2.2.1
- semantic-version==2.10.0
- sentencepiece==0.2.0
- smart-open==6.4.0
- sniffio==1.3.1
- soupsieve==2.5
- spacy==3.5.2
- spacy-legacy==3.0.12
- spacy-loggers==1.0.5
- srsly==2.4.8
- starlette==0.36.3
- submitit==1.5.1
- sympy==1.12
- tabulate==0.9.0
- tensorboard==2.16.2
- tensorboard-data-server==0.7.2
- thinc==8.1.12
- tinycss2==1.2.1
- tokenizers==0.15.2
- toolz==0.12.1
- torch==2.0.1
- torchaudio==2.0.2
- torchmetrics==0.11.1
- transformers==4.38.2
- treetable==0.2.5
- triton==2.0.0
- typer==0.7.0
- uritemplate==4.1.1
- uvicorn==0.28.0
- wasabi==1.1.2
- webencodings==0.5.1
- websockets==11.0.3
- werkzeug==3.0.1
- xformers==0.0.22
- xxhash==3.4.1
- yarg==0.1.9
- yarl==1.9.4
prefix: /home/pyp/miniconda3/envs/voicecraft


@ -504,7 +504,7 @@ class VoiceCraft(nn.Module):
ntokens = []
top10acc = []
for k, (logit, target) in enumerate(zip(logits, targets)):
loss.append(F.cross_entropy(logit, target, reduction='mean', weight=self.class_weight.data if self.args.eog_weight!=1 else None))
loss.append(F.cross_entropy(logit, target, reduction='mean'))
top10acc.append(self.accuracy_metrics[k](logit.detach(), target))
ntokens.append(len(logit))
@ -988,6 +988,8 @@ class VoiceCraft(nn.Module):
for jj in range(1,self.args.n_codebooks):
logits_adjust[jj][eog_inference] = -10000
logits_adjust[jj][self.args.empty_token] = -10000
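# note: assuming encodec_sr is the codec frame rate in codes per second, encodec_sr // 5 below
# corresponds to roughly the first 0.2 s of generation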
if cur_num_gen <= self.args.encodec_sr // 5: # this shouldn't happen, but just in case the model stopped too early
logits_adjust[0][eog_inference] = -10000
##################### silence repetition handling #####################
if stop_repetition > 0 and prev_token in silence_tokens and consec_silence_count > stop_repetition:
if logits_adjust[0, prev_token] < 0:
@ -1237,6 +1239,8 @@ class VoiceCraft(nn.Module):
for jj in range(1,self.args.n_codebooks):
logits_adjust[:,jj,eog_inference] = -10000
logits_adjust[:,jj,self.args.empty_token] = -10000
if cur_num_gen <= self.args.encodec_sr // 5: # this shouldn't happen, but just in case the model stopped too early
logits_adjust[:,:,eog_inference] = -10000
##################### silence repetition handling #####################
for b in range(batch_size):
prev_token = prev_tokens[b]


@ -7,9 +7,9 @@ export WORLD_SIZE=4
dataset=gigaspeech
mkdir -p ./logs/${dataset}
exp_root="/data/scratch/pyp/exp_pyp/VoiceCraft"
exp_root="path/to/store/exp_results"
exp_name=e830M
dataset_dir="/data/scratch/pyp/datasets/gigaspeech_phn_enc_manifest/xl"
dataset_dir="path/to/stored_extracted_codes_and_phonemes/xl" # xs if you only extracted xs in previous step
encodec_codes_folder_name="encodec_16khz_4codebooks"
# export CUDA_LAUNCH_BLOCKING=1 # for debugging
@ -51,7 +51,7 @@ torchrun --nnodes=1 --rdzv-backend=c10d --rdzv-endpoint=localhost:41977 --nproc_
--text_vocab_size 100 \
--text_pad_token 100 \
--phn_folder_name "phonemes" \
--manifest_name "manifest_large16khz_lessambi" \
--manifest_name "manifest" \
--encodec_folder_name ${encodec_codes_folder_name} \
--audio_vocab_size 2048 \
--empty_token 2048 \