Read config.json and enable exllama loading if the model has a
`quantization_config` with a `quant_method` of `gptq`. Note that this
implementation is limited and only supports model.safetensors.
That said, this supports loading popular gptq quantized models
without renaming or symlinking the model file.
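
A minimal sketch of the detection described above, assuming the model
lives in a local directory. The helper names `is_gptq_model` and
`find_gptq_weights` are hypothetical and only illustrate the check;
the exllama loading itself is not shown.

```python
import json
from pathlib import Path


def is_gptq_model(model_dir):
    """Return True if config.json declares a gptq quantization_config."""
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        return False
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    # quantization_config may be absent or null in non-quantized models.
    quant = config.get("quantization_config") or {}
    return quant.get("quant_method") == "gptq"


def find_gptq_weights(model_dir):
    """Return the path to model.safetensors for a gptq model, else None."""
    weights = Path(model_dir) / "model.safetensors"
    # Only the single-file model.safetensors layout is handled here.
    if is_gptq_model(model_dir) and weights.is_file():
        return weights
    return None
```
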
The bos token was already hardcoded as a bad word id.
Store the bad word ids in a list and iterate over them during
generation. Add the Llama eos token to the list of bad words.
Also support "single line mode", which adds the newline token
(id 13) to the list.
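
The list handling could look roughly like the sketch below. The ids
1 (bos) and 2 (eos) are assumed from the Llama tokenizer, 13 is the
newline id mentioned above, and `build_badwords` / `mask_badwords` are
hypothetical names. Logits are a plain list here to keep the example
self-contained.

```python
# Assumed Llama tokenizer ids; adjust for other vocabularies.
BOS_ID = 1
EOS_ID = 2
NEWLINE_ID = 13


def build_badwords(single_line=False):
    """Return the token ids that must never be sampled."""
    badwords = [BOS_ID, EOS_ID]
    if single_line:
        # Single line mode: forbid newlines as well.
        badwords.append(NEWLINE_ID)
    return badwords


def mask_badwords(logits, badwords):
    """Return a copy of logits with every bad word set to -inf."""
    masked = list(logits)
    for token_id in badwords:
        masked[token_id] = float("-inf")
    return masked
```
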
There is a bug in PyTorch 2.0.1 that allows torch.multinomial to
sometimes choose elements that have zero probability. Since
this is uncommon, we can continue to use torch.multinomial as
long as we verify that the results are valid. If they aren't,
try again until the probability of each selected token is positive.
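
A sketch of the retry loop, written against a generic `draw` callable
so it runs without PyTorch. `sample_valid` is a hypothetical name, and
the `max_tries` cap is an added safety guard that is not part of the
description above.

```python
def sample_valid(probs, draw, num_samples=1, max_tries=100):
    """Draw token indices, retrying while any pick has zero probability.

    `draw(probs, num_samples)` must return a sequence of indices.
    With PyTorch it could be, for example:
        lambda p, n: torch.multinomial(p, n).tolist()
    """
    for _ in range(max_tries):
        picks = draw(probs, num_samples)
        # Accept the draw only if every selected token is possible.
        if all(probs[i] > 0 for i in picks):
            return picks
    raise RuntimeError("no valid sample after %d tries" % max_tries)
```

Because the invalid draws are rare, the loop normally finishes on the
first iteration and adds little overhead.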
The end-of-sequence (</s>) token indicates the end of a generation.
When a token sequence containing </s> is decoded, an extra (wrong)
space is inserted at the beginning of the generation. To avoid this,
strip the eos token out of the result before returning it.
The eos token was already being stripped later, so this doesn't
change the output except to avoid the spurious leading space.
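
One way to sketch the fix is to drop the eos id from the token list
before decoding. `decode_without_eos` is a hypothetical helper, the
eos id 2 is assumed from the Llama tokenizer, and `decode` stands in
for whatever tokenizer decode function is in use.

```python
EOS_ID = 2  # assumed Llama eos token id


def decode_without_eos(token_ids, decode, eos_id=EOS_ID):
    """Decode token_ids after removing every eos token."""
    kept = [t for t in token_ids if t != eos_id]
    return decode(kept)
```
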