Add tokenizers to guidebook

This commit is contained in:
SillyLossy
2023-04-26 00:51:53 +03:00
parent 2ae28023c0
commit 7bedfae633
2 changed files with 16 additions and 1 deletion


@@ -1038,7 +1038,11 @@
<div name="ContextFormatting">
<h4>Context Formatting</h4>
<div>
-<h4>Tokenizer</h4>
+<h4>Tokenizer
+    <a href="/notes#tokenizer" class="notes-link" target="_blank">
+        <span class="note-link-span">?</span>
+    </a>
+</h4>
<select id="tokenizer">
<option value="0">None / Estimated</option>
<option value="1">GPT-3 (OpenAI)</option>


@@ -400,6 +400,17 @@ _When using Pygmalion models these anchors are automatically disabled, since Pyg
To import Character.AI chats, use this tool: [https://github.com/0x000011b/characterai-dumper](https://github.com/0x000011b/characterai-dumper).
## Tokenizer
A tokenizer is a tool that breaks a piece of text down into smaller units called tokens. These tokens can be individual words or even parts of words, such as prefixes, suffixes, or punctuation. A rule of thumb is that one token generally corresponds to 3-4 characters of text.
SillyTavern can use the following tokenizers while forming a request to the AI backend:
1. None. Each token is estimated to be ~3.3 characters, rounded up to the nearest integer. **Try this if your prompts get cut off on high context lengths.** This approach is used by KoboldAI Lite.
2. GPT-3 tokenizer. **Use to get more accurate counts on OpenAI character cards.** Can be previewed here: [OpenAI Tokenizer](https://platform.openai.com/tokenizer).
3. (Legacy) GPT-2/3 tokenizer. Used by original TavernAI. **Pick this if you're unsure.** More info: [gpt-2-3-tokenizer](https://github.com/josephrocca/gpt-2-3-tokenizer).
4. SentencePiece tokenizer. Used by the LLaMA model family: Alpaca, Vicuna, Koala, etc. **Pick this if you use a LLaMA model.**
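The "None / Estimated" counting in option 1 can be sketched in a few lines (a minimal illustration of the ~3.3 characters-per-token heuristic; `estimate_tokens` is a hypothetical helper name, not SillyTavern's actual code):

```python
import math

def estimate_tokens(text: str) -> int:
    # "None / Estimated" heuristic: assume ~3.3 characters per token,
    # rounded up to the nearest integer.
    return math.ceil(len(text) / 3.3)

print(estimate_tokens("Hello, world!"))  # 13 chars -> ceil(13 / 3.3) = 4
```

Because this is only a character count, it can under- or overestimate the true token count, which is why prompts may get cut off at high context lengths when a more accurate tokenizer would have fit.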
## Advanced Formatting
The settings in this section allow for more control over the prompt-building strategy. Most specifics of prompt building depend on whether a Pygmalion model is selected or special formatting is force-enabled. The core differences between the formatting schemas are listed below.