Class LanguageModelTokenizerFactory
- Namespace
- AiDotNet.Tokenization
- Assembly
- AiDotNet.dll
Factory for creating tokenizers appropriate for different language model backbones.
public static class LanguageModelTokenizerFactory
- Inheritance
- object
- LanguageModelTokenizerFactory
Remarks
Different language models use different tokenization schemes:
- OPT, Chinchilla: GPT-style BPE tokenization
- Flan-T5: T5-style SentencePiece tokenization
- LLaMA, Vicuna, Mistral: LLaMA-style SentencePiece tokenization
- Phi, Qwen: GPT-style BPE with custom vocabulary
For Beginners: Each language model was trained with a specific tokenizer. Using a different tokenizer produces token IDs that do not match the model's vocabulary, so the output will be meaningless. This factory creates a basic tokenizer with the correct special tokens for each model type.
For production use, you should load the actual pretrained tokenizer from HuggingFace using AutoTokenizer.
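To make the grouping above concrete, here is a small self-contained sketch that records the model-to-scheme list as a lookup table. The class `TokenizerFamilySketch` and its `FamilyFor` method are hypothetical names invented for illustration; they are not part of AiDotNet, and the factory's actual internals may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only: none of these names belong to AiDotNet.
public static class TokenizerFamilySketch
{
    // Model name -> tokenization scheme, copied from the list in the Remarks.
    private static readonly Dictionary<string, string> Families =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["OPT"] = "GPT-style BPE",
            ["Chinchilla"] = "GPT-style BPE",
            ["Flan-T5"] = "T5-style SentencePiece",
            ["LLaMA"] = "LLaMA-style SentencePiece",
            ["Vicuna"] = "LLaMA-style SentencePiece",
            ["Mistral"] = "LLaMA-style SentencePiece",
            ["Phi"] = "GPT-style BPE with custom vocabulary",
            ["Qwen"] = "GPT-style BPE with custom vocabulary",
        };

    // Returns the scheme for a model name, or throws if the name is not listed.
    public static string FamilyFor(string modelName)
    {
        if (Families.TryGetValue(modelName, out var family))
        {
            return family;
        }
        throw new ArgumentException(
            $"No tokenizer family known for '{modelName}'.", nameof(modelName));
    }

    public static void Main()
    {
        Console.WriteLine(FamilyFor("Mistral")); // LLaMA-style SentencePiece
        Console.WriteLine(FamilyFor("flan-t5")); // T5-style SentencePiece (lookup ignores case)
    }
}
```

Throwing on an unknown name, rather than falling back to a default scheme, reflects the point made above: a silently mismatched tokenizer is worse than an early error.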
Methods
CreateForBackbone(LanguageModelBackbone, IEnumerable<string>?, int)
Creates a tokenizer appropriate for the specified language model backbone.
public static ITokenizer CreateForBackbone(LanguageModelBackbone backbone, IEnumerable<string>? corpus = null, int vocabSize = 30000)
Parameters
- backbone LanguageModelBackbone
The language model backbone type.
- corpus IEnumerable<string>
Optional training corpus. If null, uses a minimal English corpus.
- vocabSize int
Vocabulary size for training. The default is 30000, as declared in the method signature; pass a smaller value such as 1000 for quick testing.
Returns
- ITokenizer
A tokenizer configured for the specified backbone.
Remarks
IMPORTANT: This creates a minimal tokenizer suitable for testing and development. For production use with pretrained ONNX models, you MUST load the actual pretrained tokenizer that matches your model weights.
Use AutoTokenizer.FromPretrained(string, string?) to load the correct pretrained tokenizer for production use.
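The two paths described in these remarks might look like the following sketch. It uses only the members documented on this page, and it needs a reference to AiDotNet.dll to compile. The sample corpus is made up, and the namespaces of `LanguageModelBackbone`, `ITokenizer`, and `AutoTokenizer` are an assumption, since this page only states the namespace of the factory itself.

```csharp
using System;
using System.Collections.Generic;
using AiDotNet.Tokenization;
// Assumption: LanguageModelBackbone, ITokenizer and AutoTokenizer may live in
// other AiDotNet namespaces. Add the matching using directives if they do not resolve.

public static class TokenizerSetupSketch
{
    public static void Main()
    {
        var backbone = LanguageModelBackbone.LLaMA;

        // Development path: train a small throwaway tokenizer on a few sentences.
        // The corpus below is invented sample data.
        IEnumerable<string> corpus = new[]
        {
            "The quick brown fox jumps over the lazy dog.",
            "Tokenizers split text into smaller units.",
        };
        ITokenizer devTokenizer = LanguageModelTokenizerFactory.CreateForBackbone(
            backbone, corpus, vocabSize: 1000);

        // Production path: load the pretrained tokenizer that matches the model weights.
        string modelName = LanguageModelTokenizerFactory.GetHuggingFaceModelName(backbone);
        var prodTokenizer = AutoTokenizer.FromPretrained(modelName);

        Console.WriteLine($"Loaded tokenizers for {backbone} (pretrained source: {modelName}).");
    }
}
```

The development tokenizer is trained on whatever corpus is passed in, so its vocabulary will not line up with any pretrained weights. Only the tokenizer returned by the second path should be paired with a pretrained ONNX model.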
GetHuggingFaceModelName(LanguageModelBackbone)
Gets the recommended HuggingFace model name for loading a pretrained tokenizer.
public static string GetHuggingFaceModelName(LanguageModelBackbone backbone)
Parameters
- backbone LanguageModelBackbone
The language model backbone type.
Returns
- string
The HuggingFace model identifier for loading the tokenizer.
Remarks
Use this with AutoTokenizer.FromPretrained(string, string?) to load the correct pretrained tokenizer.
Example:
var modelName = LanguageModelTokenizerFactory.GetHuggingFaceModelName(LanguageModelBackbone.LLaMA);
var tokenizer = AutoTokenizer.FromPretrained(modelName);
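To see which identifier each backbone maps to without hard-coding member names, one option is to iterate over the enum. This sketch assumes that `LanguageModelBackbone` is an enum (suggested by the `LanguageModelBackbone.LLaMA` usage above, but not stated on this page) and that every member has a model name. It requires .NET 5 or later for `Enum.GetValues<TEnum>()` and a reference to AiDotNet.dll.

```csharp
using System;
using AiDotNet.Tokenization;
// Assumption: LanguageModelBackbone may live in another AiDotNet namespace.

public static class ListModelNamesSketch
{
    public static void Main()
    {
        // Walk every declared backbone and print its recommended model identifier.
        foreach (LanguageModelBackbone backbone in Enum.GetValues<LanguageModelBackbone>())
        {
            string modelName = LanguageModelTokenizerFactory.GetHuggingFaceModelName(backbone);
            Console.WriteLine($"{backbone} -> {modelName}");
        }
    }
}
```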
GetSpecialTokens(LanguageModelBackbone)
Gets the special tokens configuration for a language model backbone.
public static SpecialTokens GetSpecialTokens(LanguageModelBackbone backbone)
Parameters
- backbone LanguageModelBackbone
The language model backbone type.
Returns
- SpecialTokens
The special tokens configuration.