Class LanguageModelTokenizerFactory

Namespace
AiDotNet.Tokenization
Assembly
AiDotNet.dll

Factory for creating tokenizers appropriate for different language model backbones.

public static class LanguageModelTokenizerFactory
Inheritance
LanguageModelTokenizerFactory

Remarks

Different language models use different tokenization schemes:

  • OPT, Chinchilla: GPT-style BPE tokenization
  • Flan-T5: T5-style SentencePiece tokenization
  • LLaMA, Vicuna, Mistral: LLaMA-style SentencePiece tokenization
  • Phi, Qwen: GPT-style BPE with custom vocabulary

For Beginners: Each language model is trained with a specific tokenizer. A different tokenizer maps text to token IDs the model never saw during training, so the model's output becomes meaningless. This factory creates a basic tokenizer with the correct special tokens for each model type.

For production use, load the actual pretrained tokenizer from HuggingFace with AutoTokenizer.FromPretrained(string, string?).
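The model-to-scheme mapping listed above can be summarized as a simple lookup. The sketch below is illustrative only: TokenizerFamilySketch and FamilyFor are hypothetical names, the keys are the model names as written in the list, and none of it is part of AiDotNet or a description of how the factory is implemented.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of AiDotNet: restates the list above as data.
public static class TokenizerFamilySketch
{
    // Keys are the model names from the Remarks list; lookups ignore case.
    private static readonly Dictionary<string, string> Families =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["OPT"] = "GPT-style BPE",
            ["Chinchilla"] = "GPT-style BPE",
            ["Flan-T5"] = "T5-style SentencePiece",
            ["LLaMA"] = "LLaMA-style SentencePiece",
            ["Vicuna"] = "LLaMA-style SentencePiece",
            ["Mistral"] = "LLaMA-style SentencePiece",
            ["Phi"] = "GPT-style BPE with custom vocabulary",
            ["Qwen"] = "GPT-style BPE with custom vocabulary",
        };

    // Returns the tokenization scheme for a model name,
    // or throws if the model is not in the table.
    public static string FamilyFor(string modelName)
    {
        if (Families.TryGetValue(modelName, out var family))
        {
            return family;
        }

        throw new ArgumentException(
            $"No tokenizer family recorded for '{modelName}'.", nameof(modelName));
    }
}
```

Calling TokenizerFamilySketch.FamilyFor("Vicuna") returns the same scheme as "LLaMA", which reflects why a tokenizer built for one backbone cannot simply be reused for a model from a different family.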

Methods

CreateForBackbone(LanguageModelBackbone, IEnumerable<string>?, int)

Creates a tokenizer appropriate for the specified language model backbone.

public static ITokenizer CreateForBackbone(LanguageModelBackbone backbone, IEnumerable<string>? corpus = null, int vocabSize = 30000)

Parameters

backbone LanguageModelBackbone

The language model backbone type.

corpus IEnumerable<string>

Optional training corpus. If null, uses a minimal English corpus.

vocabSize int

Vocabulary size for training. Default is 30000; pass a smaller value such as 1000 for quick testing.

Returns

ITokenizer

A tokenizer configured for the specified backbone.

Remarks

IMPORTANT: This creates a minimal tokenizer suitable for testing and development. For production use with pretrained ONNX models, you MUST load the actual pretrained tokenizer that matches your model weights.

Use AutoTokenizer.FromPretrained(string, string?) to load the correct pretrained tokenizer for production use.
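As a hedged illustration of the testing path, the sketch below trains a small tokenizer on an in-memory corpus. It relies only on the CreateForBackbone signature documented above. The sample sentences are made up, and the code assumes LanguageModelBackbone resolves with the using directive shown; add the namespace that declares it if your build reports otherwise.

```csharp
using System;
using System.Collections.Generic;
using AiDotNet.Tokenization;

// Made-up sample text; substitute a corpus that resembles your real inputs.
var corpus = new List<string>
{
    "The quick brown fox jumps over the lazy dog.",
    "Tokenizers split text into subword units.",
};

// A small vocabulary keeps training fast, which suits unit tests.
// Assumption: LanguageModelBackbone is in scope via the usings above.
var tokenizer = LanguageModelTokenizerFactory.CreateForBackbone(
    LanguageModelBackbone.LLaMA,
    corpus: corpus,
    vocabSize: 1000);

// Print the concrete type the factory chose for this backbone.
Console.WriteLine(tokenizer.GetType().Name);
```

A tokenizer built this way has its own freshly trained vocabulary, so its token IDs will not line up with pretrained model weights. That is the reason for the production warning above.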

GetHuggingFaceModelName(LanguageModelBackbone)

Gets the recommended HuggingFace model name for loading a pretrained tokenizer.

public static string GetHuggingFaceModelName(LanguageModelBackbone backbone)

Parameters

backbone LanguageModelBackbone

The language model backbone type.

Returns

string

The HuggingFace model identifier for loading the tokenizer.

Remarks

Use this with AutoTokenizer.FromPretrained(string, string?) to load the correct pretrained tokenizer.

Example:

var modelName = LanguageModelTokenizerFactory.GetHuggingFaceModelName(LanguageModelBackbone.LLaMA);
var tokenizer = AutoTokenizer.FromPretrained(modelName);

GetSpecialTokens(LanguageModelBackbone)

Gets the special tokens configuration for a language model backbone.

public static SpecialTokens GetSpecialTokens(LanguageModelBackbone backbone)

Parameters

backbone LanguageModelBackbone

The language model backbone type.

Returns

SpecialTokens

The special tokens configuration.
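The members of SpecialTokens are not listed on this page, so the following sketch avoids guessing property names. Instead it uses standard .NET reflection to print whatever public readable properties the returned object exposes. It assumes LanguageModelBackbone is in scope via the using directive shown, and it prints nothing if the type exposes fields rather than properties.

```csharp
using System;
using System.Reflection;
using AiDotNet.Tokenization;

// Assumption: LanguageModelBackbone resolves with the usings above.
var special = LanguageModelTokenizerFactory.GetSpecialTokens(LanguageModelBackbone.LLaMA);

// Enumerate public instance properties without assuming their names.
foreach (PropertyInfo prop in special.GetType()
             .GetProperties(BindingFlags.Public | BindingFlags.Instance))
{
    // Skip indexers and write-only properties, which GetValue cannot read here.
    if (!prop.CanRead || prop.GetIndexParameters().Length > 0)
    {
        continue;
    }

    Console.WriteLine($"{prop.Name} = {prop.GetValue(special)}");
}
```

Running this for two different backbones and comparing the output is a quick way to see how their special tokens differ before choosing one.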