Class ClipTokenizerFactory
- Namespace
- AiDotNet.Tokenization
- Assembly
- AiDotNet.dll
Factory for creating CLIP-compatible tokenizers.
public static class ClipTokenizerFactory
- Inheritance
  - object ← ClipTokenizerFactory
Remarks
CLIP uses a BPE tokenizer with a vocabulary of 49408 tokens. This factory provides methods to create tokenizers from pretrained vocabulary files or to use a default configuration for testing.
For Beginners: CLIP needs a special tokenizer to break text into pieces.
A tokenizer factory is like a tool shop that builds tokenizers:
- You can load a pretrained tokenizer (recommended for production)
- You can create a simple tokenizer for testing
- The factory handles all the configuration details
Example usage:
// Load from pretrained files (recommended)
var tokenizer = ClipTokenizerFactory.FromPretrained(
    "path/to/vocab.json",
    "path/to/merges.txt");
// Or create a simple one for testing
var tokenizer = ClipTokenizerFactory.CreateSimple();
Fields
ClipPattern
The CLIP-specific pre-tokenization pattern.
public const string ClipPattern = "<\\|startoftext\\|>|<\\|endoftext\\|>|'s|'t|'re|'ve|'m|'ll|'d|[\\p{L}]+|[\\p{N}]|[^\\s\\p{L}\\p{N}]+"
Field Value
Remarks
This pattern is similar to GPT-2's, but it is designed for the lowercased input and the punctuation handling that CLIP expects.
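As a self-contained sketch, the pattern can be applied with .NET's Regex class to see how lowercased text is split into pre-tokens. The ClipPatternDemo class and PreTokenize helper below are illustrative, not part of the library; the constant is copied from ClipPattern above:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ClipPatternDemo
{
    // Copied from ClipTokenizerFactory.ClipPattern so the demo is self-contained.
    public const string ClipPattern =
        @"<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+";

    // Splits lowercased text into CLIP-style pre-tokens.
    // CLIP lowercases input before pre-tokenization, so we do the same here.
    public static string[] PreTokenize(string text) =>
        Regex.Matches(text.ToLowerInvariant(), ClipPattern)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", PreTokenize("A photo of a cat's toy!")));
        // a | photo | of | a | cat | 's | toy | !
    }
}
```

Note how the contraction `'s` is split off as its own pre-token and the trailing `!` is captured by the punctuation alternative.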
DefaultMaxLength
The default maximum sequence length for CLIP text encoder.
public const int DefaultMaxLength = 77
Field Value
DefaultVocabSize
The default vocabulary size for CLIP models.
public const int DefaultVocabSize = 49408
Field Value
Methods
CreateSimple(IEnumerable<string>?, int)
Creates a simple CLIP tokenizer for testing without pretrained files.
public static BpeTokenizer CreateSimple(IEnumerable<string>? corpus = null, int vocabSize = 1000)
Parameters
corpus (IEnumerable&lt;string&gt;): Optional corpus to train on. If null, uses a minimal configuration.
vocabSize (int): The vocabulary size to train. Default is 1000 for quick testing.
Returns
- BpeTokenizer
A CLIP-style BPE tokenizer.
Remarks
This creates a minimal tokenizer suitable for testing and development. For production use, always use FromPretrained(string, string) with the actual CLIP vocabulary files.
For Beginners: Use this for quick testing only!
This tokenizer won't produce the same results as the real CLIP tokenizer. It's only meant for:
- Unit testing
- Development and debugging
- Understanding how the tokenizer works
For real applications, always use the pretrained vocabulary files.
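A minimal sketch of using CreateSimple in a test, assuming the Encode result exposes a TokenIds collection as shown in the FromPretrained example on this page (the exact result shape is not documented here):

```csharp
using AiDotNet.Tokenization;

// Train a throwaway tokenizer on a tiny corpus; vocabSize is kept small for speed.
var corpus = new[] { "a photo of a cat", "a photo of a dog" };
var tokenizer = ClipTokenizerFactory.CreateSimple(corpus, vocabSize: 500);

var result = tokenizer.Encode("a photo of a cat");
// result.TokenIds will NOT match real CLIP token IDs --
// use FromPretrained with the actual vocabulary files for that.
```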
FromPretrained(string, string)
Creates a CLIP tokenizer from pretrained vocabulary and merge files.
public static BpeTokenizer FromPretrained(string vocabPath, string mergesPath)
Parameters
vocabPath (string): Path to the vocabulary JSON file (vocab.json).
mergesPath (string): Path to the merges text file (merges.txt).
Returns
- BpeTokenizer
A CLIP-compatible BPE tokenizer.
Remarks
The vocabulary file should be a JSON dictionary mapping tokens to IDs. The merges file should contain BPE merge rules, one per line.
For Beginners: To use CLIP's pretrained tokenizer:
Download the vocabulary files from HuggingFace:
- openai/clip-vit-base-patch32
- Files: vocab.json, merges.txt
Load them using this method:
var tokenizer = ClipTokenizerFactory.FromPretrained("vocab.json", "merges.txt");
Use the tokenizer:
var result = tokenizer.Encode("a photo of a cat"); // result.TokenIds: [49406, 320, 1125, 539, 320, 2368, 49407, ...]
GetDefaultEncodingOptions(int)
Gets the default encoding options for CLIP text encoding.
public static EncodingOptions GetDefaultEncodingOptions(int maxLength = 77)
Parameters
maxLength (int): The maximum sequence length. Default is 77.
Returns
- EncodingOptions
Encoding options configured for CLIP.
Remarks
CLIP expects text to be:
- Padded to exactly 77 tokens (or truncated if longer)
- Starting with the BOS token
- Ending with the EOS token
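A sketch of passing these options to Encode, assuming Encode has an overload that accepts an EncodingOptions argument (that signature is not documented on this page):

```csharp
using AiDotNet.Tokenization;

var tokenizer = ClipTokenizerFactory.FromPretrained("vocab.json", "merges.txt");

// 77 is CLIP's standard context length (DefaultMaxLength).
var options = ClipTokenizerFactory.GetDefaultEncodingOptions(maxLength: 77);

// The result should be padded or truncated to exactly 77 token IDs,
// beginning with the BOS token and ending with the EOS token.
var result = tokenizer.Encode("a photo of a cat", options);
```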
IsClipCompatible(ITokenizer)
Validates that a tokenizer is compatible with CLIP.
public static bool IsClipCompatible(ITokenizer tokenizer)
Parameters
tokenizer (ITokenizer): The tokenizer to validate.
Returns
- bool
True if the tokenizer is compatible, false otherwise.
Remarks
A CLIP-compatible tokenizer must:
- Have the correct special tokens
- Support the expected vocabulary size (close to 49408)
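For example, a guard clause before wiring a tokenizer into a CLIP text encoder might look like this (a sketch; the error message wording is illustrative):

```csharp
using System;
using AiDotNet.Tokenization;

var tokenizer = ClipTokenizerFactory.FromPretrained("vocab.json", "merges.txt");

// Fail fast if the tokenizer would produce IDs the text encoder cannot use.
if (!ClipTokenizerFactory.IsClipCompatible(tokenizer))
{
    throw new InvalidOperationException(
        "Tokenizer is missing CLIP special tokens or has the wrong vocabulary size.");
}
// Safe to use with a CLIP text encoder from here on.
```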