Class ClipTokenizerFactory
- Namespace
- AiDotNet.Tokenization
- Assembly
- AiDotNet.dll
Factory for creating CLIP-compatible tokenizers.
public static class ClipTokenizerFactory
- Inheritance
  - object ← ClipTokenizerFactory
Remarks
CLIP uses a BPE tokenizer with a vocabulary of 49408 tokens. This factory provides methods to create tokenizers from pretrained vocabulary files or to use a default configuration for testing.
For Beginners: CLIP needs a special tokenizer to break text into pieces.
A tokenizer factory is like a tool shop that builds tokenizers:
- You can load a pretrained tokenizer (recommended for production)
- You can create a simple tokenizer for testing
- The factory handles all the configuration details
Example usage:
// Load from pretrained files (recommended)
var tokenizer = ClipTokenizerFactory.FromPretrained(
    "path/to/vocab.json",
    "path/to/merges.txt");
// Or create a simple one for testing
var tokenizer = ClipTokenizerFactory.CreateSimple();
Fields
ClipPattern
The CLIP-specific pre-tokenization pattern.
public const string ClipPattern = "<\\|startoftext\\|>|<\\|endoftext\\|>|'s|'t|'re|'ve|'m|'ll|'d|[\\p{L}]+|[\\p{N}]|[^\\s\\p{L}\\p{N}]+"
Field Value
Remarks
This pattern is similar to GPT-2's, but it is designed for the lowercased input and the punctuation handling that CLIP expects.
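As a self-contained sketch, the pattern can be applied with .NET's Regex class to see how lowercased text is split into pre-tokens. The ClipPatternDemo class and PreTokenize helper below are illustrative, not part of the library; the constant is copied from ClipPattern above:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ClipPatternDemo
{
    // Copied from ClipTokenizerFactory.ClipPattern so the demo is self-contained.
    public const string ClipPattern =
        @"<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+";

    // Splits lowercased text into CLIP-style pre-tokens.
    // CLIP lowercases input before pre-tokenization, so we do the same here.
    public static string[] PreTokenize(string text) =>
        Regex.Matches(text.ToLowerInvariant(), ClipPattern)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", PreTokenize("A photo of a cat's toy!")));
        // a | photo | of | a | cat | 's | toy | !
    }
}
```

Note how the contraction `'s` is split off as its own pre-token and the trailing `!` is captured by the punctuation alternative.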
DefaultMaxLength
The default maximum sequence length for CLIP text encoder.
public const int DefaultMaxLength = 77
Field Value
DefaultVocabSize
The default vocabulary size for CLIP models.
public const int DefaultVocabSize = 49408
Field Value
Methods
CreateSimple(IEnumerable<string>?, int)
Creates a simple CLIP tokenizer for testing without pretrained files.
public static BpeTokenizer CreateSimple(IEnumerable<string>? corpus = null, int vocabSize = 1000)
Parameters
corpus (IEnumerable&lt;string&gt;): Optional corpus to train on. If null, uses a minimal configuration.
vocabSize (int): The vocabulary size to train. Default is 1000 for quick testing.
Returns
- BpeTokenizer
A CLIP-style BPE tokenizer.
Remarks
This creates a minimal tokenizer suitable for testing and development. For production use, always use FromPretrained(string, string) with the actual CLIP vocabulary files.
For Beginners: Use this for quick testing only!
This tokenizer won't produce the same results as the real CLIP tokenizer. It's only meant for:
- Unit testing
- Development and debugging
- Understanding how the tokenizer works
For real applications, always use the pretrained vocabulary files.
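A minimal sketch of using CreateSimple in a test, assuming the Encode result exposes a TokenIds collection as shown in the FromPretrained example on this page (the exact result shape is not documented here):

```csharp
using AiDotNet.Tokenization;

// Train a throwaway tokenizer on a tiny corpus; vocabSize is kept small for speed.
var corpus = new[] { "a photo of a cat", "a photo of a dog" };
var tokenizer = ClipTokenizerFactory.CreateSimple(corpus, vocabSize: 500);

var result = tokenizer.Encode("a photo of a cat");
// result.TokenIds will NOT match real CLIP token IDs --
// use FromPretrained with the actual vocabulary files for that.
```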
FromPretrained(string, string)
Creates a CLIP tokenizer from pretrained vocabulary and merge files.
public static BpeTokenizer FromPretrained(string vocabPath, string mergesPath)
Parameters
vocabPath (string): Path to the vocabulary JSON file (vocab.json).
mergesPath (string): Path to the merges text file (merges.txt).
Returns
- BpeTokenizer
A CLIP-compatible BPE tokenizer.
Remarks
The vocabulary file should be a JSON dictionary mapping tokens to IDs. The merges file should contain BPE merge rules, one per line.
For Beginners: To use CLIP's pretrained tokenizer:
Download the vocabulary files from HuggingFace:
- openai/clip-vit-base-patch32
- Files: vocab.json, merges.txt
Load them using this method:
var tokenizer = ClipTokenizerFactory.FromPretrained("vocab.json", "merges.txt");
Use the tokenizer:
var result = tokenizer.Encode("a photo of a cat"); // result.TokenIds: [49406, 320, 1125, 539, 320, 2368, 49407, ...]
GetDefaultEncodingOptions(int)
Gets the default encoding options for CLIP text encoding.
public static EncodingOptions GetDefaultEncodingOptions(int maxLength = 77)
Parameters
maxLength (int): The maximum sequence length. Default is 77.
Returns
- EncodingOptions
Encoding options configured for CLIP.
Remarks
CLIP expects text to be:
- Padded to exactly 77 tokens (or truncated if longer)
- Starting with the BOS token
- Ending with the EOS token
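A sketch of passing these options to Encode, assuming Encode has an overload that accepts an EncodingOptions argument (that signature is not documented on this page):

```csharp
using AiDotNet.Tokenization;

var tokenizer = ClipTokenizerFactory.FromPretrained("vocab.json", "merges.txt");

// 77 is CLIP's standard context length (DefaultMaxLength).
var options = ClipTokenizerFactory.GetDefaultEncodingOptions(maxLength: 77);

// The result should be padded or truncated to exactly 77 token IDs,
// beginning with the BOS token and ending with the EOS token.
var result = tokenizer.Encode("a photo of a cat", options);
```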
IsClipCompatible(ITokenizer)
Validates that a tokenizer is compatible with CLIP.
public static bool IsClipCompatible(ITokenizer tokenizer)
Parameters
tokenizer (ITokenizer): The tokenizer to validate.
Returns
- bool
True if the tokenizer is compatible, false otherwise.
Remarks
A CLIP-compatible tokenizer must:
- Have the correct special tokens
- Support the expected vocabulary size (close to 49408)
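For example, a guard clause before wiring a tokenizer into a CLIP text encoder might look like this (a sketch; the error message wording is illustrative):

```csharp
using System;
using AiDotNet.Tokenization;

var tokenizer = ClipTokenizerFactory.FromPretrained("vocab.json", "merges.txt");

// Fail fast if the tokenizer would produce IDs the text encoder cannot use.
if (!ClipTokenizerFactory.IsClipCompatible(tokenizer))
{
    throw new InvalidOperationException(
        "Tokenizer is missing CLIP special tokens or has the wrong vocabulary size.");
}
// Safe to use with a CLIP text encoder from here on.
```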