Class ClipTokenizerFactory

Namespace
AiDotNet.Tokenization
Assembly
AiDotNet.dll

Factory for creating CLIP-compatible tokenizers.

public static class ClipTokenizerFactory
Inheritance
ClipTokenizerFactory

Remarks

CLIP uses a byte-pair encoding (BPE) tokenizer with a vocabulary of 49408 tokens. This factory creates tokenizers either from pretrained vocabulary files or from a default configuration intended for testing.

For Beginners: CLIP needs a special tokenizer to break text into pieces.

A tokenizer factory is like a tool shop that builds tokenizers:

  1. You can load a pretrained tokenizer (recommended for production)
  2. You can create a simple tokenizer for testing
  3. The factory handles all the configuration details

Example usage:

// Load from pretrained files (recommended)
var tokenizer = ClipTokenizerFactory.FromPretrained(
    "path/to/vocab.json",
    "path/to/merges.txt"
);

// Or create a simple one for testing
// (named differently so both declarations can share one scope)
var testTokenizer = ClipTokenizerFactory.CreateSimple();

Fields

ClipPattern

The CLIP-specific pre-tokenization pattern.

public const string ClipPattern = "<\\|startoftext\\|>|<\\|endoftext\\|>|'s|'t|'re|'ve|'m|'ll|'d|[\\p{L}]+|[\\p{N}]|[^\\s\\p{L}\\p{N}]+"

Field Value

string

Remarks

This pattern is similar to the GPT-2 pre-tokenization pattern, but it is adapted to the lowercased input and the punctuation handling that CLIP expects.
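
To make the splitting behavior concrete, here is a minimal sketch that applies the same pattern with System.Text.RegularExpressions. The helper name PreTokenize is hypothetical, and lowercasing the input before matching is an assumption about how the pattern is meant to be used; the library's own pre-tokenizer may differ.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Same pattern as the ClipPattern field, written as a verbatim string.
const string clipPattern =
    @"<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+";

// Hypothetical helper (not part of AiDotNet): lowercases the text,
// then returns every regex match in order. Whitespace is never matched,
// so it is dropped between pieces.
string[] PreTokenize(string text) =>
    Regex.Matches(text.ToLowerInvariant(), clipPattern)
         .Cast<Match>()
         .Select(m => m.Value)
         .ToArray();

string[] pieces = PreTokenize("A photo of 42 cats!");
Console.WriteLine(string.Join(" | ", pieces));
// prints: a | photo | of | 4 | 2 | cats | !
```

Note that digits are split one at a time, while letter runs and punctuation runs stay together.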

DefaultMaxLength

The default maximum sequence length for CLIP text encoder.

public const int DefaultMaxLength = 77

Field Value

int

DefaultVocabSize

The default vocabulary size for CLIP models.

public const int DefaultVocabSize = 49408

Field Value

int

Methods

CreateSimple(IEnumerable<string>?, int)

Creates a simple CLIP tokenizer for testing without pretrained files.

public static BpeTokenizer CreateSimple(IEnumerable<string>? corpus = null, int vocabSize = 1000)

Parameters

corpus IEnumerable<string>

Optional corpus to train on. If null, uses a minimal configuration.

vocabSize int

The target vocabulary size for training. The default of 1000 keeps test runs fast.

Returns

BpeTokenizer

A CLIP-style BPE tokenizer.

Remarks

This creates a minimal tokenizer suitable for testing and development. For production use, always use FromPretrained(string, string) with the actual CLIP vocabulary files.

For Beginners: Use this for quick testing only!

This tokenizer won't produce the same results as the real CLIP tokenizer. It's only meant for:

  • Unit testing
  • Development and debugging
  • Understanding how the tokenizer works

For real applications, always use the pretrained vocabulary files.

FromPretrained(string, string)

Creates a CLIP tokenizer from pretrained vocabulary and merge files.

public static BpeTokenizer FromPretrained(string vocabPath, string mergesPath)

Parameters

vocabPath string

Path to the vocabulary JSON file (vocab.json).

mergesPath string

Path to the merges text file (merges.txt).

Returns

BpeTokenizer

A CLIP-compatible BPE tokenizer.

Remarks

The vocabulary file should be a JSON dictionary mapping tokens to IDs. The merges file should contain BPE merge rules, one per line.
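
As a rough illustration of those two file formats, the sketch below parses small in-memory samples with System.Text.Json. The sample contents are made up for illustration, and skipping a first line that starts with "#version" is an assumption based on how merges files are commonly distributed; FromPretrained may parse the files differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Illustrative stand-ins for the contents of vocab.json and merges.txt.
string vocabJson = "{\"<|startoftext|>\": 49406, \"<|endoftext|>\": 49407, \"a</w>\": 320}";
string mergesText = "#version: 0.2\ni n\nt h\na n";

// vocab.json: a JSON object mapping each token string to its integer ID.
Dictionary<string, int> vocab =
    JsonSerializer.Deserialize<Dictionary<string, int>>(vocabJson)
    ?? throw new InvalidOperationException("Vocabulary JSON was null.");

// merges.txt: one merge rule per line, written as two space-separated symbols.
var merges = mergesText
    .Split('\n', StringSplitOptions.RemoveEmptyEntries)
    .Where(line => !line.StartsWith("#version", StringComparison.Ordinal)) // assumed header line
    .Select(line => line.Trim().Split(' '))
    .Where(parts => parts.Length == 2)
    .Select(parts => (Left: parts[0], Right: parts[1]))
    .ToList();

Console.WriteLine($"{vocab.Count} tokens, {merges.Count} merge rules");
```

The order of the merge rules matters in BPE, because earlier rules take priority, so the sketch keeps them in a list rather than a set.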

For Beginners: To use CLIP's pretrained tokenizer:

  1. Download the vocabulary files from Hugging Face:

    • openai/clip-vit-base-patch32
    • Files: vocab.json, merges.txt
  2. Load them using this method:

    var tokenizer = ClipTokenizerFactory.FromPretrained(
        "vocab.json",
        "merges.txt"
    );
  3. Use the tokenizer:

    var result = tokenizer.Encode("a photo of a cat");
    // result.TokenIds: [49406, 320, 1125, 539, 320, 2368, 49407, ...]

GetDefaultEncodingOptions(int)

Gets the default encoding options for CLIP text encoding.

public static EncodingOptions GetDefaultEncodingOptions(int maxLength = 77)

Parameters

maxLength int

The maximum sequence length. Default is 77.

Returns

EncodingOptions

Encoding options configured for CLIP.

Remarks

CLIP expects text to be:

  • Padded to exactly 77 tokens (or truncated if longer)
  • Starting with the BOS token
  • Ending with the EOS token
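
The layout described above can be sketched without the library. The helper PadForClip is hypothetical, the BOS and EOS IDs (49406 and 49407) are taken from the FromPretrained example output, and reusing the EOS ID as the padding value is an assumption; the options returned by GetDefaultEncodingOptions may pad with a different ID.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of AiDotNet): wraps content token IDs in
// BOS/EOS, truncates the content if needed, and pads to maxLength.
int[] PadForClip(
    IReadOnlyList<int> contentIds,
    int maxLength = 77,
    int bosId = 49406,
    int eosId = 49407,
    int padId = 49407) // assumption: pad with the EOS ID
{
    if (maxLength < 2)
        throw new ArgumentOutOfRangeException(nameof(maxLength), "Need room for BOS and EOS.");

    var ids = new List<int>(maxLength) { bosId };
    ids.AddRange(contentIds.Take(maxLength - 2)); // leave two slots for BOS and EOS
    ids.Add(eosId);
    while (ids.Count < maxLength)
        ids.Add(padId);
    return ids.ToArray();
}

// Content IDs for "a photo of a cat", from the FromPretrained example.
int[] padded = PadForClip(new[] { 320, 1125, 539, 320, 2368 });
Console.WriteLine(padded.Length); // prints: 77
```

Truncating the content before appending EOS keeps the end marker in place even for long inputs.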

IsClipCompatible(ITokenizer)

Validates that a tokenizer is compatible with CLIP.

public static bool IsClipCompatible(ITokenizer tokenizer)

Parameters

tokenizer ITokenizer

The tokenizer to validate.

Returns

bool

True if the tokenizer is compatible, false otherwise.

Remarks

A CLIP-compatible tokenizer must:

  • Have the correct special tokens
  • Support the expected vocabulary size (close to 49408)
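
One way to picture these two checks is the sketch below, which works on a plain token-to-ID dictionary. The helper LooksClipCompatible and the tolerance of 100 are hypothetical choices for illustration; the criteria IsClipCompatible actually applies, including what counts as "close to 49408", are not specified here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not the library's implementation): checks for CLIP's
// two special tokens and a vocabulary size near the expected value.
bool LooksClipCompatible(
    IReadOnlyDictionary<string, int> vocab,
    int expectedSize = 49408,
    int tolerance = 100) // assumption: how far from 49408 still counts as "close"
{
    bool hasSpecialTokens =
        vocab.ContainsKey("<|startoftext|>") && vocab.ContainsKey("<|endoftext|>");
    bool sizeIsClose = Math.Abs(vocab.Count - expectedSize) <= tolerance;
    return hasSpecialTokens && sizeIsClose;
}

var tinyVocab = new Dictionary<string, int>
{
    ["<|startoftext|>"] = 0,
    ["<|endoftext|>"] = 1,
};

// The special tokens are present, but two entries is nowhere near 49408.
Console.WriteLine(LooksClipCompatible(tinyVocab)); // prints: False
```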