Class TokenizationConfig

Namespace
AiDotNet.Tokenization.Configuration
Assembly
AiDotNet.dll

Configuration options for tokenization in the prediction pipeline.

public class TokenizationConfig
Inheritance
TokenizationConfig
Remarks

For Beginners: Tokenization breaks text into smaller pieces (tokens) that a machine learning model can process. It is similar to splitting a sentence into words, except that a word may be split further into subwords so that rare or unknown words can still be represented.
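
To make the subword idea concrete, here is a minimal sketch of a greedy longest-match splitter. It is not AiDotNet's tokenizer: the class name ToySubwordTokenizer, the six-entry vocabulary, and the "[UNK]" fallback are all hypothetical and chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: a toy splitter, not part of AiDotNet.
public static class ToySubwordTokenizer
{
    // Hypothetical vocabulary; real vocabularies hold tens of thousands of entries.
    private static readonly HashSet<string> Vocabulary = new HashSet<string>
    {
        "token", "ization", "un", "break", "able", "the"
    };

    // Splits one word into the longest vocabulary pieces, scanning left to right.
    // Returns a single "[UNK]" token if any part of the word cannot be matched.
    public static List<string> SplitWord(string word)
    {
        var pieces = new List<string>();
        int start = 0;
        while (start < word.Length)
        {
            // Shrink the candidate from the right until it is in the vocabulary.
            int end = word.Length;
            while (end > start && !Vocabulary.Contains(word.Substring(start, end - start)))
            {
                end--;
            }

            if (end == start)
            {
                return new List<string> { "[UNK]" };
            }

            pieces.Add(word.Substring(start, end - start));
            start = end;
        }
        return pieces;
    }
}
```

With this vocabulary, SplitWord("tokenization") yields the pieces "token" and "ization", while a word with no matching pieces collapses to "[UNK]".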

Properties

AddSpecialTokens

Gets or sets whether to automatically add special tokens (like [CLS], [SEP]) during encoding. Default is true.

public bool AddSpecialTokens { get; set; }

Property Value

bool

DefaultEncodingOptions

Gets or sets the default encoding options for tokenization.

public EncodingOptions DefaultEncodingOptions { get; set; }

Property Value

EncodingOptions

EnableCaching

Gets or sets whether to cache tokenization results for repeated inputs. Default is false.

public bool EnableCaching { get; set; }

Property Value

bool
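
As a rough illustration of what caching repeated inputs means, the sketch below memoizes a tokenization function by input string. The CachingTokenizerSketch class and its InnerCalls counter are hypothetical; how AiDotNet stores, bounds, or evicts cached results is not described in this reference.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: a memoizing wrapper, not AiDotNet's cache.
public sealed class CachingTokenizerSketch
{
    private readonly Func<string, string[]> _inner;
    private readonly Dictionary<string, string[]> _cache = new Dictionary<string, string[]>();

    // Counts how often the wrapped tokenizer actually ran.
    public int InnerCalls { get; private set; }

    public CachingTokenizerSketch(Func<string, string[]> inner)
    {
        _inner = inner;
    }

    public string[] Tokenize(string text)
    {
        // Repeated inputs are answered from the dictionary without re-tokenizing.
        if (_cache.TryGetValue(text, out var cached))
        {
            return cached;
        }

        InnerCalls++;
        var tokens = _inner(text);
        _cache[text] = tokens;
        return tokens;
    }
}
```

This version is unbounded and not thread-safe; a cache used together with parallel batch processing would need a concurrent collection or locking.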

EnableParallelBatchProcessing

Gets or sets whether to use parallel processing for batch tokenization. Default is true; parallel processing applies only to batches that reach ParallelBatchThreshold.

public bool EnableParallelBatchProcessing { get; set; }

Property Value

bool

MaxLength

Gets or sets the maximum sequence length for tokenization. Sequences longer than this will be truncated.

public int? MaxLength { get; set; }

Property Value

int?

Padding

Gets or sets whether to pad sequences to the maximum length.

public bool Padding { get; set; }

Property Value

bool

PaddingSide

Gets or sets the side on which to pad sequences ("left" or "right"). Default is "right".

public string PaddingSide { get; set; }

Property Value

string

ParallelBatchThreshold

Gets or sets the minimum batch size to trigger parallel processing. Default is 32.

public int ParallelBatchThreshold { get; set; }

Property Value

int
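
The interaction between EnableParallelBatchProcessing and ParallelBatchThreshold can be sketched as a simple gate. The BatchDispatchSketch class is hypothetical, and treating the threshold as inclusive (a batch of exactly 32 runs in parallel) is an assumption based on the phrase "minimum batch size"; the library's actual dispatch logic may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: shows a threshold gate, not AiDotNet's implementation.
public static class BatchDispatchSketch
{
    // Assumption: the threshold is inclusive.
    public static bool ShouldRunInParallel(int batchSize, bool enableParallel, int threshold)
    {
        return enableParallel && batchSize >= threshold;
    }

    public static string[][] TokenizeBatch(
        IReadOnlyList<string> texts,
        Func<string, string[]> tokenizeOne,
        bool enableParallel = true,
        int threshold = 32)
    {
        if (ShouldRunInParallel(texts.Count, enableParallel, threshold))
        {
            // AsOrdered keeps results in the same order as the inputs.
            return texts.AsParallel().AsOrdered().Select(tokenizeOne).ToArray();
        }

        // Small batches stay sequential to avoid scheduling overhead.
        return texts.Select(tokenizeOne).ToArray();
    }
}
```

The threshold exists because parallel scheduling has a fixed cost that only pays off once a batch is large enough.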

ReturnAttentionMask

Gets or sets whether to return attention masks. Default is true.

public bool ReturnAttentionMask { get; set; }

Property Value

bool

ReturnTokenTypeIds

Gets or sets whether to return token type IDs (for models like BERT with multiple segments). Default is false.

public bool ReturnTokenTypeIds { get; set; }

Property Value

bool
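
For readers unfamiliar with token type IDs, the sketch below shows the usual BERT convention for a sentence pair: special tokens frame the two segments, and each position is labeled 0 (first segment) or 1 (second segment). The BertPairSketch class is hypothetical and works on already-split tokens; it illustrates the convention rather than AiDotNet's encoder.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: the common BERT pair layout, not AiDotNet code.
public static class BertPairSketch
{
    public static (string[] Tokens, int[] TokenTypeIds) BuildPair(string[] first, string[] second)
    {
        // Layout: [CLS] first... [SEP] second... [SEP]
        var tokens = new List<string> { "[CLS]" };
        tokens.AddRange(first);
        tokens.Add("[SEP]");
        int firstSegmentLength = tokens.Count;

        tokens.AddRange(second);
        tokens.Add("[SEP]");

        // Positions in the first segment keep the default 0; the rest are set to 1.
        var typeIds = new int[tokens.Count];
        for (int i = firstSegmentLength; i < typeIds.Length; i++)
        {
            typeIds[i] = 1;
        }

        return (tokens.ToArray(), typeIds);
    }
}
```

Single-segment inputs produce all zeros, which is one reason returning token type IDs is often left off unless a model needs them.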

Truncation

Gets or sets whether to truncate sequences that exceed the maximum length.

public bool Truncation { get; set; }

Property Value

bool

TruncationSide

Gets or sets the side on which to truncate sequences ("left" or "right"). Default is "right".

public string TruncationSide { get; set; }

Property Value

string
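
The effect of MaxLength, the truncation and padding sides, and the attention mask can be sketched on a plain array of token IDs. This follows the conventional meaning of these options (right truncation drops tokens from the end, right padding appends pad tokens, and the mask is 1 for real tokens and 0 for padding). The SequenceFittingSketch class, the Fit method, and the pad ID of 0 are hypothetical and do not describe AiDotNet's internals.

```csharp
using System;
using System.Linq;

// Illustrative only: conventional truncation and padding, not AiDotNet code.
public static class SequenceFittingSketch
{
    // Assumes maxLength is non-negative.
    public static (int[] Ids, int[] AttentionMask) Fit(
        int[] tokenIds,
        int maxLength,
        string truncationSide = "right",
        string paddingSide = "right",
        int padId = 0)
    {
        // Truncate: "left" keeps the last maxLength tokens, anything else keeps the first.
        int[] kept = tokenIds;
        if (tokenIds.Length > maxLength)
        {
            kept = truncationSide == "left"
                ? tokenIds.Skip(tokenIds.Length - maxLength).ToArray()
                : tokenIds.Take(maxLength).ToArray();
        }

        // Pad up to maxLength and build the mask (1 = real token, 0 = padding).
        int padCount = maxLength - kept.Length;
        int[] pad = Enumerable.Repeat(padId, padCount).ToArray();
        int[] ones = Enumerable.Repeat(1, kept.Length).ToArray();
        int[] zeros = new int[padCount];

        return paddingSide == "left"
            ? (pad.Concat(kept).ToArray(), zeros.Concat(ones).ToArray())
            : (kept.Concat(pad).ToArray(), ones.Concat(zeros).ToArray());
    }
}
```

Left padding is commonly preferred for generation with decoder-only models, so that the final position always holds a real token; right padding is the usual choice for encoder models.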

Methods

ForBert(int)

Creates a configuration suitable for BERT-style models.

public static TokenizationConfig ForBert(int maxLength = 512)

Parameters

maxLength int

Returns

TokenizationConfig

ForCode(int)

Creates a configuration suitable for code tokenization.

public static TokenizationConfig ForCode(int maxLength = 2048)

Parameters

maxLength int

Returns

TokenizationConfig

ForGpt(int)

Creates a configuration suitable for GPT-style models.

public static TokenizationConfig ForGpt(int maxLength = 1024)

Parameters

maxLength int

Returns

TokenizationConfig
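
A short usage sketch of the factory methods, using only the members documented on this page. It assumes a project that references AiDotNet.dll and a C# version that supports top-level statements. The specific values chosen here (128, 4096, "left") are arbitrary examples, and which property values each preset assigns is not stated in this reference.

```csharp
using AiDotNet.Tokenization.Configuration;

// Presets with their default maximum lengths (512, 1024, and 2048 per the signatures).
TokenizationConfig bertConfig = TokenizationConfig.ForBert();
TokenizationConfig gptConfig = TokenizationConfig.ForGpt();
TokenizationConfig codeConfig = TokenizationConfig.ForCode();

// Presets with an explicit maximum length.
TokenizationConfig shortBertConfig = TokenizationConfig.ForBert(maxLength: 128);
TokenizationConfig longCodeConfig = TokenizationConfig.ForCode(4096);

// Every property has a public setter, so a preset can be adjusted afterwards.
shortBertConfig.ReturnTokenTypeIds = true;
gptConfig.PaddingSide = "left";
gptConfig.EnableCaching = true;
```

Starting from a preset and overriding individual properties keeps the model-specific defaults in one place while still allowing per-application tuning.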