Class TokenizationConfig
- Namespace
- AiDotNet.Tokenization.Configuration
- Assembly
- AiDotNet.dll
Configuration options for tokenization in the prediction pipeline.
public class TokenizationConfig
- Inheritance
- object
- TokenizationConfig
- Inherited Members
Remarks
For Beginners: Tokenization is the process of breaking text into smaller pieces (tokens) that a machine learning model can understand. Think of it like breaking a sentence into words, but sometimes words are further broken into subwords for better handling of unknown words.
Properties
AddSpecialTokens
Gets or sets whether to automatically add special tokens (like [CLS], [SEP]) during encoding. Default is true.
public bool AddSpecialTokens { get; set; }
Property Value
- bool
DefaultEncodingOptions
Gets or sets the default encoding options for tokenization.
public EncodingOptions DefaultEncodingOptions { get; set; }
Property Value
- EncodingOptions
EnableCaching
Gets or sets whether to cache tokenization results for repeated inputs. Default is false.
public bool EnableCaching { get; set; }
Property Value
- bool
EnableParallelBatchProcessing
Gets or sets whether to use parallel processing for batch tokenization. Default is true. Parallel processing only takes effect for batches that reach ParallelBatchThreshold.
public bool EnableParallelBatchProcessing { get; set; }
Property Value
- bool
MaxLength
Gets or sets the maximum sequence length for tokenization. Sequences longer than this will be truncated.
public int? MaxLength { get; set; }
Property Value
- int?
Padding
Gets or sets whether to pad sequences to the maximum length.
public bool Padding { get; set; }
Property Value
- bool
PaddingSide
Gets or sets the side on which to pad sequences ("left" or "right"). Default is "right".
public string PaddingSide { get; set; }
Property Value
- string
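To make the difference between "left" and "right" padding concrete, here is a minimal standalone sketch. The Pad helper and the pad token ID of 0 are hypothetical and are not part of AiDotNet; the library's own padding logic may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not an AiDotNet API): pads token IDs up to maxLength.
// padId = 0 is an assumed padding token ID, chosen only for illustration.
static List<int> Pad(IReadOnlyList<int> ids, int maxLength, string paddingSide, int padId = 0)
{
    int missing = Math.Max(0, maxLength - ids.Count);
    IEnumerable<int> padding = Enumerable.Repeat(padId, missing);

    // "left" puts the padding before the tokens; anything else pads on the right.
    return paddingSide == "left"
        ? padding.Concat(ids).ToList()
        : ids.Concat(padding).ToList();
}

var ids = new List<int> { 101, 7592, 102 };
Console.WriteLine(string.Join(" ", Pad(ids, 5, "right"))); // 101 7592 102 0 0
Console.WriteLine(string.Join(" ", Pad(ids, 5, "left")));  // 0 0 101 7592 102
```

Right padding keeps the real tokens at the start of the sequence, while left padding keeps them at the end, which is often preferred when a decoder-only model generates text for a batch.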
ParallelBatchThreshold
Gets or sets the minimum batch size to trigger parallel processing. Default is 32.
public int ParallelBatchThreshold { get; set; }
Property Value
- int
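The following standalone sketch illustrates how a batch-size threshold might gate parallel work. It does not use AiDotNet: the two local variables stand in for EnableParallelBatchProcessing and ParallelBatchThreshold, CountTokens is a placeholder tokenizer, and treating the threshold as inclusive is an assumption based on the phrase "minimum batch size".

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-ins for the two settings documented above.
bool enableParallelBatchProcessing = true;
int parallelBatchThreshold = 32;

string[] batch = Enumerable.Range(0, 40).Select(i => $"sample text {i}").ToArray();
int[] tokenCounts = new int[batch.Length];

// Placeholder "tokenizer": counts whitespace-separated pieces.
static int CountTokens(string text) =>
    text.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;

// Assumption: the threshold is inclusive (a batch of exactly 32 runs in parallel).
bool useParallel = enableParallelBatchProcessing && batch.Length >= parallelBatchThreshold;

if (useParallel)
{
    // Each iteration writes to its own slot, so no locking is needed.
    Parallel.For(0, batch.Length, i => tokenCounts[i] = CountTokens(batch[i]));
}
else
{
    for (int i = 0; i < batch.Length; i++)
    {
        tokenCounts[i] = CountTokens(batch[i]);
    }
}

Console.WriteLine($"parallel: {useParallel}, total tokens: {tokenCounts.Sum()}");
```

A threshold like this avoids paying thread-scheduling overhead on small batches, where a simple loop is usually faster.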
ReturnAttentionMask
Gets or sets whether to return attention masks. Default is true.
public bool ReturnAttentionMask { get; set; }
Property Value
- bool
ReturnTokenTypeIds
Gets or sets whether to return token type IDs (for models like BERT with multiple segments). Default is false.
public bool ReturnTokenTypeIds { get; set; }
Property Value
- bool
Truncation
Gets or sets whether to truncate sequences that exceed max length.
public bool Truncation { get; set; }
Property Value
- bool
TruncationSide
Gets or sets the side on which to truncate sequences ("left" or "right"). Default is "right".
public string TruncationSide { get; set; }
Property Value
- string
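As a rough illustration of how the properties above fit together, the sketch below builds a configuration with an object initializer. It assumes that TokenizationConfig exposes a public parameterless constructor, which this page does not state; the specific values are arbitrary examples, not recommendations.

```csharp
using AiDotNet.Tokenization.Configuration;

// Assumption: TokenizationConfig has a public parameterless constructor.
var config = new TokenizationConfig
{
    // Sequence length handling
    MaxLength = 256,
    Truncation = true,
    TruncationSide = "right",
    Padding = true,
    PaddingSide = "right",

    // Encoder outputs
    AddSpecialTokens = true,
    ReturnAttentionMask = true,
    ReturnTokenTypeIds = false,

    // Performance settings
    EnableCaching = true,
    EnableParallelBatchProcessing = true,
    ParallelBatchThreshold = 64
};
```

Properties left out of the initializer keep their defaults, so in practice only the settings that differ from the defaults need to be listed.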
Methods
ForBert(int)
Creates a configuration suitable for BERT-style models.
public static TokenizationConfig ForBert(int maxLength = 512)
Parameters
maxLength int
Returns
- TokenizationConfig
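A short usage sketch for this factory method follows. It relies only on the signature shown above. Which properties ForBert sets beyond the maximum length is not documented here, so the example prints MaxLength instead of assuming a value.

```csharp
using System;
using AiDotNet.Tokenization.Configuration;

// Uses the default maxLength of 512 from the signature above.
TokenizationConfig bertConfig = TokenizationConfig.ForBert();

// A shorter limit, for example for sentence-level classification inputs.
TokenizationConfig shortConfig = TokenizationConfig.ForBert(128);

// Inspect the result rather than assuming what the factory sets.
Console.WriteLine(bertConfig.MaxLength);
Console.WriteLine(shortConfig.MaxLength);
```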
ForCode(int)
Creates a configuration suitable for code tokenization.
public static TokenizationConfig ForCode(int maxLength = 2048)
Parameters
maxLength int
Returns
- TokenizationConfig
ForGpt(int)
Creates a configuration suitable for GPT-style models.
public static TokenizationConfig ForGpt(int maxLength = 1024)
Parameters
maxLength int
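Because the factory methods return an ordinary TokenizationConfig with settable properties, a preset can be adjusted after creation. The sketch below uses ForGpt and ForCode this way. The chosen overrides are illustrative only, and whether the presets already apply them is not stated on this page.

```csharp
using AiDotNet.Tokenization.Configuration;

// Start from the GPT-style preset with a longer limit than the default 1024.
TokenizationConfig gptConfig = TokenizationConfig.ForGpt(maxLength: 2048);

// Illustrative override: left padding is a common choice for batched
// generation with decoder-only models. The preset may already set this.
gptConfig.PaddingSide = "left";

// Start from the code preset (default maxLength of 2048) and enable caching,
// on the assumption that the same source snippets are tokenized repeatedly.
TokenizationConfig codeConfig = TokenizationConfig.ForCode();
codeConfig.EnableCaching = true;
```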