Class TokenizerBase

Namespace
AiDotNet.Tokenization.Core
Assembly
AiDotNet.dll

Base class for tokenizers providing common functionality.

public abstract class TokenizerBase : ITokenizer
Inheritance
object
TokenizerBase
Implements
ITokenizer

Constructors

TokenizerBase(IVocabulary, SpecialTokens)

Initializes a new instance of the TokenizerBase class.

protected TokenizerBase(IVocabulary vocabulary, SpecialTokens specialTokens)

Parameters

vocabulary IVocabulary
specialTokens SpecialTokens
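
Because the constructor is protected and the class is abstract, TokenizerBase is used through a derived class. The following is a minimal sketch of such a class, not a tokenizer shipped with the library: WhitespaceTokenizer is a hypothetical name, and the whitespace splitting rule is chosen only for illustration. It implements the two abstract members documented on this page (Tokenize and CleanupTokens). The namespaces of IVocabulary and SpecialTokens are not stated here, so additional using directives may be needed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using AiDotNet.Tokenization.Core;

// Hypothetical example type; not part of AiDotNet.
public sealed class WhitespaceTokenizer : TokenizerBase
{
    // Forwards the vocabulary and special tokens to the protected base constructor.
    public WhitespaceTokenizer(IVocabulary vocabulary, SpecialTokens specialTokens)
        : base(vocabulary, specialTokens)
    {
    }

    // Splits on common whitespace characters; a real tokenizer would
    // typically produce subword units instead.
    public override List<string> Tokenize(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
        {
            return new List<string>();
        }

        return text
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .ToList();
    }

    // Joins tokens with single spaces to rebuild readable text.
    protected override string CleanupTokens(List<string> tokens)
    {
        return string.Join(" ", tokens);
    }
}
```

The remaining public members (Encode, Decode, and the batch and conversion methods) are virtual, so a derived class like this one inherits them unless it chooses to override them.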

Properties

SpecialTokens

Gets the special tokens.

public SpecialTokens SpecialTokens { get; protected set; }

Property Value

SpecialTokens

Vocabulary

Gets the vocabulary.

public IVocabulary Vocabulary { get; protected set; }

Property Value

IVocabulary

VocabularySize

Gets the vocabulary size.

public int VocabularySize { get; }

Property Value

int

Methods

AddSpecialTokensToSequence(List<string>)

Adds special tokens to a sequence.

protected virtual List<string> AddSpecialTokensToSequence(List<string> tokens)

Parameters

tokens List<string>

Returns

List<string>

CleanupTokens(List<string>)

Cleans up tokens and converts them back to text (must be implemented by derived classes).

protected abstract string CleanupTokens(List<string> tokens)

Parameters

tokens List<string>

Returns

string

ConvertIdsToTokens(List<int>)

Converts token IDs to tokens.

public virtual List<string> ConvertIdsToTokens(List<int> ids)

Parameters

ids List<int>

Returns

List<string>

ConvertTokensToIds(List<string>)

Converts tokens to token IDs.

public virtual List<int> ConvertTokensToIds(List<string> tokens)

Parameters

tokens List<string>

Returns

List<int>

Decode(List<int>, bool)

Decodes token IDs back into text.

public virtual string Decode(List<int> tokenIds, bool skipSpecialTokens = true)

Parameters

tokenIds List<int>
skipSpecialTokens bool

Returns

string
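
As a usage sketch, the signatures on this page can be chained to take text to token IDs and back. The helper below is hypothetical (TokenizerRoundTrip is not part of the library) and relies only on Tokenize, ConvertTokensToIds, and Decode as declared here. Whether the decoded text matches the input exactly depends on the concrete tokenizer and its vocabulary, which this page does not specify.

```csharp
using System.Collections.Generic;
using AiDotNet.Tokenization.Core;

// Hypothetical helper; not part of AiDotNet.
public static class TokenizerRoundTrip
{
    public static string RoundTrip(TokenizerBase tokenizer, string text)
    {
        // Text to tokens, using the derived class's Tokenize implementation.
        List<string> tokens = tokenizer.Tokenize(text);

        // Tokens to IDs.
        List<int> ids = tokenizer.ConvertTokensToIds(tokens);

        // IDs back to text. skipSpecialTokens defaults to true and is
        // passed explicitly here only for clarity.
        return tokenizer.Decode(ids, skipSpecialTokens: true);
    }
}
```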

DecodeBatch(List<List<int>>, bool)

Decodes multiple sequences of token IDs back into text.

public virtual List<string> DecodeBatch(List<List<int>> tokenIdsBatch, bool skipSpecialTokens = true)

Parameters

tokenIdsBatch List<List<int>>
skipSpecialTokens bool

Returns

List<string>

Encode(string, EncodingOptions?)

Encodes text into tokens.

public virtual TokenizationResult Encode(string text, EncodingOptions? options = null)

Parameters

text string
options EncodingOptions

Returns

TokenizationResult

EncodeBatch(List<string>, EncodingOptions?)

Encodes multiple texts into tokens.

public virtual List<TokenizationResult> EncodeBatch(List<string> texts, EncodingOptions? options = null)

Parameters

texts List<string>
options EncodingOptions

Returns

List<TokenizationResult>
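
A short sketch of calling the batch overload follows. BatchEncodingExample is a hypothetical helper, and the members of TokenizationResult are not documented on this page, so the example only inspects the returned list itself. It omits the options argument, which falls back to the default of null shown in the signature.

```csharp
using System;
using System.Collections.Generic;
using AiDotNet.Tokenization.Core;

// Hypothetical helper; not part of AiDotNet.
public static class BatchEncodingExample
{
    public static int EncodeAll(TokenizerBase tokenizer, List<string> texts)
    {
        // options is omitted, so the declared default (null) is used.
        var results = tokenizer.EncodeBatch(texts);

        // Report how many results came back for the given inputs.
        Console.WriteLine($"Encoded {results.Count} result(s) for {texts.Count} input text(s).");
        return results.Count;
    }
}
```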

Tokenize(string)

Tokenizes text into subword tokens (must be implemented by derived classes).

public abstract List<string> Tokenize(string text)

Parameters

text string

Returns

List<string>

TruncateSequence(List<string>, int, string)

Truncates a sequence to a maximum length.

protected virtual List<string> TruncateSequence(List<string> tokens, int maxLength, string side)

Parameters

tokens List<string>
maxLength int
side string

Returns

List<string>