Class TokenizerBase
- Namespace
- AiDotNet.Tokenization.Core
- Assembly
- AiDotNet.dll
Base class for tokenizers providing common functionality.
public abstract class TokenizerBase : ITokenizer
- Inheritance
- object
TokenizerBase
- Implements
- ITokenizer
- Derived
- Inherited Members
Constructors
TokenizerBase(IVocabulary, SpecialTokens)
Initializes a new instance of the TokenizerBase class.
protected TokenizerBase(IVocabulary vocabulary, SpecialTokens specialTokens)
Parameters
vocabulary IVocabulary
specialTokens SpecialTokens
Properties
SpecialTokens
Gets the special tokens.
public SpecialTokens SpecialTokens { get; protected set; }
Property Value
SpecialTokens
Vocabulary
Gets the vocabulary.
public IVocabulary Vocabulary { get; protected set; }
Property Value
IVocabulary
VocabularySize
Gets the vocabulary size.
public int VocabularySize { get; }
Property Value
int
Methods
AddSpecialTokensToSequence(List<string>)
Adds special tokens to a sequence.
protected virtual List<string> AddSpecialTokensToSequence(List<string> tokens)
Parameters
tokens List<string>
Returns
List<string>
CleanupTokens(List<string>)
Cleans up tokens and converts them back to text (must be implemented by derived classes).
protected abstract string CleanupTokens(List<string> tokens)
Parameters
tokens List<string>
Returns
string
ConvertIdsToTokens(List<int>)
Converts token IDs to tokens.
public virtual List<string> ConvertIdsToTokens(List<int> ids)
Parameters
ids List<int>
Returns
List<string>
ConvertTokensToIds(List<string>)
Converts tokens to token IDs.
public virtual List<int> ConvertTokensToIds(List<string> tokens)
Parameters
tokens List<string>
Returns
List<int>
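The two conversion methods above map between token strings and integer IDs. The following is a minimal stand-alone sketch of that idea, not the library's implementation: the `IdMappingSketch` class, the plain `Dictionary` lookups, and the fallback to an unknown-token entry are all assumptions made for illustration, and `IVocabulary` may behave differently.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of AiDotNet.
public static class IdMappingSketch
{
    // Maps each token to its ID, falling back to unknownId (an assumed convention)
    // when the token is missing from the lookup.
    public static List<int> TokensToIds(
        List<string> tokens, Dictionary<string, int> tokenToId, int unknownId)
    {
        var ids = new List<int>(tokens.Count);
        foreach (string token in tokens)
        {
            ids.Add(tokenToId.TryGetValue(token, out int id) ? id : unknownId);
        }
        return ids;
    }

    // Maps each ID back to its token, falling back to unknownToken (assumed)
    // when the ID is missing from the lookup.
    public static List<string> IdsToTokens(
        List<int> ids, Dictionary<int, string> idToToken, string unknownToken)
    {
        var tokens = new List<string>(ids.Count);
        foreach (int id in ids)
        {
            tokens.Add(idToToken.TryGetValue(id, out string? token) ? token : unknownToken);
        }
        return tokens;
    }
}
```

With a lookup of `hello` to 0, `world` to 1 and `[UNK]` to 2, calling `TokensToIds` on `hello there world` with `unknownId` 2 yields the IDs 0, 2, 1, so an out-of-vocabulary token does not round-trip to its original text.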
Decode(List<int>, bool)
Decodes token IDs back into text.
public virtual string Decode(List<int> tokenIds, bool skipSpecialTokens = true)
Parameters
tokenIds List<int>
skipSpecialTokens bool
Returns
string
DecodeBatch(List<List<int>>, bool)
Decodes multiple sequences of token IDs back into text.
public virtual List<string> DecodeBatch(List<List<int>> tokenIdsBatch, bool skipSpecialTokens = true)
Parameters
tokenIdsBatch List<List<int>>
skipSpecialTokens bool
Returns
List<string>
Encode(string, EncodingOptions?)
Encodes text into tokens.
public virtual TokenizationResult Encode(string text, EncodingOptions? options = null)
Parameters
text string
options EncodingOptions
Returns
TokenizationResult
EncodeBatch(List<string>, EncodingOptions?)
Encodes multiple texts into tokens.
public virtual List<TokenizationResult> EncodeBatch(List<string> texts, EncodingOptions? options = null)
Parameters
texts List<string>
options EncodingOptions
Returns
List<TokenizationResult>
Tokenize(string)
Tokenizes text into subword tokens (must be implemented by derived classes).
public abstract List<string> Tokenize(string text)
Parameters
text string
Returns
List<string>
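Tokenize(string) and CleanupTokens(List<string>) are the two abstract members a derived class must supply. The sketch below shows the simplest pair of bodies that could fill them, using whitespace splitting. It is written as a stand-alone static class so it compiles without the library; the `WhitespaceTokenizerSketch` name and the whitespace strategy are assumptions for illustration, and real subword tokenizers are considerably more involved.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of AiDotNet.
public static class WhitespaceTokenizerSketch
{
    private static readonly char[] Separators = { ' ', '\t', '\r', '\n' };

    // Same shape as Tokenize(string): text in, list of tokens out.
    public static List<string> Tokenize(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
        {
            return new List<string>();
        }

        // RemoveEmptyEntries collapses runs of whitespace into a single split point.
        string[] parts = text.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
        return new List<string>(parts);
    }

    // Same shape as CleanupTokens(List<string>): list of tokens in, text out.
    public static string CleanupTokens(List<string> tokens)
    {
        return string.Join(" ", tokens);
    }
}
```

In an actual subclass, these bodies would sit inside `public override List<string> Tokenize(string text)` and `protected override string CleanupTokens(List<string> tokens)`, with the constructor forwarding an `IVocabulary` and a `SpecialTokens` instance to the base constructor. Note that this pair normalizes whitespace, so the cleaned-up text is not guaranteed to equal the original input.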
TruncateSequence(List<string>, int, string)
Truncates a sequence to a maximum length.
protected virtual List<string> TruncateSequence(List<string> tokens, int maxLength, string side)
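The signature above takes a `side` string but this page does not list its accepted values. The following stand-alone sketch assumes the common convention that `"left"` drops tokens from the start and anything else drops them from the end; the `TruncationSketch` class, the accepted values, and the case-insensitive comparison are assumptions, not confirmed behavior of TruncateSequence.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of AiDotNet.
public static class TruncationSketch
{
    public static List<string> Truncate(List<string> tokens, int maxLength, string side)
    {
        if (maxLength < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(maxLength));
        }

        // Already short enough: return a copy so callers never share the input list.
        if (tokens.Count <= maxLength)
        {
            return new List<string>(tokens);
        }

        if (string.Equals(side, "left", StringComparison.OrdinalIgnoreCase))
        {
            // Assumed "left" behavior: keep the last maxLength tokens.
            return tokens.GetRange(tokens.Count - maxLength, maxLength);
        }

        // Assumed default ("right") behavior: keep the first maxLength tokens.
        return tokens.GetRange(0, maxLength);
    }
}
```

Under these assumptions, truncating the tokens `a b c d` to length 2 keeps `a b` for `"right"` and `c d` for `"left"`.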