Interface ITokenizer
- Namespace: AiDotNet.Tokenization.Interfaces
- Assembly: AiDotNet.dll
Interface for text tokenizers.
public interface ITokenizer
Properties
SpecialTokens
Gets the special tokens.
SpecialTokens SpecialTokens { get; }
Property Value
- SpecialTokens
Vocabulary
Gets the vocabulary.
IVocabulary Vocabulary { get; }
Property Value
- IVocabulary
VocabularySize
Gets the vocabulary size.
int VocabularySize { get; }
Property Value
- int
Methods
ConvertIdsToTokens(List<int>)
Converts token IDs to tokens.
List<string> ConvertIdsToTokens(List<int> ids)
Parameters
- ids List<int>
Returns
- List<string>
ConvertTokensToIds(List<string>)
Converts tokens to token IDs.
List<int> ConvertTokensToIds(List<string> tokens)
Parameters
- tokens List<string>
Returns
- List<int>
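The two conversion methods above are inverses over known tokens. The sketch below illustrates that round trip with a hypothetical `ToyTokenConverter` class; it is not an AiDotNet type and only mirrors the two signatures shown above. The `[UNK]` fallback at ID 0 is an assumption for illustration, since this page does not say how the library handles out-of-vocabulary tokens.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in, NOT an AiDotNet type. It mirrors only the
// ConvertTokensToIds / ConvertIdsToTokens signatures documented above.
public sealed class ToyTokenConverter
{
    // Assumed placeholder for unknown tokens; always stored at ID 0.
    private const string UnknownToken = "[UNK]";

    private readonly List<string> _idToToken;
    private readonly Dictionary<string, int> _tokenToId;

    public ToyTokenConverter(IEnumerable<string> vocabulary)
    {
        _idToToken = new List<string> { UnknownToken };
        _idToToken.AddRange(vocabulary);

        _tokenToId = new Dictionary<string, int>();
        for (int i = 0; i < _idToToken.Count; i++)
        {
            _tokenToId[_idToToken[i]] = i;
        }
    }

    // Maps each token to its ID; unseen tokens map to the [UNK] ID (0).
    public List<int> ConvertTokensToIds(List<string> tokens) =>
        tokens.Select(t => _tokenToId.TryGetValue(t, out var id) ? id : 0).ToList();

    // Maps each ID back to its token; out-of-range IDs map to [UNK].
    public List<string> ConvertIdsToTokens(List<int> ids) =>
        ids.Select(id => id >= 0 && id < _idToToken.Count ? _idToToken[id] : UnknownToken)
           .ToList();
}

public static class Program
{
    public static void Main()
    {
        var converter = new ToyTokenConverter(new[] { "hello", "world" });

        var ids = converter.ConvertTokensToIds(new List<string> { "hello", "world", "unseen" });
        Console.WriteLine(string.Join(", ", ids));    // 1, 2, 0

        var tokens = converter.ConvertIdsToTokens(ids);
        Console.WriteLine(string.Join(", ", tokens)); // hello, world, [UNK]
    }
}
```

Note that the round trip is lossy for the unseen token: it comes back as `[UNK]` in this toy, so callers that need the original text should keep it alongside the IDs.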
Decode(List<int>, bool)
Decodes token IDs back into text.
string Decode(List<int> tokenIds, bool skipSpecialTokens = true)
Parameters
- tokenIds List<int>
The token IDs to decode.
- skipSpecialTokens bool
Whether to skip special tokens in the output.
Returns
- string
The decoded text.
DecodeBatch(List<List<int>>, bool)
Decodes multiple sequences of token IDs back into text.
List<string> DecodeBatch(List<List<int>> tokenIdsBatch, bool skipSpecialTokens = true)
Parameters
- tokenIdsBatch List<List<int>>
The batch of token IDs to decode.
- skipSpecialTokens bool
Whether to skip special tokens in the output.
Returns
- List<string>
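To make the effect of `skipSpecialTokens` concrete, here is a minimal sketch using a hypothetical `ToyDecoder` class. The vocabulary, the choice of `[CLS]`/`[SEP]` as special tokens, the space-joined output, and the idea that batch decoding simply applies `Decode` to each sequence are all assumptions for illustration; this page does not confirm how AiDotNet implements either method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical toy decoder, NOT the AiDotNet implementation.
// It mirrors only the Decode / DecodeBatch signatures documented above.
public sealed class ToyDecoder
{
    // Assumed toy vocabulary; IDs 0 and 1 play the role of special tokens.
    private static readonly string[] IdToToken = { "[CLS]", "[SEP]", "hello", "world", "again" };
    private static readonly HashSet<int> SpecialIds = new HashSet<int> { 0, 1 };

    // Joins the tokens with spaces, optionally dropping special tokens first.
    public string Decode(List<int> tokenIds, bool skipSpecialTokens = true)
    {
        var tokens = tokenIds
            .Where(id => !(skipSpecialTokens && SpecialIds.Contains(id)))
            .Select(id => IdToToken[id]);
        return string.Join(" ", tokens);
    }

    // Decodes each sequence independently, preserving batch order.
    public List<string> DecodeBatch(List<List<int>> tokenIdsBatch, bool skipSpecialTokens = true) =>
        tokenIdsBatch.Select(ids => Decode(ids, skipSpecialTokens)).ToList();
}

public static class Program
{
    public static void Main()
    {
        var decoder = new ToyDecoder();
        var batch = new List<List<int>>
        {
            new List<int> { 0, 2, 3, 1 },
            new List<int> { 0, 2, 4, 1 },
        };

        foreach (var text in decoder.DecodeBatch(batch))
        {
            Console.WriteLine(text); // "hello world", then "hello again"
        }

        // With skipSpecialTokens disabled, the markers stay in the output.
        Console.WriteLine(decoder.Decode(batch[0], skipSpecialTokens: false)); // [CLS] hello world [SEP]
    }
}
```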
Encode(string, EncodingOptions?)
Encodes text into tokens.
TokenizationResult Encode(string text, EncodingOptions? options = null)
Parameters
- text string
The text to encode.
- options EncodingOptions
Encoding options.
Returns
- TokenizationResult
The tokenization result.
EncodeBatch(List<string>, EncodingOptions?)
Encodes multiple texts into tokens.
List<TokenizationResult> EncodeBatch(List<string> texts, EncodingOptions? options = null)
Parameters
- texts List<string>
The texts to encode.
- options EncodingOptions
Encoding options.
Returns
- List<TokenizationResult>
The tokenization results.
Tokenize(string)
Tokenizes text into subword tokens (without converting to IDs).
List<string> Tokenize(string text)
Parameters
- text string
The text to tokenize.