Interface ITokenizer

Namespace
AiDotNet.Tokenization.Interfaces
Assembly
AiDotNet.dll

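Defines the contract for text tokenizers: splitting text into tokens, converting between tokens and token IDs, and encoding or decoding single inputs and batches.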

public interface ITokenizer

Properties

SpecialTokens

Gets the special tokens.

SpecialTokens SpecialTokens { get; }

Property Value

SpecialTokens

Vocabulary

Gets the vocabulary.

IVocabulary Vocabulary { get; }

Property Value

IVocabulary

VocabularySize

Gets the size of the vocabulary, that is, the number of tokens it contains.

int VocabularySize { get; }

Property Value

int

Methods

ConvertIdsToTokens(List<int>)

Converts token IDs to tokens.

List<string> ConvertIdsToTokens(List<int> ids)

Parameters

ids List<int>

The token IDs to convert.

Returns

List<string>

The tokens.

ConvertTokensToIds(List<string>)

Converts tokens to token IDs.

List<int> ConvertTokensToIds(List<string> tokens)

Parameters

tokens List<string>

The tokens to convert.

Returns

List<int>

The token IDs.
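
The two conversion methods are inverses over a fixed vocabulary. The following is a minimal, self-contained sketch of that lookup, not the AiDotNet implementation: `ToyVocabulary`, its three-entry token list, and the choice to map unknown tokens and out-of-range IDs to `[UNK]` are all assumptions made for illustration. Only the method names and signatures mirror the interface.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical toy vocabulary for illustration; not part of AiDotNet.
public static class ToyVocabulary
{
    // The index in this list is the token ID. ID 0 is the unknown token (an assumption).
    private static readonly List<string> IdToToken =
        new List<string> { "[UNK]", "hello", "world" };

    // Reverse lookup built from the list above.
    private static readonly Dictionary<string, int> TokenToId =
        IdToToken.Select((token, id) => (token, id))
                 .ToDictionary(pair => pair.token, pair => pair.id);

    // Same shape as ITokenizer.ConvertTokensToIds; unknown tokens map to ID 0.
    public static List<int> ConvertTokensToIds(List<string> tokens) =>
        tokens.Select(t => TokenToId.TryGetValue(t, out var id) ? id : 0).ToList();

    // Same shape as ITokenizer.ConvertIdsToTokens; out-of-range IDs map to "[UNK]".
    public static List<string> ConvertIdsToTokens(List<int> ids) =>
        ids.Select(i => i >= 0 && i < IdToToken.Count ? IdToToken[i] : IdToToken[0]).ToList();
}
```

In this sketch, converting `hello`, `world`, `xyz` yields the IDs 1, 2, 0, and converting those IDs back yields `hello`, `world`, `[UNK]`. The round trip is lossy for tokens outside the vocabulary. How a real implementation handles unknown tokens is not stated on this page.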

Decode(List<int>, bool)

Decodes token IDs back into text.

string Decode(List<int> tokenIds, bool skipSpecialTokens = true)

Parameters

tokenIds List<int>

The token IDs to decode.

skipSpecialTokens bool

Whether to omit special tokens from the decoded text. Defaults to true.

Returns

string

The decoded text.

DecodeBatch(List<List<int>>, bool)

Decodes multiple sequences of token IDs back into text.

List<string> DecodeBatch(List<List<int>> tokenIdsBatch, bool skipSpecialTokens = true)

Parameters

tokenIdsBatch List<List<int>>

The batch of token IDs to decode.

skipSpecialTokens bool

Whether to omit special tokens from the decoded texts. Defaults to true.

Returns

List<string>

The decoded texts.
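
To illustrate what `skipSpecialTokens` controls, here is a small self-contained sketch. It is not the library's decoder: `ToyDecoder`, its four-token vocabulary, the `[CLS]` and `[SEP]` markers, and joining tokens with a single space are assumptions chosen for the example. A real subword tokenizer would typically merge word pieces rather than join on spaces.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical toy decoder for illustration; not part of AiDotNet.
public static class ToyDecoder
{
    // The index in this list is the token ID.
    private static readonly List<string> IdToToken =
        new List<string> { "[CLS]", "[SEP]", "hello", "world" };

    // IDs treated as special tokens in this sketch (an assumption).
    private static readonly HashSet<int> SpecialIds = new HashSet<int> { 0, 1 };

    // Same shape as ITokenizer.Decode. IDs outside the toy vocabulary throw.
    public static string Decode(List<int> tokenIds, bool skipSpecialTokens = true)
    {
        var kept = tokenIds
            .Where(id => !(skipSpecialTokens && SpecialIds.Contains(id)))
            .Select(id => IdToToken[id]);
        return string.Join(" ", kept);
    }

    // Same shape as ITokenizer.DecodeBatch: decodes each sequence independently.
    public static List<string> DecodeBatch(List<List<int>> tokenIdsBatch, bool skipSpecialTokens = true) =>
        tokenIdsBatch.Select(ids => Decode(ids, skipSpecialTokens)).ToList();
}
```

With the IDs 0, 2, 3, 1, the sketch returns `hello world` by default and `[CLS] hello world [SEP]` when `skipSpecialTokens` is false.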

Encode(string, EncodingOptions?)

Encodes text into tokens.

TokenizationResult Encode(string text, EncodingOptions? options = null)

Parameters

text string

The text to encode.

options EncodingOptions

Optional encoding options. Defaults to null.

Returns

TokenizationResult

The tokenization result.

EncodeBatch(List<string>, EncodingOptions?)

Encodes multiple texts into tokens.

List<TokenizationResult> EncodeBatch(List<string> texts, EncodingOptions? options = null)

Parameters

texts List<string>

The texts to encode.

options EncodingOptions

Optional encoding options, applied to every text in the batch. Defaults to null.

Returns

List<TokenizationResult>

The tokenization results.

Tokenize(string)

Tokenizes text into subword tokens (without converting to IDs).

List<string> Tokenize(string text)

Parameters

text string

The text to tokenize.

Returns

List<string>

The list of tokens.
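
Because `Tokenize` stops before the ID lookup, it can be chained with `ConvertTokensToIds` to go from text to IDs in two explicit steps. The helper below is a sketch of that composition; `TokenizerPipeline` and `TextToIds` are hypothetical names, and the steps are passed as delegates so the example compiles without referencing AiDotNet. With a real tokenizer instance, the method groups `tokenizer.Tokenize` and `tokenizer.ConvertTokensToIds` match these delegate types.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of AiDotNet.
public static class TokenizerPipeline
{
    // Runs the two documented steps in order: text -> tokens -> token IDs.
    public static List<int> TextToIds(
        string text,
        Func<string, List<string>> tokenize,
        Func<List<string>, List<int>> convertTokensToIds)
    {
        List<string> tokens = tokenize(text);
        return convertTokensToIds(tokens);
    }
}
```

The IDs produced this way may differ from those produced by `Encode`, which takes encoding options and may, for example, add special tokens. That behavior is not specified on this page.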