Class CodeTokenizer

Namespace
AiDotNet.Tokenization.CodeTokenization
Assembly
AiDotNet.dll

Code-aware tokenizer that handles programming language constructs. Supports identifier splitting, keyword recognition, and language-specific patterns.

public class CodeTokenizer : TokenizerBase, ITokenizer
Inheritance
TokenizerBase
CodeTokenizer
Implements
ITokenizer

Constructors

CodeTokenizer(ITokenizer, ProgrammingLanguage, bool)

Creates a new code tokenizer.

public CodeTokenizer(ITokenizer baseTokenizer, ProgrammingLanguage language = ProgrammingLanguage.Generic, bool splitIdentifiers = true)

Parameters

baseTokenizer ITokenizer

The base tokenizer to use for subword tokenization.

language ProgrammingLanguage

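The programming language whose keywords and patterns the tokenizer should recognize. Defaults to ProgrammingLanguage.Generic.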

splitIdentifiers bool

Whether to split camelCase and snake_case identifiers into their component words. Defaults to true.
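
The page does not show how identifiers are split internally. As a rough illustration of what camelCase and snake_case splitting usually means, here is a minimal standalone sketch. The class IdentifierSplitSketch and its Split method are hypothetical and not part of AiDotNet; the acronym handling (keeping "HTTP" together) is an assumption about the desired behavior, not a documented rule.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, NOT part of AiDotNet: one common way to split
// camelCase and snake_case identifiers into word parts.
public static class IdentifierSplitSketch
{
    public static List<string> Split(string identifier)
    {
        var parts = new List<string>();
        var current = new StringBuilder();

        for (int i = 0; i < identifier.Length; i++)
        {
            char c = identifier[i];

            // Underscores separate words and are dropped from the output.
            if (c == '_')
            {
                Flush(parts, current);
                continue;
            }

            // An uppercase letter starts a new word when it follows a
            // non-uppercase character ("getUser" -> get | User), or when it
            // ends an acronym run ("HTTPResponse" -> HTTP | Response).
            bool startsNewWord = current.Length > 0
                && char.IsUpper(c)
                && (!char.IsUpper(identifier[i - 1])
                    || (i + 1 < identifier.Length && char.IsLower(identifier[i + 1])));

            if (startsNewWord)
            {
                Flush(parts, current);
            }

            current.Append(c);
        }

        Flush(parts, current);
        return parts;
    }

    // Moves the buffered word (if any) into the result list.
    private static void Flush(List<string> parts, StringBuilder current)
    {
        if (current.Length > 0)
        {
            parts.Add(current.ToString());
            current.Clear();
        }
    }
}
```

Under these assumptions, IdentifierSplitSketch.Split("max_token_count") yields the parts max, token, count, and Split("parseHTTPResponse") yields parse, HTTP, Response. The actual CodeTokenizer may lowercase parts, keep separators, or treat digits differently.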

Methods

CleanupTokens(List<string>)

Cleans up tokens and converts them back to code.

protected override string CleanupTokens(List<string> tokens)

Parameters

tokens List<string>

The tokens to convert back to code.

Returns

string

Encode(string, EncodingOptions?)

Encodes code into a tokenization result with best-effort character offsets.

public override TokenizationResult Encode(string text, EncodingOptions? options = null)

Parameters

text string

The code to encode.

options EncodingOptions

Optional encoding options. Defaults to null.

Returns

TokenizationResult
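
The members of TokenizationResult are not listed on this page, so the following is only a sketch of what "best-effort character offsets" can mean in practice: each token is located by searching forward in the source text, and a token that cannot be found verbatim (for example, a subword piece with a marker prefix) receives a zero-length span at the current position. OffsetSketch and Align are hypothetical names, and the fallback rule is an assumption rather than the library's documented behavior.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, NOT part of AiDotNet: aligns tokens to character
// spans in the original text on a best-effort basis.
public static class OffsetSketch
{
    public static List<(int Start, int End)> Align(string text, IReadOnlyList<string> tokens)
    {
        var offsets = new List<(int Start, int End)>(tokens.Count);
        int cursor = 0; // Search never moves backwards past this position.

        foreach (string token in tokens)
        {
            // Empty tokens cannot be located meaningfully; treat as not found.
            int start = token.Length == 0
                ? -1
                : text.IndexOf(token, cursor, StringComparison.Ordinal);

            if (start < 0)
            {
                // Assumed fallback: zero-length span at the current position.
                offsets.Add((cursor, cursor));
                continue;
            }

            int end = start + token.Length; // End is exclusive.
            offsets.Add((start, end));
            cursor = end;
        }

        return offsets;
    }
}
```

With the text "int x = 42;" and the tokens int, x, =, 42, ; this sketch produces the spans (0, 3), (4, 5), (6, 7), (8, 10), (10, 11). Offsets computed this way are approximate whenever tokens have been normalized or split, which is likely why the method describes them as best-effort.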

Tokenize(string)

Tokenizes code with language-aware handling.

public override List<string> Tokenize(string text)

Parameters

text string

The source code to tokenize.

Returns

List<string>