Class CodeTokenizer
- Namespace
- AiDotNet.Tokenization.CodeTokenization
- Assembly
- AiDotNet.dll
Code-aware tokenizer that handles programming language constructs. Supports identifier splitting, keyword recognition, and language-specific patterns.
public class CodeTokenizer : TokenizerBase, ITokenizer
- Inheritance
-
CodeTokenizer
- Implements
- ITokenizer
Constructors
CodeTokenizer(ITokenizer, ProgrammingLanguage, bool)
Creates a new code tokenizer.
public CodeTokenizer(ITokenizer baseTokenizer, ProgrammingLanguage language = ProgrammingLanguage.Generic, bool splitIdentifiers = true)
Parameters
baseTokenizer (ITokenizer): The base tokenizer to use for subword tokenization.
language (ProgrammingLanguage): The programming language. Defaults to ProgrammingLanguage.Generic.
splitIdentifiers (bool): Whether to split identifiers (camelCase, snake_case). Defaults to true.
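The splitIdentifiers option refers to breaking identifiers such as camelCase and snake_case names into their component words. The exact rules CodeTokenizer applies are not documented here, so the following is only a minimal sketch of one common approach. IdentifierSplitter and its Split method are hypothetical names introduced for illustration and are not part of AiDotNet.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, NOT part of AiDotNet: shows one common way to split
// camelCase / PascalCase / snake_case identifiers into word parts.
public static class IdentifierSplitter
{
    public static List<string> Split(string identifier)
    {
        var parts = new List<string>();
        var current = new StringBuilder();

        for (int i = 0; i < identifier.Length; i++)
        {
            char c = identifier[i];

            // snake_case: an underscore ends the current part and is dropped.
            if (c == '_')
            {
                Flush(parts, current);
                continue;
            }

            bool startsNewWord = false;
            if (current.Length > 0 && char.IsUpper(c))
            {
                char prev = identifier[i - 1];
                bool nextIsLower = i + 1 < identifier.Length
                    && char.IsLower(identifier[i + 1]);

                // Boundary after a lowercase letter or digit (parseHttp),
                // or at the end of an acronym run (HTTPServer -> HTTP, Server).
                startsNewWord = char.IsLower(prev)
                    || char.IsDigit(prev)
                    || (char.IsUpper(prev) && nextIsLower);
            }

            if (startsNewWord)
            {
                Flush(parts, current);
            }
            current.Append(c);
        }

        Flush(parts, current);
        return parts;
    }

    // Moves the accumulated characters into the result list, if any.
    private static void Flush(List<string> parts, StringBuilder current)
    {
        if (current.Length == 0) return;
        parts.Add(current.ToString());
        current.Clear();
    }
}
```

With this sketch, IdentifierSplitter.Split("parseHttpRequest") yields parse, Http, Request, and IdentifierSplitter.Split("max_token_count") yields max, token, count. In a real pipeline, each part would presumably then be passed to the base tokenizer for subword tokenization.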
Methods
CleanupTokens(List<string>)
Cleans up tokens and converts them back to code.
protected override string CleanupTokens(List<string> tokens)
Parameters
tokens (List<string>)
Returns
string
Encode(string, EncodingOptions?)
Encodes code into a tokenization result with best-effort character offsets.
public override TokenizationResult Encode(string text, EncodingOptions? options = null)
Parameters
text (string)
options (EncodingOptions)
Returns
TokenizationResult
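"Best-effort character offsets" suggests that token positions are recovered by locating each token in the original text, and that some tokens may not map back cleanly. How Encode actually computes offsets is not documented here. The sketch below only illustrates the general idea under that assumption. OffsetAligner, Align, and the (-1, -1) sentinel for unlocatable tokens are hypothetical and not part of AiDotNet.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, NOT part of AiDotNet: aligns tokens to character
// spans by scanning the source text left to right.
public static class OffsetAligner
{
    // Returns one (Start, End) span per token, End exclusive.
    // A token that cannot be found at or after the cursor gets (-1, -1).
    public static List<(int Start, int End)> Align(
        string text, IReadOnlyList<string> tokens)
    {
        var offsets = new List<(int Start, int End)>(tokens.Count);
        int cursor = 0; // search never moves backwards

        foreach (string token in tokens)
        {
            int start = token.Length == 0
                ? -1
                : text.IndexOf(token, cursor, StringComparison.Ordinal);

            if (start < 0)
            {
                // Best effort: record a sentinel and keep the cursor in place
                // so later tokens can still be aligned.
                offsets.Add((-1, -1));
                continue;
            }

            int end = start + token.Length;
            offsets.Add((start, end));
            cursor = end;
        }

        return offsets;
    }
}
```

For the text "int maxCount = 0;" and the tokens int, max, Count, =, 0, ; this sketch produces the spans (0, 3), (4, 7), (7, 12), (13, 14), (15, 16), (16, 17). A token that was altered during tokenization (for example, one carrying a subword prefix) would not be found and would receive the sentinel, which is why such offsets are only best-effort.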
Tokenize(string)
Tokenizes code with language-aware handling.
public override List<string> Tokenize(string text)
Parameters
text (string)