Namespace AiDotNet.Tokenization.CodeTokenization

Classes

CodeBertTokenizer

CodeBERT-compatible tokenizer for program synthesis and code understanding tasks. Combines WordPiece tokenization with code-aware preprocessing.
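
To make the WordPiece step concrete, here is a minimal Python sketch of the standard greedy longest-match WordPiece algorithm. It is not taken from AiDotNet: the function name `wordpiece`, the toy vocabulary, and the `[UNK]` fallback are assumptions for illustration, and CodeBertTokenizer's actual preprocessing and vocabulary may differ.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first WordPiece split of a single word.

    Hypothetical helper for illustration only; not the AiDotNet API.
    Continuation pieces carry the conventional "##" prefix.
    """
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No vocabulary piece fits here: the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (assumed, for demonstration).
toy_vocab = {"token", "##izer", "##s"}
print(wordpiece("tokenizers", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

With this toy vocabulary the first call splits the word into a leading piece and two continuation pieces, and the second call falls back to the unknown token because no piece matches.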

CodeTokenizer

Code-aware tokenizer that handles programming language constructs. Supports identifier splitting, keyword recognition, and language-specific patterns.
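
Identifier splitting can be sketched in a few lines of Python. This is an illustration of the general technique, not CodeTokenizer's implementation: the helper name `split_identifier`, the regular expression, and the choice to lowercase the parts are all assumptions.

```python
import re
from typing import List

# Hypothetical pattern, not from AiDotNet. Alternatives, in order:
#   1. an acronym followed by a capitalized word ("HTTP" in "HTTPResponse")
#   2. an optionally capitalized lowercase word ("parse", "Response")
#   3. a trailing run of capitals ("HTML")
#   4. a run of digits
_SUBWORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(identifier: str) -> List[str]:
    """Split a camelCase, PascalCase, or snake_case identifier into sub-words."""
    parts: List[str] = []
    for chunk in identifier.split("_"):
        parts.extend(_SUBWORD.findall(chunk))
    return [part.lower() for part in parts]


print(split_identifier("parseHTTPResponse"))
print(split_identifier("max_retry_count"))
```

Splitting identifiers this way lets a downstream vocabulary share entries such as `parse` and `response` across many differently named symbols.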

TreeSitterTokenizer

AST-aware tokenizer using Tree-sitter for parsing source code into syntax trees. Provides structure-aware tokenization that understands programming language grammar.

Enums

ProgrammingLanguage

Programming languages supported by the code tokenizer.

TreeSitterLanguage

Supported programming languages for Tree-sitter parsing.