Class TreeSitterTokenizer
- Namespace
- AiDotNet.Tokenization.CodeTokenization
- Assembly
- AiDotNet.dll
AST-aware tokenizer using Tree-sitter for parsing source code into syntax trees. Provides structure-aware tokenization that understands programming language grammar.
public sealed class TreeSitterTokenizer : TokenizerBase, ITokenizer, IDisposable
- Inheritance
- TokenizerBase
- TreeSitterTokenizer
- Implements
- ITokenizer
- IDisposable
Remarks
Tree-sitter is an incremental parsing library that builds concrete syntax trees for source code. Unlike simple regex-based tokenizers, Tree-sitter parses code according to the language's grammar, so each token can be emitted together with the syntactic context in which it appears.
For Beginners: Think of this tokenizer as a code-reading expert that truly understands programming languages. While simple tokenizers just split text on spaces and punctuation (like cutting a sentence into individual words), Tree-sitter actually reads and understands the code's structure.
For example, when parsing "function add(a, b) { return a + b; }":
- A simple tokenizer sees: ["function", "add", "(", "a", ",", "b", ")", "{", ...]
- Tree-sitter sees: A function declaration named "add" with parameters "a" and "b", containing a return statement with a binary expression.
This deeper understanding helps machine learning models learn code patterns more effectively, because tokens are tagged with their syntactic role (function names, variable names, operators, etc.) rather than treated as plain text.
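To make the "simple tokenizer" side of that comparison concrete, here is a minimal sketch of a regex-based splitter. `NaiveTokenizerSketch` is a hypothetical helper written for this illustration and is not part of AiDotNet. It produces the flat list shown above and has no notion of declarations, parameters, or expressions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical illustration only; not a type from AiDotNet.
public static class NaiveTokenizerSketch
{
    // Matches either a run of word characters or a single non-space symbol.
    private static readonly Regex TokenPattern = new Regex(@"\w+|[^\w\s]");

    public static List<string> Split(string source)
    {
        var tokens = new List<string>();
        foreach (Match match in TokenPattern.Matches(source))
        {
            tokens.Add(match.Value);
        }
        return tokens;
    }
}
```

Calling `NaiveTokenizerSketch.Split("function add(a, b) { return a + b; }")` returns 14 tokens starting with "function", "add", "(". Nothing in that list records that "add" is a function name or that "a" and "b" are parameters, which is the information a syntax tree adds.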
Constructors
TreeSitterTokenizer(ITokenizer, TreeSitterLanguage, bool, bool)
Creates a new Tree-sitter tokenizer for the specified programming language.
public TreeSitterTokenizer(ITokenizer baseTokenizer, TreeSitterLanguage language = TreeSitterLanguage.Python, bool includeNodeTypes = true, bool flattenTree = true)
Parameters
baseTokenizer (ITokenizer): The base tokenizer to use for subword tokenization of identifiers and literals.
language (TreeSitterLanguage): The programming language to parse.
includeNodeTypes (bool): Whether to include AST node types as prefix tokens (e.g., "[FUNC]", "[VAR]").
flattenTree (bool): Whether to flatten the AST into a sequence or preserve tree structure markers.
Remarks
For Beginners: The base tokenizer handles breaking down individual code elements (like variable names) into smaller pieces, while Tree-sitter handles understanding the overall code structure.
- Set includeNodeTypes=true if you want tokens prefixed with their syntactic role (helps the model understand what each token represents).
- Set flattenTree=true for a simple sequence of tokens, or false to include tree structure markers like "[BEGIN_FUNC]" and "[END_FUNC]".
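As a usage sketch for the constructor, the helper below asks for node-type prefixes and tree structure markers on Python source. It uses only the constructor, `Tokenize`, and `Dispose` members documented on this page. The `ConstructorSketch` class and its method name are hypothetical, and the `using` directive assumes `ITokenizer` is reachable from the namespaces you import, which may differ in your version of the library.

```csharp
using System.Collections.Generic;
using AiDotNet.Tokenization.CodeTokenization;
// Assumption: add whichever AiDotNet namespace declares ITokenizer in your version.

// Hypothetical helper for illustration; not part of AiDotNet.
public static class ConstructorSketch
{
    public static List<string> TokenizeWithStructureMarkers(
        ITokenizer baseTokenizer, string pythonSource)
    {
        // flattenTree: false requests tree structure markers in the output;
        // includeNodeTypes: true requests node-type prefix tokens.
        using var tokenizer = new TreeSitterTokenizer(
            baseTokenizer,
            language: TreeSitterLanguage.Python,
            includeNodeTypes: true,
            flattenTree: false);

        return tokenizer.Tokenize(pythonSource);
    }
}
```

The base tokenizer is passed in rather than constructed here because this page does not document any concrete `ITokenizer` implementation.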
Methods
CleanupTokens(List<string>)
Cleans up tokens and converts them back to text.
protected override string CleanupTokens(List<string> tokens)
Parameters
tokens (List<string>): The tokens to clean up and convert back to text.
Returns
- string
The reconstructed text.
CreateCSharp(ITokenizer, bool)
Creates a Tree-sitter tokenizer for C# code.
public static TreeSitterTokenizer CreateCSharp(ITokenizer baseTokenizer, bool includeNodeTypes = true)
Parameters
baseTokenizer (ITokenizer): The base tokenizer for subword tokenization.
includeNodeTypes (bool): Whether to include AST node type markers.
Returns
- TreeSitterTokenizer
A new TreeSitterTokenizer configured for C#.
CreateJava(ITokenizer, bool)
Creates a Tree-sitter tokenizer for Java code.
public static TreeSitterTokenizer CreateJava(ITokenizer baseTokenizer, bool includeNodeTypes = true)
Parameters
baseTokenizer (ITokenizer): The base tokenizer for subword tokenization.
includeNodeTypes (bool): Whether to include AST node type markers.
Returns
- TreeSitterTokenizer
A new TreeSitterTokenizer configured for Java.
CreateJavaScript(ITokenizer, bool)
Creates a Tree-sitter tokenizer for JavaScript code.
public static TreeSitterTokenizer CreateJavaScript(ITokenizer baseTokenizer, bool includeNodeTypes = true)
Parameters
baseTokenizer (ITokenizer): The base tokenizer for subword tokenization.
includeNodeTypes (bool): Whether to include AST node type markers.
Returns
- TreeSitterTokenizer
A new TreeSitterTokenizer configured for JavaScript.
CreatePython(ITokenizer, bool)
Creates a Tree-sitter tokenizer for Python code.
public static TreeSitterTokenizer CreatePython(ITokenizer baseTokenizer, bool includeNodeTypes = true)
Parameters
baseTokenizer (ITokenizer): The base tokenizer for subword tokenization.
includeNodeTypes (bool): Whether to include AST node type markers.
Returns
- TreeSitterTokenizer
A new TreeSitterTokenizer configured for Python.
Remarks
For Beginners: Use this factory method to quickly create a tokenizer for Python source code. Python is commonly used in data science and machine learning, so this is often a good default choice.
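The four factory methods differ only in the language they configure, so one plausible way to use them is to pick a factory from a file extension. The sketch below assumes that mapping. `FactorySketch` and `CreateForFile` are hypothetical names, and only the `CreatePython`, `CreateCSharp`, `CreateJava`, and `CreateJavaScript` calls come from this page.

```csharp
using System;
using System.IO;
using AiDotNet.Tokenization.CodeTokenization;
// Assumption: add whichever AiDotNet namespace declares ITokenizer in your version.

// Hypothetical helper for illustration; not part of AiDotNet.
public static class FactorySketch
{
    // The caller owns the returned tokenizer and should dispose it when done.
    public static TreeSitterTokenizer CreateForFile(
        ITokenizer baseTokenizer, string path, bool includeNodeTypes = true)
    {
        string extension = Path.GetExtension(path).ToLowerInvariant();
        switch (extension)
        {
            case ".py":
                return TreeSitterTokenizer.CreatePython(baseTokenizer, includeNodeTypes);
            case ".cs":
                return TreeSitterTokenizer.CreateCSharp(baseTokenizer, includeNodeTypes);
            case ".java":
                return TreeSitterTokenizer.CreateJava(baseTokenizer, includeNodeTypes);
            case ".js":
                return TreeSitterTokenizer.CreateJavaScript(baseTokenizer, includeNodeTypes);
            default:
                throw new NotSupportedException(
                    $"No factory method is documented here for '{extension}' files.");
        }
    }
}
```

A caller would typically write `using var tokenizer = FactorySketch.CreateForFile(baseTokenizer, "example.py");` so that the native parser is released at the end of the scope.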
Dispose()
Releases the resources used by the Tree-sitter parser.
public void Dispose()
Remarks
For Beginners: Always dispose of this tokenizer when you're done using it. The Tree-sitter parser uses native memory that needs to be freed. The best practice is to use a "using" statement:
using var tokenizer = TreeSitterTokenizer.CreatePython(baseTokenizer);
~TreeSitterTokenizer()
Finalizer to ensure resources are released.
~TreeSitterTokenizer()
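The pairing of `Dispose()` with a finalizer is the usual .NET pattern for a sealed class that holds native memory. The sketch below shows that general pattern on a hypothetical `NativeBufferSketch` class. It is not the TreeSitterTokenizer source, and the real class may organize its cleanup differently.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper around native memory, for illustration only.
public sealed class NativeBufferSketch : IDisposable
{
    private IntPtr _buffer;
    private bool _disposed;

    public NativeBufferSketch(int sizeInBytes)
    {
        // Unmanaged allocation: the garbage collector will not free this for us.
        _buffer = Marshal.AllocHGlobal(sizeInBytes);
    }

    public bool IsDisposed => _disposed;

    // Deterministic cleanup, called directly or by a using statement.
    public void Dispose()
    {
        ReleaseNativeMemory();
        // The finalizer has nothing left to do, so skip it.
        GC.SuppressFinalize(this);
    }

    // Safety net that runs only if Dispose was never called.
    ~NativeBufferSketch()
    {
        ReleaseNativeMemory();
    }

    private void ReleaseNativeMemory()
    {
        if (_disposed) return; // makes repeated calls harmless
        Marshal.FreeHGlobal(_buffer);
        _buffer = IntPtr.Zero;
        _disposed = true;
    }
}
```

The finalizer is only a fallback: it runs at an unpredictable time on the garbage collector's finalizer thread, which is why disposing explicitly remains the recommended path.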
Tokenize(string)
Tokenizes source code using AST-aware parsing.
public override List<string> Tokenize(string text)
Parameters
text (string): The source code to tokenize.
Returns
- List<string>
The list of tokens.
Remarks
The tokenization process:
1. Parse the source code into an AST using Tree-sitter.
2. Traverse the AST to extract meaningful nodes (identifiers, literals, keywords, operators).
3. Optionally prefix each token with its AST node type.
4. Apply the base tokenizer to break down complex tokens into subwords.
For Beginners: This method reads your code like a compiler would, building a tree structure that represents the code's meaning. Then it walks through that tree, collecting tokens in a way that preserves the semantic relationships.
For example, the code "x = 5 + 3" might produce:
- With includeNodeTypes=true: ["[IDENTIFIER]", "x", "[OPERATOR]", "=", "[NUMBER]", "5", ...]
- With includeNodeTypes=false: ["x", "=", "5", "+", "3"]
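The traversal, prefixing, and subword steps described above can be sketched without Tree-sitter by walking a small hand-built tree. Everything in this block is hypothetical: `SketchNode` stands in for a parsed syntax node, the node type names are chosen to match the example above, and a delegate plays the role of the base tokenizer. It illustrates the idea and is not the library's implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a syntax node; not a type from AiDotNet or Tree-sitter.
public sealed class SketchNode
{
    public string Type { get; }
    public string Text { get; }
    public IReadOnlyList<SketchNode> Children { get; }

    private SketchNode(string type, string text, IReadOnlyList<SketchNode> children)
    {
        Type = type;
        Text = text;
        Children = children;
    }

    public static SketchNode Leaf(string type, string text) =>
        new SketchNode(type, text, Array.Empty<SketchNode>());

    public static SketchNode Branch(string type, params SketchNode[] children) =>
        new SketchNode(type, string.Empty, children);
}

public static class AstFlattenSketch
{
    // Hand-built tree for "x = 5 + 3"; a real parser would produce this in step 1.
    public static SketchNode BuildExampleTree() =>
        SketchNode.Branch("ASSIGNMENT",
            SketchNode.Leaf("IDENTIFIER", "x"),
            SketchNode.Leaf("OPERATOR", "="),
            SketchNode.Branch("BINARY_EXPRESSION",
                SketchNode.Leaf("NUMBER", "5"),
                SketchNode.Leaf("OPERATOR", "+"),
                SketchNode.Leaf("NUMBER", "3")));

    public static List<string> Flatten(
        SketchNode root,
        bool includeNodeTypes,
        Func<string, IEnumerable<string>> subwordSplit)
    {
        var output = new List<string>();
        Visit(root, includeNodeTypes, subwordSplit, output);
        return output;
    }

    private static void Visit(
        SketchNode node,
        bool includeNodeTypes,
        Func<string, IEnumerable<string>> subwordSplit,
        List<string> output)
    {
        if (node.Children.Count == 0)
        {
            // Step 3: optionally prefix the leaf with its node type.
            if (includeNodeTypes) output.Add("[" + node.Type + "]");
            // Step 4: let the subword splitter break the leaf text down.
            output.AddRange(subwordSplit(node.Text));
            return;
        }

        // Step 2: depth-first, left-to-right traversal keeps source order.
        foreach (var child in node.Children)
        {
            Visit(child, includeNodeTypes, subwordSplit, output);
        }
    }
}
```

With an identity splitter such as `text => new[] { text }`, `Flatten(tree, false, ...)` yields "x", "=", "5", "+", "3", and `Flatten(tree, true, ...)` yields ten tokens beginning with "[IDENTIFIER]", "x", "[OPERATOR]", "=". The exact markers emitted by TreeSitterTokenizer may differ from this sketch.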