Class TreeSitterTokenizer

Namespace
AiDotNet.Tokenization.CodeTokenization
Assembly
AiDotNet.dll

AST-aware tokenizer using Tree-sitter for parsing source code into syntax trees. Provides structure-aware tokenization that understands programming language grammar.

public sealed class TreeSitterTokenizer : TokenizerBase, ITokenizer, IDisposable
Inheritance
TokenizerBase
TreeSitterTokenizer
Implements
ITokenizer
IDisposable

Remarks

Tree-sitter is an incremental parsing library that builds concrete syntax trees for source code. Unlike simple regex-based tokenizers, Tree-sitter understands the actual structure of code, enabling more intelligent tokenization that preserves semantic meaning.

For Beginners: Think of this tokenizer as a code-reading expert that truly understands programming languages. While simple tokenizers just split text on spaces and punctuation (like cutting a sentence into individual words), Tree-sitter actually reads and understands the code's structure.

For example, when parsing "function add(a, b) { return a + b; }":

  • A simple tokenizer sees: ["function", "add", "(", "a", ",", "b", ")", "{", ...]
  • Tree-sitter sees: A function declaration named "add" with parameters "a" and "b", containing a return statement with a binary expression.

This deeper understanding helps machine learning models learn code patterns more effectively, because each token can be tied to its syntactic role (function name, variable name, operator, and so on) rather than treated as bare text.
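To make the "simple tokenizer" side of the comparison concrete, here is a minimal sketch of a purely lexical tokenizer. It is not part of AiDotNet; the SimpleTokenize helper is a hypothetical name used only for illustration, and it relies on nothing but the standard .NET regex classes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Naive lexical tokenizer (illustrative only): a token is either a run of
// word characters or a single non-space punctuation character.
// It has no notion of functions, parameters, or statements.
static List<string> SimpleTokenize(string source) =>
    Regex.Matches(source, @"\w+|[^\w\s]")
         .Cast<Match>()
         .Select(m => m.Value)
         .ToList();

var tokens = SimpleTokenize("function add(a, b) { return a + b; }");

// Prints: function | add | ( | a | , | b | ) | { | return | a | + | b | ; | }
Console.WriteLine(string.Join(" | ", tokens));
```

The output is a flat list of 14 strings. Nothing in it records that "add" is a function name or that "a + b" is a binary expression, which is the information a syntax tree adds.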

Constructors

TreeSitterTokenizer(ITokenizer, TreeSitterLanguage, bool, bool)

Creates a new Tree-sitter tokenizer for the specified programming language.

public TreeSitterTokenizer(ITokenizer baseTokenizer, TreeSitterLanguage language = TreeSitterLanguage.Python, bool includeNodeTypes = true, bool flattenTree = true)

Parameters

baseTokenizer ITokenizer

The base tokenizer to use for subword tokenization of identifiers and literals.

language TreeSitterLanguage

The programming language to parse.

includeNodeTypes bool

Whether to include AST node types as prefix tokens (e.g., "[FUNC]", "[VAR]").

flattenTree bool

Whether to flatten the AST into a plain token sequence (true) or keep tree structure markers in the output (false).

Remarks

For Beginners: The base tokenizer handles breaking down individual code elements (like variable names) into smaller pieces, while Tree-sitter handles understanding the overall code structure.

  • Set includeNodeTypes=true if you want tokens prefixed with their syntactic role (helps the model understand what each token represents).
  • Set flattenTree=true for a simple sequence of tokens, or false to include tree structure markers like "[BEGIN_FUNC]" and "[END_FUNC]".
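The effect of the two flags can be sketched on a toy tree. This is an illustration of the idea only, not the library's implementation: the Node record, the Emit and Leaf helpers, and the labels other than "[BEGIN_FUNC]" and "[END_FUNC]" are assumptions, and the tree is built by hand instead of being parsed.

```csharp
using System;
using System.Collections.Generic;

// Walk a toy tree and emit tokens (illustrative only).
static void Emit(Node node, bool includeNodeTypes, bool flattenTree, List<string> output)
{
    if (node.Children.Count == 0)
    {
        // Leaf: optionally prefix the text with its node-type marker.
        if (includeNodeTypes) output.Add($"[{node.Type}]");
        output.Add(node.Text);
        return;
    }

    // Interior node: bracket its children only when the tree is not flattened.
    if (!flattenTree) output.Add($"[BEGIN_{node.Type}]");
    foreach (var child in node.Children)
        Emit(child, includeNodeTypes, flattenTree, output);
    if (!flattenTree) output.Add($"[END_{node.Type}]");
}

static Node Leaf(string type, string text) => new(type, text, new List<Node>());

// Hand-built stand-in for a parsed "def add(a, b)" header; no real parsing happens here.
var tree = new Node("FUNC", "", new List<Node>
{
    Leaf("KEYWORD", "def"),
    Leaf("NAME", "add"),
    Leaf("VAR", "a"),
    Leaf("VAR", "b"),
});

var flat = new List<string>();
Emit(tree, includeNodeTypes: true, flattenTree: true, output: flat);
// Prints: [KEYWORD] def [NAME] add [VAR] a [VAR] b
Console.WriteLine(string.Join(" ", flat));

var nested = new List<string>();
Emit(tree, includeNodeTypes: false, flattenTree: false, output: nested);
// Prints: [BEGIN_FUNC] def add a b [END_FUNC]
Console.WriteLine(string.Join(" ", nested));

// Toy node type: a label, the source text (leaves only), and child nodes.
record Node(string Type, string Text, List<Node> Children);
```

The exact marker vocabulary and how the real tokenizer combines the two flags may differ; the sketch only shows why a flattened sequence is shorter while the bracketed form lets a model recover nesting.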

Methods

CleanupTokens(List<string>)

Cleans up tokens and converts them back to text.

protected override string CleanupTokens(List<string> tokens)

Parameters

tokens List<string>

The tokens to clean up.

Returns

string

The reconstructed text.

CreateCSharp(ITokenizer, bool)

Creates a Tree-sitter tokenizer for C# code.

public static TreeSitterTokenizer CreateCSharp(ITokenizer baseTokenizer, bool includeNodeTypes = true)

Parameters

baseTokenizer ITokenizer

The base tokenizer for subword tokenization.

includeNodeTypes bool

Whether to include AST node type markers.

Returns

TreeSitterTokenizer

A new TreeSitterTokenizer configured for C#.

CreateJava(ITokenizer, bool)

Creates a Tree-sitter tokenizer for Java code.

public static TreeSitterTokenizer CreateJava(ITokenizer baseTokenizer, bool includeNodeTypes = true)

Parameters

baseTokenizer ITokenizer

The base tokenizer for subword tokenization.

includeNodeTypes bool

Whether to include AST node type markers.

Returns

TreeSitterTokenizer

A new TreeSitterTokenizer configured for Java.

CreateJavaScript(ITokenizer, bool)

Creates a Tree-sitter tokenizer for JavaScript code.

public static TreeSitterTokenizer CreateJavaScript(ITokenizer baseTokenizer, bool includeNodeTypes = true)

Parameters

baseTokenizer ITokenizer

The base tokenizer for subword tokenization.

includeNodeTypes bool

Whether to include AST node type markers.

Returns

TreeSitterTokenizer

A new TreeSitterTokenizer configured for JavaScript.

CreatePython(ITokenizer, bool)

Creates a Tree-sitter tokenizer for Python code.

public static TreeSitterTokenizer CreatePython(ITokenizer baseTokenizer, bool includeNodeTypes = true)

Parameters

baseTokenizer ITokenizer

The base tokenizer for subword tokenization.

includeNodeTypes bool

Whether to include AST node type markers.

Returns

TreeSitterTokenizer

A new TreeSitterTokenizer configured for Python.

Remarks

For Beginners: Use this factory method to quickly create a tokenizer for Python source code. Python is commonly used in data science and machine learning, so this is often a good default choice.

Dispose()

Releases the resources used by the Tree-sitter parser.

public void Dispose()

Remarks

For Beginners: Always dispose of this tokenizer when you're done using it. The Tree-sitter parser uses native memory that needs to be freed. The best practice is to use a "using" statement:

using var tokenizer = TreeSitterTokenizer.CreatePython(baseTokenizer);

~TreeSitterTokenizer()

Finalizer to ensure resources are released.

~TreeSitterTokenizer()
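A Dispose method paired with a finalizer usually follows the standard .NET disposal pattern for a sealed class. The sketch below shows that general pattern on a hypothetical NativeBufferOwner type that allocates unmanaged memory through Marshal; it is an assumption about how such a type is commonly written, not a copy of the TreeSitterTokenizer source.

```csharp
using System;
using System.Runtime.InteropServices;

// Deterministic cleanup: Dispose runs at the end of the using block.
using (var owner = new NativeBufferOwner(64))
{
    Console.WriteLine(owner.IsDisposed); // Prints: False
}

// Hypothetical stand-in for a type that owns native memory.
sealed class NativeBufferOwner : IDisposable
{
    private IntPtr _buffer;

    public NativeBufferOwner(int bytes) => _buffer = Marshal.AllocHGlobal(bytes);

    public bool IsDisposed => _buffer == IntPtr.Zero;

    public void Dispose()
    {
        Release();
        // Cleanup already happened, so the finalizer no longer needs to run.
        GC.SuppressFinalize(this);
    }

    // Safety net for callers that never call Dispose.
    ~NativeBufferOwner() => Release();

    private void Release()
    {
        if (_buffer == IntPtr.Zero) return; // already released; safe to call twice
        Marshal.FreeHGlobal(_buffer);
        _buffer = IntPtr.Zero;
    }
}
```

The finalizer only runs at some unspecified time after the object becomes unreachable, so relying on it can keep native memory alive far longer than needed. Calling Dispose, directly or through a using statement, releases the memory at a known point.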

Tokenize(string)

Tokenizes source code using AST-aware parsing.

public override List<string> Tokenize(string text)

Parameters

text string

The source code to tokenize.

Returns

List<string>

A list of tokens representing the code structure.

Remarks

The tokenization process:

  1. Parse the source code into an AST using Tree-sitter.
  2. Traverse the AST to extract meaningful nodes (identifiers, literals, keywords, operators).
  3. Optionally prefix each token with its AST node type.
  4. Apply the base tokenizer to break down complex tokens into subwords.

For Beginners: This method reads your code like a compiler would, building a tree structure that represents the code's meaning. Then it walks through that tree, collecting tokens in a way that preserves the semantic relationships.

For example, the code "x = 5 + 3" might produce:

  • With includeNodeTypes=true: ["[IDENTIFIER]", "x", "[OPERATOR]", "=", "[NUMBER]", "5", ...]
  • With includeNodeTypes=false: ["x", "=", "5", "+", "3"]
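The last two steps of the process (node-type prefixes, then subword splitting) can be sketched without a real parser. Everything here is a stand-in: the leaf list is written by hand in place of a Tree-sitter parse, SplitSubwords is a hypothetical substitute for the base tokenizer that simply splits camelCase identifiers, and Linearize is an illustrative name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Stand-in for the base tokenizer (hypothetical): split an identifier at
// lowercase-to-uppercase boundaries, e.g. "maxValue" -> "max", "Value".
static IEnumerable<string> SplitSubwords(string text) =>
    Regex.Split(text, @"(?<=[a-z])(?=[A-Z])");

// Optionally emit the node-type marker, then the subword pieces of the leaf text.
static List<string> Linearize(IEnumerable<(string Type, string Text)> leaves, bool includeNodeTypes)
{
    var result = new List<string>();
    foreach (var (type, text) in leaves)
    {
        if (includeNodeTypes) result.Add($"[{type}]");
        result.AddRange(SplitSubwords(text));
    }
    return result;
}

// Parsing and traversal are replaced by a hand-written leaf list for "maxValue = 5 + 3".
var leaves = new[]
{
    ("IDENTIFIER", "maxValue"),
    ("OPERATOR", "="),
    ("NUMBER", "5"),
    ("OPERATOR", "+"),
    ("NUMBER", "3"),
};

// Prints: [IDENTIFIER] max Value [OPERATOR] = [NUMBER] 5 [OPERATOR] + [NUMBER] 3
Console.WriteLine(string.Join(" ", Linearize(leaves, includeNodeTypes: true)));

// Prints: max Value = 5 + 3
Console.WriteLine(string.Join(" ", Linearize(leaves, includeNodeTypes: false)));
```

A real base tokenizer would typically use a learned subword vocabulary instead of a casing rule, so the pieces it produces for "maxValue" may look different.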