Class CodeBertTokenizer

Namespace
AiDotNet.Tokenization.CodeTokenization
Assembly
AiDotNet.dll

CodeBERT-compatible tokenizer for program synthesis and code understanding tasks. Combines WordPiece tokenization with code-aware preprocessing.
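To make the WordPiece step concrete, here is a minimal, self-contained sketch of greedy longest-match-first splitting over a toy vocabulary. The `WordPiece` function, the `toyVocab` set, and the `##` continuation prefix are illustrative assumptions and are not taken from AiDotNet; the library's actual implementation and its code-aware preprocessing may differ.

```csharp
using System;
using System.Collections.Generic;

// Toy vocabulary (assumption for illustration); a real vocabulary is far larger.
var toyVocab = new HashSet<string> { "get", "##user", "##name", "(", ")", "[UNK]" };

// Hypothetical helper: greedy longest-match-first WordPiece split of one word.
static List<string> WordPiece(string word, HashSet<string> vocabulary, string unk = "[UNK]")
{
    var pieces = new List<string>();
    int start = 0;
    while (start < word.Length)
    {
        int end = word.Length;
        string piece = "";
        bool found = false;
        // Shrink the candidate from the right until it appears in the vocabulary.
        while (end > start)
        {
            string candidate = word.Substring(start, end - start);
            if (start > 0) candidate = "##" + candidate; // mark a continuation piece
            if (vocabulary.Contains(candidate)) { piece = candidate; found = true; break; }
            end--;
        }
        // If no piece fits at this position, the whole word maps to the unknown token.
        if (!found) return new List<string> { unk };
        pieces.Add(piece);
        start = end;
    }
    return pieces;
}

Console.WriteLine(string.Join(" ", WordPiece("getusername", toyVocab)));
// prints: get ##user ##name
```

A word with no matching piece, such as `xyz` with this vocabulary, collapses to the single unknown token.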

public class CodeBertTokenizer
Inheritance
CodeBertTokenizer

Constructors

CodeBertTokenizer(IVocabulary, ProgrammingLanguage, SpecialTokens?)

Creates a new CodeBERT tokenizer.

public CodeBertTokenizer(IVocabulary vocabulary, ProgrammingLanguage language = ProgrammingLanguage.Generic, SpecialTokens? specialTokens = null)

Parameters

vocabulary IVocabulary

The vocabulary used to map tokens to IDs.

language ProgrammingLanguage

The programming language of the code to tokenize. Defaults to ProgrammingLanguage.Generic.

specialTokens SpecialTokens

The special tokens to use. When null, BERT-style special tokens are used.

Properties

Tokenizer

Gets the underlying tokenizer.

public ITokenizer Tokenizer { get; }

Property Value

ITokenizer

Methods

Decode(List<int>, bool)

Decodes token IDs back to code.

public string Decode(List<int> tokenIds, bool skipSpecialTokens = true)

Parameters

tokenIds List<int>

The token IDs to decode.

skipSpecialTokens bool

Whether to omit special tokens from the decoded text. Defaults to true.

Returns

string

The decoded code.
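As a rough illustration of what decoding WordPiece output involves, the sketch below joins token strings back into text after the IDs have been looked up. `JoinPieces`, the `##` prefix, and the hard-coded special-token set are assumptions for illustration, not AiDotNet's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: rebuilds text from WordPiece token strings.
static string JoinPieces(IEnumerable<string> pieces, bool skipSpecialTokens = true)
{
    // Assumed BERT-style special tokens.
    var special = new HashSet<string> { "[CLS]", "[SEP]", "[PAD]", "[UNK]", "[MASK]" };
    var sb = new StringBuilder();
    foreach (var piece in pieces)
    {
        if (skipSpecialTokens && special.Contains(piece)) continue;
        if (piece.StartsWith("##", StringComparison.Ordinal))
        {
            sb.Append(piece.Substring(2)); // glue a continuation onto the previous piece
        }
        else
        {
            if (sb.Length > 0) sb.Append(' '); // separate whole tokens with a space
            sb.Append(piece);
        }
    }
    return sb.ToString();
}

Console.WriteLine(JoinPieces(new[] { "[CLS]", "get", "##user", "##name", "(", ")", "[SEP]" }));
// prints: getusername ( )
```

Joining on single spaces is lossy: the original whitespace and indentation of the code are not recovered in this sketch.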

EncodeCodeAndNL(string, string?, EncodingOptions?)

Encodes a code snippet, optionally paired with a natural language description, as input for CodeBERT.

public TokenizationResult EncodeCodeAndNL(string code, string? naturalLanguage = null, EncodingOptions? options = null)

Parameters

code string

The code snippet.

naturalLanguage string

The natural language description (optional).

options EncodingOptions

The encoding options (optional).

Returns

TokenizationResult

The tokenization result.
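This page does not say how the two inputs are arranged in the encoded sequence. As a sketch only, the code below lays out a bimodal pair in the natural-language-first order described in the CodeBERT paper, using BERT-style `[CLS]` and `[SEP]` markers and segment IDs. `BuildPair` is a hypothetical helper, and the ordering, marker names, and segment IDs are assumptions that may not match what EncodeCodeAndNL produces.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: arranges already-tokenized inputs as one sequence.
static (List<string> Tokens, List<int> SegmentIds) BuildPair(
    IReadOnlyList<string> nlTokens, IReadOnlyList<string> codeTokens)
{
    var layout = new List<string> { "[CLS]" };
    var segs = new List<int> { 0 };
    bool hasNl = nlTokens.Count > 0;
    if (hasNl)
    {
        // The first segment holds the natural language description.
        foreach (var t in nlTokens) { layout.Add(t); segs.Add(0); }
        layout.Add("[SEP]"); segs.Add(0);
    }
    // Code goes in the second segment, or the only segment when no description is given.
    int codeSeg = hasNl ? 1 : 0;
    foreach (var t in codeTokens) { layout.Add(t); segs.Add(codeSeg); }
    layout.Add("[SEP]"); segs.Add(codeSeg);
    return (layout, segs);
}

var (pairTokens, pairSegments) = BuildPair(
    new[] { "return", "the", "sum" },
    new[] { "a", "+", "b" });
Console.WriteLine(string.Join(" ", pairTokens));
// prints: [CLS] return the sum [SEP] a + b [SEP]
Console.WriteLine(string.Join("", pairSegments));
// prints: 000001111
```

With an empty description, the sketch emits a single segment: `[CLS]`, the code tokens, then `[SEP]`.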