Class CodeBertTokenizer
- Namespace
- AiDotNet.Tokenization.CodeTokenization
- Assembly
- AiDotNet.dll
CodeBERT-compatible tokenizer for program synthesis and code understanding tasks. Combines WordPiece tokenization with code-aware preprocessing.
public class CodeBertTokenizer
- Inheritance
- object
CodeBertTokenizer
- Inherited Members
Constructors
CodeBertTokenizer(IVocabulary, ProgrammingLanguage, SpecialTokens?)
Creates a new CodeBERT tokenizer.
public CodeBertTokenizer(IVocabulary vocabulary, ProgrammingLanguage language = ProgrammingLanguage.Generic, SpecialTokens? specialTokens = null)
Parameters
vocabulary (IVocabulary): The vocabulary.
language (ProgrammingLanguage): The programming language.
specialTokens (SpecialTokens): The special tokens (BERT-style by default).
Properties
Tokenizer
Gets the underlying tokenizer.
public ITokenizer Tokenizer { get; }
Property Value
- ITokenizer
Methods
Decode(List<int>, bool)
Decodes token IDs back to code.
public string Decode(List<int> tokenIds, bool skipSpecialTokens = true)
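Parameters
tokenIds (List<int>): The token IDs to decode.
skipSpecialTokens (bool): Whether to omit special tokens from the decoded output. Defaults to true.
Returns
- string
The decoded code.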
EncodeCodeAndNL(string, string?, EncodingOptions?)
Encodes code and natural language for CodeBERT.
public TokenizationResult EncodeCodeAndNL(string code, string? naturalLanguage = null, EncodingOptions? options = null)
Parameters
code (string): The code snippet.
naturalLanguage (string): The natural language description (optional).
options (EncodingOptions): Encoding options.
Returns
- TokenizationResult
The tokenization result.
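The members documented above could be combined roughly as in the following sketch. It uses only the constructor, EncodeCodeAndNL, and Decode signatures listed on this page. The namespaces AiDotNet.Tokenization.Interfaces and AiDotNet.Tokenization.Models are assumptions about where IVocabulary, TokenizationResult, and related types live, and the CodeBertTokenizerExample class with its two helper methods is hypothetical. How an IVocabulary is built or loaded is not covered here, so the vocabulary is taken as a parameter.

```csharp
using System.Collections.Generic;
using AiDotNet.Tokenization.CodeTokenization;
// Assumed namespaces for IVocabulary and TokenizationResult; adjust to the actual library layout.
using AiDotNet.Tokenization.Interfaces;
using AiDotNet.Tokenization.Models;

// Hypothetical helper class, for illustration only.
public static class CodeBertTokenizerExample
{
    // Encodes a code snippet together with its natural language description.
    public static TokenizationResult EncodePair(IVocabulary vocabulary)
    {
        // language defaults to ProgrammingLanguage.Generic and
        // specialTokens defaults to null (BERT-style tokens), per the constructor signature.
        var tokenizer = new CodeBertTokenizer(vocabulary);

        return tokenizer.EncodeCodeAndNL(
            code: "int Add(int a, int b) => a + b;",
            naturalLanguage: "Adds two integers.");
    }

    // Decodes token IDs back to code, dropping special tokens.
    public static string DecodeIds(CodeBertTokenizer tokenizer, List<int> tokenIds)
    {
        return tokenizer.Decode(tokenIds, skipSpecialTokens: true);
    }
}
```

The naturalLanguage and options arguments are both optional, so calling EncodeCodeAndNL with only the code argument should also be valid according to the signature above.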