Class MBPPBenchmark<T>

Namespace
AiDotNet.Reasoning.Benchmarks
Assembly
AiDotNet.dll

MBPP (Mostly Basic Python Problems) benchmark for evaluating Python code generation.

public class MBPPBenchmark<T> : IBenchmark<T>

Type Parameters

T

The numeric type used for scoring (e.g., double, float).

Inheritance
object → MBPPBenchmark<T>

Implements
IBenchmark<T>

Remarks

For Beginners: MBPP is a collection of basic Python programming problems, designed to test fundamental programming skills and algorithmic thinking.

What is MBPP? MBPP contains 974 short, entry-level Python programming problems. Each problem includes:

  • A natural language task description
  • A reference code solution
  • Three test cases for automated checking

MBPP vs HumanEval:

  • MBPP: 974 problems, more basic, multiple test cases provided
  • HumanEval: 164 problems, more challenging, function signature given
  • Overlap: Both test code generation, but MBPP is more comprehensive

Example problems:

Problem 1: Sum of numbers

Task: Write a function to find the sum of all numbers in a list.
Test Cases:
- sum_list([1, 2, 3]) == 6
- sum_list([10, 20]) == 30
- sum_list([]) == 0

Problem 2: Check palindrome

Task: Write a function to check if a string is a palindrome.
Test Cases:
- is_palindrome("racecar") == True
- is_palindrome("hello") == False
- is_palindrome("") == True

Problem 3: Remove duplicates

Task: Write a function to remove duplicate elements from a list.
Test Cases:
- remove_duplicates([1, 2, 2, 3]) == [1, 2, 3]
- remove_duplicates([]) == []
- remove_duplicates([1, 1, 1]) == [1]

Categories:

  • List operations (sorting, filtering, searching)
  • String manipulation
  • Mathematical operations
  • Basic algorithms (searching, sorting)
  • Data structure operations (lists, dictionaries)
  • Boolean logic

Difficulty levels:

  • Basic: Simple operations (50% of problems)
  • Intermediate: Multiple steps (40% of problems)
  • Advanced: Complex logic (10% of problems)

Reported model performance (approximate accuracy):

  • GPT-3 (Codex): ~59%
  • GPT-3.5: ~70%
  • GPT-4: ~82%
  • Claude 3 Opus: ~78%
  • Claude 3.5 Sonnet: ~85%
  • OpenAI o1: ~90%
  • AlphaCode: ~75%
  • CodeGen: ~65%

Why it's useful:

  • Tests basic programming competency
  • More comprehensive than HumanEval (974 vs 164)
  • Includes test cases (can verify correctness)
  • Entry-level difficulty (good for beginners)
  • Real-world relevance (common programming tasks)

Research:

  • "Program Synthesis with Large Language Models" (Austin et al., 2021)
  • https://arxiv.org/abs/2108.07732
  • Dataset: 974 problems with solutions and test cases
  • Used by Google Research for code generation evaluation

Integration with CodeExecutionVerifier: MBPP works particularly well with CodeExecutionVerifier since each problem includes test cases that can be executed to verify correctness.

Constructors

MBPPBenchmark()

Initializes a new instance of the MBPP benchmark.

public MBPPBenchmark()
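
A minimal instantiation sketch. The generic argument is the numeric type used for scores (double here); the exact values printed depend on the library's data.

using System;
using AiDotNet.Reasoning.Benchmarks;

// Create the benchmark; scores will be reported as double values.
var benchmark = new MBPPBenchmark<double>();

Console.WriteLine(benchmark.BenchmarkName);   // name of the benchmark, e.g. "MBPP"
Console.WriteLine(benchmark.TotalProblems);   // number of problems available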

Properties

BenchmarkName

Gets the name of this benchmark.

public string BenchmarkName { get; }

Property Value

string

Description

Gets a description of what this benchmark measures.

public string Description { get; }

Property Value

string

TotalProblems

Gets the total number of problems in this benchmark.

public int TotalProblems { get; }

Property Value

int

Methods

EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)

Evaluates a reasoning strategy on this benchmark.

public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)

Parameters

evaluateFunction Func<string, Task<string>>

Function that takes a problem and returns an answer.

sampleSize int?

Number of problems to evaluate (null for all).

cancellationToken CancellationToken

Cancellation token.

Returns

Task<BenchmarkResult<T>>

Benchmark results with accuracy and detailed metrics.

Remarks

For Beginners: This runs the benchmark by:

  1. Selecting problems (either all of them or a random sample)
  2. Asking the reasoning system to solve each one
  3. Comparing the answers to the correct solutions
  4. Calculating accuracy and other metrics

The evaluateFunction is your reasoning system: it takes a problem string and returns an answer string.
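
A hedged usage sketch follows. The SolveAsync local function is a stand-in for a real model call, and the Accuracy property read from BenchmarkResult<T> is an assumed name; check the BenchmarkResult<T> reference for its actual members.

using System;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new MBPPBenchmark<double>();

// Stand-in for a real reasoning system: receives the problem text and
// returns generated Python code as a string.
async Task<string> SolveAsync(string problem)
{
    // Call your LLM or reasoning strategy here.
    return await Task.FromResult("def solution(values):\n    return sum(values)");
}

// Evaluate on a random sample of 50 problems (pass null to run all of them).
var result = await benchmark.EvaluateAsync(SolveAsync, sampleSize: 50);

// NOTE: Accuracy is an assumed property name on BenchmarkResult<T>.
Console.WriteLine($"Accuracy: {result.Accuracy}");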

LoadProblemsAsync(int?)

Loads benchmark problems (for inspection or custom evaluation).

public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)

Parameters

count int?

Number of problems to load (null for all).

Returns

Task<List<BenchmarkProblem>>

List of benchmark problems.

Remarks

For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
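
A short inspection sketch, assuming only the members shown on this page; the properties of BenchmarkProblem are not listed here, so consult its reference before reading individual fields.

using System;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new MBPPBenchmark<double>();

// Load a small subset for inspection (pass null to load all problems).
var problems = await benchmark.LoadProblemsAsync(count: 10);

Console.WriteLine($"Loaded {problems.Count} of {benchmark.TotalProblems} problems.");

// Each BenchmarkProblem carries the problem and its correct answer; see the
// BenchmarkProblem reference for the exact property names before building
// a custom evaluation loop over them.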