Class MBPPBenchmark<T>
Namespace: AiDotNet.Reasoning.Benchmarks
Assembly: AiDotNet.dll
MBPP (Mostly Basic Python Problems) benchmark for evaluating Python code generation.
public class MBPPBenchmark<T> : IBenchmark<T>
Type Parameters
T: The numeric type used for scoring (e.g., double, float).
Inheritance: object → MBPPBenchmark<T>
Implements: IBenchmark<T>
Remarks
For Beginners: MBPP is a collection of basic Python programming problems, designed to test fundamental programming skills and algorithmic thinking.
What is MBPP? MBPP contains 974 short Python programming problems at an entry-level difficulty. Each problem includes:
- A natural language problem description
- A reference code solution
- Three test cases
MBPP vs HumanEval:
- MBPP: 974 problems, more basic, multiple test cases provided
- HumanEval: 164 problems, more challenging, function signature given
- Overlap: Both test code generation, but MBPP is more comprehensive
Example problems:
Problem 1: Sum of numbers
Task: Write a function to find the sum of all numbers in a list.
Test Cases:
- sum_list([1, 2, 3]) == 6
- sum_list([10, 20]) == 30
- sum_list([]) == 0
Problem 2: Check palindrome
Task: Write a function to check if a string is a palindrome.
Test Cases:
- is_palindrome("racecar") == True
- is_palindrome("hello") == False
- is_palindrome("") == True
Problem 3: Remove duplicates
Task: Write a function to remove duplicate elements from a list.
Test Cases:
- remove_duplicates([1, 2, 2, 3]) == [1, 2, 3]
- remove_duplicates([]) == []
- remove_duplicates([1, 1, 1]) == [1]
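For concreteness, the three example tasks above can be solved with short Python functions like these (illustrative solutions, not the official MBPP reference code):

```python
def sum_list(numbers):
    """Return the sum of all numbers in a list."""
    return sum(numbers)

def is_palindrome(s):
    """Return True if s reads the same forwards and backwards."""
    return s == s[::-1]

def remove_duplicates(items):
    """Remove duplicate elements, preserving first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

# Each solution passes the benchmark's test cases listed above:
assert sum_list([1, 2, 3]) == 6 and sum_list([10, 20]) == 30 and sum_list([]) == 0
assert is_palindrome("racecar") and not is_palindrome("hello") and is_palindrome("")
assert remove_duplicates([1, 2, 2, 3]) == [1, 2, 3] and remove_duplicates([1, 1, 1]) == [1]
```

Because every problem ships with assertions like these, correctness can be checked mechanically rather than by eyeballing the generated code.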
Categories:
- List operations (sorting, filtering, searching)
- String manipulation
- Mathematical operations
- Basic algorithms (searching, sorting)
- Data structure operations (lists, dictionaries)
- Boolean logic
Difficulty levels:
- Basic: Simple operations (50% of problems)
- Intermediate: Multiple steps (40% of problems)
- Advanced: Complex logic (10% of problems)
Performance levels:
- GPT-3 (Codex): ~59%
- GPT-3.5: ~70%
- GPT-4: ~82%
- Claude 3 Opus: ~78%
- Claude 3.5 Sonnet: ~85%
- OpenAI o1: ~90%
- AlphaCode: ~75%
- CodeGen: ~65%
Why it's useful:
- Tests basic programming competency
- More comprehensive than HumanEval (974 vs 164)
- Includes test cases (can verify correctness)
- Entry-level difficulty (good for beginners)
- Real-world relevance (common programming tasks)
Research:
- "Program Synthesis with Large Language Models" (Austin et al., 2021)
- https://arxiv.org/abs/2108.07732
- Dataset: 974 problems with solutions and test cases
- Used by Google Research for code generation evaluation
Integration with CodeExecutionVerifier: MBPP works particularly well with CodeExecutionVerifier since each problem includes test cases that can be executed to verify correctness.
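The execution-based checking described above can be sketched in a few lines of Python: run the candidate code, then execute each test-case assertion against it. This is a minimal illustration (the `check` helper and its inputs are hypothetical, not the CodeExecutionVerifier API), and note that `exec` on untrusted code is unsafe; a real verifier sandboxes execution.

```python
def check(candidate_code, test_cases):
    """Return True if the candidate passes every test case.

    candidate_code: a string defining the required function.
    test_cases: assertion strings, e.g. "assert sum_list([1, 2]) == 3".
    WARNING: exec on untrusted code is unsafe; real verifiers sandbox it.
    """
    namespace = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function
        for case in test_cases:
            exec(case, namespace)         # raises AssertionError on failure
    except Exception:
        return False
    return True

candidate = "def sum_list(nums):\n    return sum(nums)"
tests = ["assert sum_list([1, 2, 3]) == 6", "assert sum_list([]) == 0"]
print(check(candidate, tests))  # True
```

This pass/fail signal per problem is what makes MBPP a natural fit for execution-based verification: accuracy is simply the fraction of candidates that pass all of their test cases.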
Constructors
MBPPBenchmark()
public MBPPBenchmark()
Properties
BenchmarkName
Gets the name of this benchmark.
public string BenchmarkName { get; }
Property Value: string
Description
Gets a description of what this benchmark measures.
public string Description { get; }
Property Value: string
TotalProblems
Gets the total number of problems in this benchmark.
public int TotalProblems { get; }
Property Value: int
Methods
EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)
Evaluates a reasoning strategy on this benchmark.
public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)
Parameters
evaluateFunction (Func<string, Task<string>>): Function that takes a problem and returns an answer.
sampleSize (int?): Number of problems to evaluate (null for all).
cancellationToken (CancellationToken): Cancellation token.
Returns
- Task<BenchmarkResult<T>>
Benchmark results with accuracy and detailed metrics.
Remarks
For Beginners: This runs the benchmark by:
1. Selecting problems (either all or a random sample)
2. Asking the reasoning system to solve each one
3. Comparing answers to the correct solutions
4. Calculating accuracy and other metrics
The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
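The select/solve/compare/score flow described above can be sketched language-agnostically. Here is a minimal Python illustration (the `evaluate` helper, the problem dictionaries, and the solver are hypothetical stand-ins, not the EvaluateAsync implementation):

```python
import random

def evaluate(solve, problems, sample_size=None, seed=0):
    """Score a solver: the fraction of problems it answers correctly."""
    if sample_size is not None:
        rng = random.Random(seed)                 # seeded for reproducible sampling
        problems = rng.sample(problems, sample_size)
    correct = sum(1 for p in problems if solve(p["question"]) == p["answer"])
    return correct / len(problems)

problems = [
    {"question": "2 + 2", "answer": "4"},
    {"question": "3 * 3", "answer": "9"},
]
# The solver plays the role of evaluateFunction: problem string in, answer string out.
accuracy = evaluate(lambda q: str(eval(q)), problems)
print(accuracy)  # 1.0
```

In the real API the solver is asynchronous (it returns Task<string>) and the result carries detailed metrics rather than a bare ratio, but the control flow is the same.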
LoadProblemsAsync(int?)
Loads benchmark problems (for inspection or custom evaluation).
public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)
Parameters
count (int?): Number of problems to load (null for all).
Returns
- Task<List<BenchmarkProblem>>
List of benchmark problems.
Remarks
For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.