
GitHub - evalplus/evalplus: Rigorous evaluation of LLM …
EvalPlus is a rigorous evaluation framework for LLM4Code, with: HumanEval+, with 80x more tests than the original HumanEval; MBPP+, with 35x more tests than the original MBPP; EvalPerf: …
EvalPlus Leaderboard - GitHub Pages
Both MBPP and MBPP+ referred to in our leaderboard use a subset (399 tasks) of the hand-verified problems from MBPP-sanitized (427 tasks), to make sure each programming task is well-formed …
HumanEval Pro and MBPP Pro: Evaluating Large Language …
Dec 31, 2024 · We present HumanEval Pro and MBPP Pro, two expanded versions of the traditional HumanEval and MBPP benchmarks to evaluate LLMs on self-invoking code …
[2108.07732] Program Synthesis with Large Language Models
Aug 16, 2021 · On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize …
[2310.10996] ClarifyGPT: Empowering LLM-based Code …
Oct 17, 2023 · The results show that ClarifyGPT elevates the performance (Pass@1) of GPT-4 from 70.96% to 80.80% on MBPP-sanitized. Furthermore, to perform large-scale automated …
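Pass@1 figures like these are commonly computed with the unbiased pass@k estimator introduced in the HumanEval paper (Chen et al., 2021); a minimal sketch (the function name is mine, not from any of the cited works):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: number of samples generated per task
    c: number of those samples that pass all tests
    k: evaluation budget (e.g. k=1 for Pass@1)
    """
    if n - c < k:
        # Every size-k draw must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 samples, 1 passing, budget 1 -> probability 0.5
print(pass_at_k(2, 1, 1))
```

The reported benchmark score is this quantity averaged over all tasks.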
HumanEval Pro and MBPP Pro: Evaluating Large Language …
Dec 30, 2024 · First, we propose a general recipe for generating more challenging versions of existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP Pro, and …
An introduction to code LLM benchmarks for software engineers
Oct 19, 2023 · Description: The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. It was designed …
MBPP Dataset - Papers With Code
MBPP (Mostly Basic Python Programming). Introduced by Austin et al. in Program Synthesis with Large Language Models. The benchmark consists of around 1,000 crowd-sourced Python …
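To make the record layout concrete, here is a made-up task in the MBPP field format (the field names follow the released dataset; the task itself is an illustrative stand-in, not a real MBPP entry). Evaluation is execution-based: run the candidate code, then run each assert from `test_list`.

```python
# Illustrative MBPP-style record (fields: task_id, text, code, test_list).
task = {
    "task_id": 999,  # hypothetical id, not a real MBPP task
    "text": "Write a function to add two numbers.",
    "code": "def add(a, b):\n    return a + b",
    "test_list": ["assert add(1, 2) == 3", "assert add(0, 0) == 0"],
}

# Execute the candidate solution, then its test asserts, in one namespace.
ns = {}
exec(task["code"], ns)
for t in task["test_list"]:
    exec(t, ns)  # raises AssertionError if the solution is wrong
print("all tests passed")
```

A task counts as solved only if every assert runs without raising.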
Releases · evalplus/evalplus - GitHub
Oct 20, 2024 ·
# Step 1: Generate MBPP solutions
from evalplus.data import get_mbpp_plus, write_jsonl
def GEN_SOLUTION(prompt: str) -> str:
    # LLM produces the whole solution based …
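The release-notes snippet above is truncated; as a self-contained sketch of the generate-then-write-JSONL workflow, with the LLM call stubbed out, `get_mbpp_plus()` replaced by a hard-coded stand-in task, and `write_jsonl` replaced by a stdlib equivalent (the real helpers live in `evalplus.data`):

```python
import json

def gen_solution(prompt: str) -> str:
    # Stub standing in for the LLM call: a real run would return
    # model-generated code completing the given MBPP+ prompt.
    return prompt + "    return a + b\n"

# Hypothetical stand-in for evalplus.data.get_mbpp_plus(), which maps
# task_id -> problem dict (each problem carries a "prompt", among others).
problems = {
    "Mbpp/999": {"prompt": "def add(a, b):\n"},
}

samples = [
    {"task_id": task_id, "solution": gen_solution(p["prompt"])}
    for task_id, p in problems.items()
]

# evalplus.data.write_jsonl does the same thing: one JSON object per line.
with open("samples.jsonl", "w") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")
```

The resulting `samples.jsonl` is what the EvalPlus evaluator consumes to score solutions against the extended test suites.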
Benchmarks by EvalPlus Team
EvalPlus made HumanEval+ & MBPP+ by extending the tests by 80x/35x for rigorous eval. Based on Differential Performance Evaluation proposed by our COLM'24 paper, we rigorously …