Benchmarks & Datasets

Coding Ability

HumanEval

Website | HuggingFace | Chen et al. 2021

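Introduced by OpenAI in 2021 alongside the Codex model. It contains 164 hand-written Python programming problems, each with a function signature, a docstring, and unit tests.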

MBPP

Website | HuggingFace | Austin et al. 2021

The Mostly Basic Python Problems dataset

"consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases."
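As a rough illustration of the structure described in the quote (a task description, a code solution, and three assert-style tests), here is a minimal sketch of checking a solution against its tests. The record is invented for illustration, and the field names `text`, `code`, and `test_list` are assumptions about the layout rather than a guaranteed schema. The helper `passes_all_tests` is hypothetical.

```python
# Hypothetical MBPP-style record; the problem and field names are illustrative.
problem = {
    "text": "Write a function to return the sum of two numbers.",
    "code": "def add(a, b):\n    return a + b",
    "test_list": [
        "assert add(1, 2) == 3",
        "assert add(-1, 1) == 0",
        "assert add(0, 0) == 0",
    ],
}


def passes_all_tests(solution_code, tests):
    """Return True if the solution runs and every assert statement passes."""
    namespace = {}
    try:
        # Define the candidate function(s) in an isolated namespace.
        exec(solution_code, namespace)
        for test in tests:
            # Each test is an assert statement; a failure raises AssertionError.
            exec(test, namespace)
    except Exception:
        return False
    return True


result = passes_all_tests(problem["code"], problem["test_list"])
print(result)
```

Running model-generated code with `exec` is unsafe outside a sandbox; real evaluation harnesses typically isolate execution in a separate process with time and resource limits.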

EvalPlus / HumanEval+

Website | Liu et al. 2023

Adds 81x more tests for HumanEval
Tools for working with the test inputs and outputs
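To show why a larger set of test inputs matters, here is a small sketch of the general idea: expected outputs are derived from a trusted reference solution, and a candidate is compared against them on every input. This is not the EvalPlus API; the function names (`build_expected_outputs`, `check_candidate`) and the toy problem are hypothetical.

```python
def build_expected_outputs(reference, inputs):
    """Compute expected outputs by running a trusted reference solution."""
    return [reference(*args) for args in inputs]


def check_candidate(candidate, inputs, expected):
    """Return True if the candidate matches the expected output on every input."""
    for args, want in zip(inputs, expected):
        try:
            if candidate(*args) != want:
                return False
        except Exception:
            return False
    return True


def reference_max(xs):
    """Trusted reference solution for the toy problem."""
    return max(xs)


def buggy_max(xs):
    """Candidate with a bug: wrong when every element is negative."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


# A small base test set that the buggy candidate happens to pass.
base_inputs = [([1, 2, 3],), ([5, 4],)]
# An extended test set with extra inputs that expose the bug.
extra_inputs = base_inputs + [([-3, -1, -2],), ([0],)]

passes_base = check_candidate(
    buggy_max, base_inputs, build_expected_outputs(reference_max, base_inputs)
)
passes_extra = check_candidate(
    buggy_max, extra_inputs, build_expected_outputs(reference_max, extra_inputs)
)
print(passes_base, passes_extra)
```

The buggy candidate passes the two base inputs but fails once the all-negative list is included, which is the kind of false positive that additional tests are meant to catch.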

MultiPL-E

Website | HuggingFace | Cassano et al. 2022

Translates HumanEval and MBPP into 18 additional programming languages.