| evaluation_image (string, length 54-73) | instance_id (string, length 3-22) | package_name (string, length 3-46) | start_instruction (string, length 1.17k-32.7k) | test_cases_num (int64, 4-2.18k) | verify_cmd (list, length 1-4) | verify_files (list, length 1-12) |
|---|---|---|---|---|---|---|
| ghcr.io/multimodal-art-projection/nl2repobench/schema:1.0 | schema | schema | ## Schema Project Introduction and Goals<br>Schema is a lightweight library **for Python data structure validation**. It can parse and validate various data formats (supporting Python native data structures such as dictionaries, lists, tuples, and sets) and ensure that the data conforms to a predefined schema. This tool ... | 118 | ["pip install -e .", "pytest --continue-on-collection-errors test_schema.py"] | ["tests"] |
| ghcr.io/multimodal-art-projection/nl2repobench/funcy:1.0 | funcy | funcy | # Introduction and Goals of the Funcy Project<br>Funcy is a utility library for functional programming in Python, providing Python developers with rich functional programming abstractions and practical tools. It supports various scenarios such as collection operations, function composition, flow control, and debugging to... | 203 | ["pip install -e .", "pytest --continue-on-collection-errors tests"] | ["tests"] |
| ghcr.io/multimodal-art-projection/nl2repobench/cherry:1.0 | cherry | Cherry | ## Introduction and Goals of the Cherry Project<br>Cherry is a lightweight Python library **for text classification** that enables users without machine learning knowledge to quickly train a high-accuracy model within 5 minutes. This tool aims to significantly lower the threshold for text classification tasks, allowing d... | 34 | ["pip install -e .", "pytest --continue-on-collection-errors tests"] | ["tests"] |
| ghcr.io/multimodal-art-projection/nl2repobench/decouple:1.0 | decouple | decouple | ## Introduction and Goals of the Python-Decouple Project<br>Python-Decouple is a Python library **oriented towards configuration management separation**. It can achieve strict separation between code and configuration, support reading configuration parameters from environment variables, .env files, and .ini files, and pr... | 67 | ["pip install -e .", "pytest --continue-on-collection-errors tests"] | ["tests"] |
| ghcr.io/multimodal-art-projection/nl2repobench/jinja:1.0 | jinja | jinja | ## Introduction and Goals of the Jinja2 Project<br>Jinja2 is a **fast and expressive template engine** written in pure Python. It offers a non-XML syntax, supports inline expressions, and provides an optional sandbox environment. This engine is widely used in scenarios such as web development, configuration generation, a... | 911 | ["echo Hello >> README.rst", "echo Hello >> README.md", "pip install -e .", "pytest --continue-on-collection-errors tests"] | ["tests"] |
| ghcr.io/multimodal-art-projection/nl2repobench/freezegun:1.0 | freezegun | freezegun | "## Introduction and Goals of the freezegun Project\n\nfreezegun is a time-freezing library for Pyth(...TRUNCATED) | 133 | ["pip install -e .", "pytest --continue-on-collection-errors tests"] | ["tests"] |
| ghcr.io/multimodal-art-projection/nl2repobench/cerberus:1.0 | cerberus | cerberus | "## Project Introduction and Goals\n\n**Cerberus** is a lightweight and extensible Python data valid(...TRUNCATED) | 249 | ["pip install -e .","pytest --continue-on-collection-errors cerberus/tests cerberus/benchmarks/test_(...TRUNCATED) | ["tests"] |
| ghcr.io/multimodal-art-projection/nl2repobench/mechanicalsoup:1.0 | mechanicalsoup | mechanicalsoup | "# Introduction to the MechanicalSoup_main Project\n\n## 1. Project Overview and Objectives\n\nMecha(...TRUNCATED) | 140 | ["pip install -e .", "pytest --continue-on-collection-errors tests"] | ["tests"] |
| ghcr.io/multimodal-art-projection/nl2repobench/ipytest:1.0 | ipytest | ipytest | "## Introduction and Goals of the ipytest Project\n\nipytest is a Python library for test execution (...TRUNCATED) | 81 | ["pip install -e .", "pytest --continue-on-collection-errors tests"] | ["tests"] |
| ghcr.io/multimodal-art-projection/nl2repobench/tinydb:1.0 | tinydb | tinydb | "## Introduction and Goals of the TinyDB Project\n\nTinyDB is a **lightweight document-oriented data(...TRUNCATED) | 203 | ["pip install -e .", "pytest --continue-on-collection-errors tests"] | ["tests"] |
AweAgent-Meta-NL2Repo
This dataset provides the metadata used by AweAgent to run the NL2RepoBench evaluation.
If you are looking for the underlying benchmark itself (task design, repositories, test suites), please refer to the original project: multimodal-art-projection/NL2RepoBench.
Purpose
AweAgent evaluates end-to-end, repo-level code generation: given a natural-language project specification, the agent must produce a working Python package that passes the project's tests. To run that evaluation reproducibly, the harness needs a compact, machine-readable manifest of every instance, and this dataset is that manifest.
Concretely, each row tells AweAgent:
- which prebuilt evaluation Docker image to launch,
- the NL prompt that defines the target repository,
- the verification command and the files that must be produced,
- the number of test cases the generated repo will be scored against.
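As a rough sketch of how a harness might act on one row, the function below turns a manifest row into a single `docker run` invocation that chains the verification commands in one container, so that an earlier `pip install -e .` is still in effect when `pytest` runs. The helper name `build_verify_command`, the `/workspace` mount point, and the use of `sh -c` are assumptions made for illustration; AweAgent's actual runner may work differently.

```python
import shlex


def build_verify_command(row, repo_dir, workdir="/workspace"):
    """Build a `docker run` argv for one manifest row (hypothetical helper)."""
    cmds = row["verify_cmd"]
    # The dataset preview shows a list of commands; tolerate a single string too.
    if isinstance(cmds, str):
        cmds = [cmds]
    # Chain the commands so they share one container and stop at the first failure.
    script = " && ".join(cmds)
    return [
        "docker", "run", "--rm",
        "-v", f"{repo_dir}:{workdir}",  # assumed mount point for the generated repo
        "-w", workdir,
        row["evaluation_image"],
        "sh", "-c", script,             # assumes the image ships a POSIX shell
    ]


# Example row using values visible in the dataset preview.
row = {
    "evaluation_image": "ghcr.io/multimodal-art-projection/nl2repobench/schema:1.0",
    "verify_cmd": [
        "pip install -e .",
        "pytest --continue-on-collection-errors test_schema.py",
    ],
}
argv = build_verify_command(row, "/tmp/generated/schema")
print(shlex.join(argv))
```

The function only builds the argument list; passing it to `subprocess.run` is left to the caller so the command can be inspected or logged first.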
Files
- `nl2repo_aweagent.jsonl`: one JSON object per NL2RepoBench instance (104 instances).
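If you have a local copy of the file, it can be read with the standard library alone. The following is a minimal sketch; `read_manifest` is a hypothetical helper name, and the only assumption about the file is the JSON Lines layout described above (one JSON object per line).

```python
import json


def read_manifest(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows
```

Calling `read_manifest("nl2repo_aweagent.jsonl")` on a downloaded copy should return one dict per instance.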
Schema
| Field | Type | Description |
|---|---|---|
| `instance_id` | `str` | Unique identifier for the instance (typically the target package name). |
| `package_name` | `str` | The Python package the agent is expected to generate. |
| `evaluation_image` | `str` | Docker image (hosted under ghcr.io/multimodal-art-projection/nl2repobench) used to evaluate the generated repository. |
| `start_instruction` | `str` | The natural-language task description handed to the agent as the starting prompt. |
| `verify_files` | `list[str]` | Files that must be produced by the agent and are checked during verification. |
| `verify_cmd` | `list[str]` | The command(s) executed inside the evaluation image to verify the generated repository (1 to 4 per instance in the dataset preview). |
| `test_cases_num` | `int` | Number of test cases used to score the instance. |
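Before feeding rows to a runner, it can help to sanity-check them against the fields listed above. The sketch below is one way to do that; `EXPECTED_TYPES` and `check_row` are hypothetical names. It accepts either a string or a list for `verify_cmd`, because the dataset preview shows that field as a list of commands.

```python
# Field names come from the schema table; the helper itself is illustrative.
EXPECTED_TYPES = {
    "instance_id": str,
    "package_name": str,
    "evaluation_image": str,
    "start_instruction": str,
    "verify_files": list,
    "verify_cmd": (list, str),  # the preview shows a list; a plain string is tolerated
    "test_cases_num": int,
}


def check_row(row):
    """Return a list of problems found in a row; an empty list means it looks well-formed."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], expected):
            problems.append(f"{field}: unexpected type {type(row[field]).__name__}")
    return problems
```

Returning a list of problems, rather than raising on the first one, makes it easy to report every issue in a manifest in a single pass.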
Usage
```python
from datasets import load_dataset

ds = load_dataset("AweAI-Team/AweAgent-Meta-NL2Repo", split="train")
print(ds[0])
```
This manifest is consumed by the evaluation pipeline in AweAgent; see that repository for the full runner, scoring logic, and reproduction instructions.
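Once the rows are loaded, simple aggregates over the manifest are easy to compute. The sketch below uses a hypothetical `summarize` helper over any iterable of row dicts; the sample values are the `test_cases_num` figures shown for three instances in the dataset preview.

```python
def summarize(rows):
    """Count instances, total their test cases, and name the largest instance."""
    rows = list(rows)
    total = sum(r["test_cases_num"] for r in rows)
    largest = max(rows, key=lambda r: r["test_cases_num"])
    return {
        "instances": len(rows),
        "total_tests": total,
        "largest": largest["instance_id"],
    }


sample = [
    {"instance_id": "schema", "test_cases_num": 118},
    {"instance_id": "funcy", "test_cases_num": 203},
    {"instance_id": "jinja", "test_cases_num": 911},
]
print(summarize(sample))
# → {'instances': 3, 'total_tests': 1232, 'largest': 'jinja'}
```

Because iterating over a `datasets.Dataset` yields one dict per row, `summarize(ds)` should also work on the object returned by `load_dataset`.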
Acknowledgements
This dataset is built on top of, and would not exist without, the excellent NL2RepoBench project by the Multimodal Art Projection team. All benchmark instances, evaluation images, and test cases originate from their work; this dataset only repackages the per-instance metadata in the form AweAgent's evaluation harness expects. Huge thanks to the NL2RepoBench authors for releasing such a high-quality repository-level code-generation benchmark.
License
Released under CC BY 4.0. When using this dataset, please also cite and credit the upstream NL2RepoBench project.