Overview
FEA-Bench tests AI systems' ability to implement new features in real GitHub repositories.

We collect 1,401 task instances by crawling pull requests from 83 popular Python repositories. Each instance is based on a pull request that (see the illustrative check below)
- modifies one or more test-related files,
- introduces newly implemented components (functions or classes), and
- serves the general purpose of implementing a new feature.
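
For intuition, here is a minimal Python sketch of what such filtering could look like over a pull request's unified diff and title. The heuristics below are our own illustration, not the actual collection pipeline, which lives in the official FEA-Bench repository.

```python
# Illustrative heuristics only -- not the actual FEA-Bench collection pipeline.

def looks_like_feature_pr(diff_text: str, pr_title: str) -> bool:
    """Roughly check the three collection criteria against a unified diff."""
    lines = diff_text.splitlines()
    # Criterion 1: the PR touches at least one test-related file.
    touches_tests = any(
        line.startswith("+++ ") and "test" in line.lower() for line in lines
    )
    # Criterion 2: the PR adds a new top-level function or class.
    adds_component = any(
        line.startswith(("+def ", "+class ")) for line in lines
    )
    # Criterion 3: the PR looks like a feature request (crude keyword check).
    is_feature = any(
        kw in pr_title.lower() for kw in ("add", "feature", "support", "implement")
    )
    return touches_tests and adds_component and is_feature
```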

Per instance, following SWE-bench, we construct an execution environment (a Docker image) with the repository successfully installed at the commit the pull request is based on. Without the pull request's changes, one or more tests fail; after the pull request is merged, the same set of tests passes. These "Fail-to-Pass" tests are the primary signal for evaluation.
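
As a minimal sketch, the Fail-to-Pass check can be thought of as running the instance's tests inside its Docker image and inspecting the exit code. The image tag and pytest invocation below are placeholders; the real harness applies the candidate patch first and uses per-repository test commands.

```python
import subprocess

def fail_to_pass_ok(image: str, tests: list[str]) -> bool:
    """Run an instance's Fail-to-Pass tests inside its Docker image.

    Placeholder command line: the actual FEA-Bench/SWE-bench harness
    applies the model's patch and uses repository-specific test runners.
    """
    cmd = ["docker", "run", "--rm", image, "python", "-m", "pytest", "-x", *tests]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0  # zero exit code: all tests passed
```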
FEA-Bench evaluation works as follows. Per task instance, an AI system is given the request text for a new feature together with the positions and specified names of the new components. The AI system should then modify the codebase to implement the new feature. When the AI system is finished, we run the aforementioned Fail-to-Pass tests to check whether the new feature was successfully implemented.
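
Purely for illustration, a task input might carry fields along these lines. The field names here are hypothetical; the authoritative schema is the one in the published dataset.

```python
# Hypothetical field names, for illustration only; consult the dataset
# card on Hugging Face for the actual schema.
task = {
    "repo": "owner/project",
    "base_commit": "0123abcd",            # commit the pull request is based on
    "feature_request": "Add a --dry-run flag to the CLI ...",
    "new_components": [                   # positions and names to implement
        {"file": "project/cli.py", "name": "run_dry"},
    ],
}
```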
FEA-Bench was released in April 2025. In our initial experiments, a Retrieval-Augmented Generation (RAG) baseline using BM25 retrieval scored just 10.49% with DeepSeek-R1, and Agentless, a framework designed for bug fixing, scored only 14.0% on FEA-Bench Lite.
Resources
We provide the basic data for the FEA-Bench test set on Hugging Face.
⚠️ Due to licensing and company policies, we cannot release the full dataset on Hugging Face. Our published version includes only essential attributes; the remaining content must be scraped from GitHub using the code provided in the official FEA-Bench repository.
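
Assuming the published subset is hosted as a standard Hugging Face dataset, loading it should look roughly like this. The dataset ID below is our assumption; check the official repository for the exact path.

```python
from datasets import load_dataset

# Dataset ID is assumed here; confirm it in the official FEA-Bench repository.
ds = load_dataset("microsoft/FEA-Bench", split="test")
print(len(ds), "instances; fields:", list(ds[0].keys()))
```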
Citation
If you use FEA-Bench in your research, please cite our paper:
@inproceedings{li-etal-2025-fea,
    title = "{FEA}-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation",
    author = "Li, Wei and Zhang, Xin and Guo, Zhongxin and Mao, Shaoguang and Luo, Wen and Peng, Guangyue and Huang, Yangyu and Wang, Houfeng and Li, Scarlett",
    editor = "Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.839/",
    pages = "17160--17176",
    ISBN = "979-8-89176-251-0"
}