FEA-bench Lite

A curated subset of FEA-bench for faster, more cost-effective evaluation.

Wei Li, Xin Zhang, Zhongxin Guo, Shaoguang Mao, Wen Luo, Guangyue Peng, Yangyu Huang, Houfeng Wang, Scarlett Li

Peking University, Microsoft Research Asia

Paper GitHub Dataset

Overview

FEA-bench Lite provides a smaller, carefully selected subset of 200 tasks from the full benchmark, designed to:

Reduce evaluation costs while maintaining benchmark quality
Enable faster iteration cycles for model or agent development

The 200 tasks were selected to preserve the distribution of the original benchmark while lowering its difficulty and keeping the focus on new feature implementation.

While the full FEA-bench test split comprises 1,401 pull requests across 83 Python repositories, FEA-bench Lite covers 48 of the original 83 repositories with a similar diversity and distribution.

When compute efficiency is a concern, we recommend evaluating on this lite benchmark.

Selection Criteria

Task instances meeting any of the following low-quality criteria are excluded:

The feature request descriptions contain fewer than 40 words.
The instance involves cascading issues or commit SHA references.
The descriptions contain images, which cannot be read by code large language models.
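
The quality filters above could be sketched roughly as follows. This is a minimal illustration, not the authors' selection script: the function name `passes_quality_filters`, both regular expressions, and the assumption that a description is a plain Markdown string are all hypothetical. The cascading-issue check is left out because this page does not say how it is detected.

```python
import re

# Assumption: a commit reference looks like a standalone lowercase hex string
# of 7 to 40 characters that mixes digits and letters (a heuristic only).
COMMIT_SHA_RE = re.compile(
    r"\b(?=[0-9a-f]*[0-9])(?=[0-9a-f]*[a-f])[0-9a-f]{7,40}\b"
)
# Assumption: images appear as Markdown image syntax or HTML <img> tags.
IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)|<img\b", re.IGNORECASE)

MIN_WORDS = 40  # descriptions with fewer words are excluded


def passes_quality_filters(description: str) -> bool:
    """Return True if the description survives the three checks sketched here."""
    if len(description.split()) < MIN_WORDS:
        return False  # too short
    if COMMIT_SHA_RE.search(description):
        return False  # looks like it references a commit
    if IMAGE_RE.search(description):
        return False  # contains an image
    return True
```

Whitespace splitting is a crude word count; a real pipeline might tokenize differently.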

Additionally, to limit task difficulty, instances meeting any of the following criteria are also excluded:

Involve deleting code files.
Involve more than three code files.
The gold patch contains more than 10 code change hunks.
Natural-formatted code change content exceeds 4K (4096) tokens.
Contain new class(es).
Contain more than ten added functions.
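
The difficulty limits above can be expressed as a single predicate. The sketch below assumes the per-instance statistics have already been computed from the gold patch; the `PatchStats` class and its field names are hypothetical and are not part of the released dataset schema.

```python
from dataclasses import dataclass


@dataclass
class PatchStats:
    """Hypothetical precomputed statistics for one gold patch."""
    deleted_files: int          # code files removed by the patch
    changed_files: int          # code files touched by the patch
    hunks: int                  # code change hunks in the gold patch
    natural_format_tokens: int  # token count of the Natural-formatted change
    new_classes: int            # classes added by the patch
    added_functions: int        # functions added by the patch


def within_difficulty_limits(s: PatchStats) -> bool:
    """Return True if the instance meets none of the exclusion criteria."""
    return (
        s.deleted_files == 0
        and s.changed_files <= 3
        and s.hunks <= 10
        and s.natural_format_tokens <= 4096
        and s.new_classes == 0
        and s.added_functions <= 10
    )
```

For example, `within_difficulty_limits(PatchStats(0, 2, 5, 1500, 0, 3))` evaluates to `True`, while raising `hunks` to 11 makes it `False`.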