RepoExec Leaderboard
RepoExec: Evaluate Code Generation with a Repository-Level Executable Benchmark
📝 Notes
- Evaluated using RepoExec
- Models are ranked by Pass@1 using greedy decoding (a sketch of the computation follows these notes).
- ✨ marks models evaluated in a chat setting, while others perform direct code completion. Note that some instruction-tuned models lack a chat template in their tokenizer configuration (see the check sketched after these notes).
- Model providers have the responsibility to avoid data contamination. Models trained on closed data can be affected by contamination.
- 💚 means open weights and open data. 💙 means open weights and open SFT data, but the base model is not data-open. What does this imply? 💚💙 models open-source their data, so one can concretely reason about contamination.
- "Size" here is the amount of activated model weight during inference.
🤗 More Leaderboards
In addition to the RepoExec leaderboard, we recommend building a comprehensive picture of LLM coding ability from a diverse set of benchmarks and leaderboards, such as:
- BigCodeBench Leaderboard
- EvalPlus Leaderboard
- Big Code Models Leaderboard
- Chatbot Arena Leaderboard
- CrossCodeEval
- ClassEval
- CRUXEval
- Code Lingua
- Evo-Eval
- HumanEval.jl - Julia version of HumanEval with EvalPlus test cases
- InfiCoder-Eval
- LiveCodeBench
- NaturalCodeBench
- RepoBench
- SWE-bench
- TabbyML Leaderboard
- OOP
🙏 Acknowledgements
- We thank the EvalPlus and BigCode teams for providing the leaderboard template.