RepoExec Leaderboard
RepoExec: Evaluate Code Generation with a Repository-Level Executable Benchmark
📝 Notes
- Evaluated using RepoExec
- Models are ranked by Pass@1 using greedy decoding (a sketch of the computation follows these notes).
- ✨ marks models evaluated in a chat setting, while others perform direct code completion. Note that some instruction-tuned models lack a chat template in their tokenizer configuration (see the check sketched after these notes).
- Model providers have the responsibility to avoid data contamination. Models trained on closed data can be affected by contamination.
- 💚 means open weights and open data. 💙 means open weights and open SFT data, but the base model is not data-open. What does this imply? 💚💙 models open-source their data, so one can concretely reason about contamination.
- "Size" here is the amount of activated model weight during inference.
🤗 More Leaderboards
In addition to the RepoExec leaderboard, we recommend building a comprehensive picture of LLM coding ability from a diverse set of benchmarks and leaderboards, such as:
- BigCodeBench Leaderboard
- EvalPlus Leaderboard
- Big Code Models Leaderboard
- Chatbot Arena Leaderboard
- CrossCodeEval
- ClassEval
- CRUXEval
- Code Lingua
- Evo-Eval
- HumanEval.jl - Julia version of HumanEval with EvalPlus test cases
- InfiCoder-Eval
- LiveCodeBench
- NaturalCodeBench
- RepoBench
- SWE-bench
- TabbyML Leaderboard
- OOP
🙏 Acknowledgements
- We thank the EvalPlus and BigCode teams for providing the leaderboard template.