RepoExec Leaderboard

RepoExec: Evaluate Code Generation with a Repository-Level Executable Benchmark


blog | leaderboard | data | github | paper

πŸ“ Notes

  1. Evaluated using RepoExec.
  2. Models are ranked by Pass@1 using greedy decoding (see the pass@1 sketch after this list).
  3. ✨ marks models evaluated in a chat setting, while others perform direct code completion (see the chat-template sketch after this list). We note that some instruction-tuned models are missing the chat template in their tokenizer configuration.
  4. Model providers are responsible for avoiding data contamination. Models trained on closed data may be affected by contamination.
  5. πŸ’š means open weights and open data. πŸ’™ means open weights and open SFT data, but the base model's training data is not open. Why does this matter? πŸ’š and πŸ’™ models open-source their data, so one can concretely reason about contamination.
  6. "Size" here is the amount of activated model weight during inference.

πŸ€— More Leaderboards

In addition to the RepoExec leaderboard, we recommend building a comprehensive picture of LLM coding ability from a diverse set of benchmarks and leaderboards.

πŸ™ Acknowledgements

  • We thank the EvalPlus and BigCode teams for providing the leaderboard template.