SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models
Published 11 Nov 2025 · arXiv · Jingxuan Xu
Overview
SWE-Compass is a new benchmark for evaluating the software-engineering coding abilities of large language models (LLMs). It addresses the narrow coverage of existing benchmarks by spanning a wide range of task types, development scenarios, and programming languages.
Key Insights
- Comprehensive Coverage: SWE-Compass includes 8 task types, 8 programming scenarios, and 10 programming languages.
- Real-world Alignment: The benchmark uses 2000 instances curated from GitHub pull requests, ensuring relevance to actual developer workflows.
- Benchmarking Frameworks: It evaluates LLMs under two frameworks, SWE-Agent and Claude Code, revealing a hierarchy of task difficulty.
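The structure described above — instances drawn from GitHub pull requests and tagged by task type, scenario, and language, with results aggregated to reveal a difficulty hierarchy — can be sketched as follows. This is a minimal illustration, not the paper's actual schema or harness: the field names, task labels, and helper function are all hypothetical.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class BenchmarkInstance:
    # Hypothetical schema; field names are illustrative, not from the paper.
    task_type: str   # one of the 8 task types (label is an assumption)
    scenario: str    # one of the 8 programming scenarios
    language: str    # one of the 10 programming languages
    source_pr: str   # GitHub pull request the instance was curated from
    resolved: bool   # whether the agent's patch passed verification

def pass_rate_by_task(instances):
    """Aggregate resolution rates per task type to expose a difficulty hierarchy."""
    totals, passed = defaultdict(int), defaultdict(int)
    for inst in instances:
        totals[inst.task_type] += 1
        passed[inst.task_type] += inst.resolved
    return {t: passed[t] / totals[t] for t in totals}

# Toy usage with fabricated results (not real benchmark data):
sample = [
    BenchmarkInstance("bug_fix", "backend", "python", "org/repo#101", True),
    BenchmarkInstance("bug_fix", "backend", "go", "org/repo#102", False),
    BenchmarkInstance("feature_impl", "frontend", "typescript", "org/repo#103", False),
]
rates = pass_rate_by_task(sample)
# rates maps each task type to its resolution rate, e.g. rates["bug_fix"] == 0.5
```

Comparing such per-task rates across agent frameworks (e.g., SWE-Agent vs. Claude Code) is what surfaces the task-difficulty hierarchy the benchmark reports.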
BFSI Relevance
- Why Relevant: Understanding LLM capabilities in software engineering is crucial for BFSI organizations that rely on automated coding and software development.
- Primary Sector: Financial Services
- Subsectors: Asset Management, Corporate Banking
- Actionable Implications:
  - Evaluate LLMs for potential integration into software development processes.
  - Use insights to improve coding efficiency and reduce development costs.
Tags: researcher, peer-reviewed-paper, cross-bfsi, technology-and-data, global