SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models
Published 11 Nov 2025 · arXiv · Jingxuan Xu
Overview
SWE-Compass is a new benchmark for evaluating the software-engineering coding abilities of large language models (LLMs). It addresses the narrow coverage of existing benchmarks by spanning a wide range of task types, development scenarios, and programming languages.
Key Insights
- Comprehensive Coverage: SWE-Compass includes 8 task types, 8 programming scenarios, and 10 programming languages.
- Real-world Alignment: The benchmark uses 2000 instances curated from GitHub pull requests, ensuring relevance to actual developer workflows.
- Benchmarking Frameworks: It evaluates LLMs under two frameworks, SWE-Agent and Claude Code, revealing a hierarchy of task difficulty.
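The structure described above — instances drawn from GitHub pull requests and tagged by task type, scenario, and language, with results aggregated to reveal a difficulty hierarchy — can be sketched as follows. This is a minimal illustration, not the paper's actual schema or harness: the field names, task labels, and helper function are all hypothetical.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class BenchmarkInstance:
    # Hypothetical schema; field names are illustrative, not from the paper.
    task_type: str   # one of the 8 task types (label is an assumption)
    scenario: str    # one of the 8 programming scenarios
    language: str    # one of the 10 programming languages
    source_pr: str   # GitHub pull request the instance was curated from
    resolved: bool   # whether the agent's patch passed verification

def pass_rate_by_task(instances):
    """Aggregate resolution rates per task type to expose a difficulty hierarchy."""
    totals, passed = defaultdict(int), defaultdict(int)
    for inst in instances:
        totals[inst.task_type] += 1
        passed[inst.task_type] += inst.resolved
    return {t: passed[t] / totals[t] for t in totals}

# Toy usage with fabricated results (not real benchmark data):
sample = [
    BenchmarkInstance("bug_fix", "backend", "python", "org/repo#101", True),
    BenchmarkInstance("bug_fix", "backend", "go", "org/repo#102", False),
    BenchmarkInstance("feature_impl", "frontend", "typescript", "org/repo#103", False),
]
rates = pass_rate_by_task(sample)
# rates maps each task type to its resolution rate, e.g. rates["bug_fix"] == 0.5
```

Comparing such per-task rates across agent frameworks (e.g., SWE-Agent vs. Claude Code) is what surfaces the task-difficulty hierarchy the benchmark reports.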
BFSI Relevance
- Why Relevant: Understanding LLM capabilities in software engineering is crucial for BFSI organizations that rely on automated coding and software development.
- Primary Sector: Financial Services
- Subsectors: Asset Management, Corporate Banking
- Actionable Implications:
  - Evaluate LLMs for potential integration into software development processes.
  - Use insights to improve coding efficiency and reduce development costs.
Tags: researcher, peer-reviewed-paper, cross-bfsi, technology-and-data, global