POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios
Published 4 Nov 2025 · arXiv · Tingyue Yang
Overview
POLIS-Bench is an evaluation suite for assessing large language models (LLMs) on bilingual policy tasks in governmental scenarios. Its contributions span three areas: corpus size, task design, and evaluation metrics.
Key Insights
- Up-to-date Bilingual Corpus: A large, current policy corpus enhances assessment relevance.
- Scenario-Grounded Task Design: Includes Clause Retrieval & Interpretation, Solution Generation, and Compliance Judgment tasks.
- Dual-Metric Evaluation Framework: Combines semantic similarity and accuracy rate for precise evaluation.
- Performance Hierarchy: Reasoning models outperform the other evaluated models in task stability and accuracy.
- Cost-Effective Model Development: Fine-tuned open-source models match or exceed proprietary baselines at lower cost.
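The dual-metric idea in the list above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (semantic_similarity, accuracy_rate, dual_metric_report) are hypothetical, the similarity score uses a simple bag-of-words cosine as a stand-in for whatever semantic model the benchmark actually uses, and accuracy is assumed here to be exact label match on a judgment-style task.

```python
import math
from collections import Counter


def semantic_similarity(prediction: str, reference: str) -> float:
    """Cosine similarity over word counts.

    Assumption: a bag-of-words stand-in for an embedding-based score.
    """
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter returns 0 for missing tokens, so only shared tokens contribute.
    dot = sum(pred_counts[tok] * ref_counts[tok] for tok in pred_counts)
    norm = math.sqrt(sum(c * c for c in pred_counts.values())) * math.sqrt(
        sum(c * c for c in ref_counts.values())
    )
    return dot / norm if norm else 0.0


def accuracy_rate(predicted_labels, gold_labels) -> float:
    """Fraction of labels that match exactly (assumed scoring rule)."""
    if not gold_labels or len(predicted_labels) != len(gold_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == g for p, g in zip(predicted_labels, gold_labels))
    return correct / len(gold_labels)


def dual_metric_report(answers, references, predicted_labels, gold_labels) -> dict:
    """Report both metrics side by side instead of merging them into one score."""
    sims = [semantic_similarity(a, r) for a, r in zip(answers, references)]
    return {
        "mean_similarity": sum(sims) / len(sims) if sims else 0.0,
        "accuracy_rate": accuracy_rate(predicted_labels, gold_labels),
    }


if __name__ == "__main__":
    report = dual_metric_report(
        answers=["the permit must be renewed annually"],
        references=["the permit must be renewed every year"],
        predicted_labels=["compliant", "non-compliant"],
        gold_labels=["compliant", "compliant"],
    )
    print(report)
```

Keeping the two numbers separate makes it visible when a model produces fluent, on-topic text (high similarity) while still reaching the wrong judgment (low accuracy), or the reverse.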
BFSI Relevance
- Why Relevant: Knowing how well LLMs handle policy tasks informs their use in regulatory compliance and policy development.
- Primary Sector: Financial Services
- Subsectors: Regulatory Compliance, Policy Development
- Actionable Implications:
  - Evaluate LLMs for policy-related tasks.
  - Consider open-source models for cost-effective solutions.
  - Leverage dual-metric frameworks for precise model assessment.