BFSI insights

POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios

Published 4 Nov 2025 · arXiv · Tingyue Yang

Overview

POLIS-Bench is an evaluation suite for assessing large language models (LLMs) on bilingual policy tasks in governmental scenarios. It introduces advances in three areas: corpus size and currency, task design, and evaluation metrics.

Key Insights

  • Up-to-date Bilingual Corpus: A large, current policy corpus keeps the assessment relevant to present-day policy.
  • Scenario-Grounded Task Design: Three tasks are included: Clause Retrieval & Interpretation, Solution Generation, and Compliance Judgment.
  • Dual-Metric Evaluation Framework: Semantic similarity is combined with accuracy rate, so each model is scored on two measures rather than one.
  • Performance Hierarchy: The evaluation reveals that reasoning models outperform other models in task stability and accuracy.
  • Cost-Effective Model Development: Fine-tuned open-source models match or exceed proprietary baselines at lower cost.
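
The dual-metric idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the bag-of-words cosine is a simple stand-in for whatever semantic-similarity model the authors use, and the function names and record keys (`prediction`, `reference`, `pred_label`, `gold_label`) are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (assumed stand-in for an embedding model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Counter returns 0 for missing tokens, so unmatched words add nothing.
    dot = sum(count * vb[token] for token, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def dual_metric(records):
    """Return mean semantic similarity and accuracy rate over a list of records.

    Each record is a dict with hypothetical keys:
    'prediction' and 'reference' (free text), 'pred_label' and 'gold_label'.
    """
    if not records:
        raise ValueError("records must not be empty")
    sims = [bow_cosine(r["prediction"], r["reference"]) for r in records]
    hits = [r["pred_label"] == r["gold_label"] for r in records]
    return {
        "semantic_similarity": sum(sims) / len(sims),
        "accuracy_rate": sum(hits) / len(hits),
    }
```

Reporting the two numbers side by side, rather than merging them into one score, makes it visible when a model writes answers that resemble the reference but still reaches the wrong compliance verdict, or the reverse.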

BFSI Relevance

  • Why Relevant: Knowing how well LLMs handle policy tasks helps institutions judge where these models can support regulatory compliance and policy development.
  • Primary Sector: Financial Services
  • Subsectors: Regulatory Compliance, Policy Development
  • Actionable Implications:
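    • Evaluate candidate LLMs on policy-related tasks such as clause retrieval and interpretation, solution generation, and compliance judgment.
    • Consider fine-tuned open-source models as a lower-cost alternative to proprietary ones.
    • Assess models on both semantic similarity and accuracy rate rather than on a single metric.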