BFSI insights

AIRepr: An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science

Published 7 Nov 2025 · arXiv · Qiuhai Zeng

Overview

AIRepr is a framework designed to evaluate and enhance the reproducibility of data science workflows generated by large language models (LLMs). It aims to improve transparency and reliability in human-AI collaboration by focusing on the logical plans guiding code generation.
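The analyst-inspector pattern named in the title can be pictured as one LLM (the analyst) producing a logical plan plus code, and a second LLM (the inspector) scoring how reproducible that plan is. A minimal sketch under that assumption follows; the function names, plan format, and scoring rule are illustrative placeholders, not the paper's actual API:

```python
# Hypothetical sketch of an analyst-inspector reproducibility check.
# All names and the scoring rule are illustrative assumptions,
# not the AIRepr paper's implementation.

def analyst(task: str) -> dict:
    """Stub analyst: returns a logical plan plus generated code.
    In practice this would be an LLM call."""
    return {
        "plan": ["load data", "handle missing values", "fit model", "report metric"],
        "code": "result = run_pipeline(data)",  # placeholder code artifact
    }

def inspector(workflow: dict) -> float:
    """Stub inspector: scores plan reproducibility on [0, 1].
    Here: the fraction of plan steps that are concrete (non-empty)."""
    steps = workflow["plan"]
    if not steps:
        return 0.0
    concrete = [s for s in steps if s.strip()]
    return len(concrete) / len(steps)

def evaluate(task: str, threshold: float = 0.75) -> bool:
    """Accept the analyst's workflow only if the inspector's
    reproducibility score clears the threshold."""
    workflow = analyst(task)
    return inspector(workflow) >= threshold

print(evaluate("predict loan defaults"))  # True for this stub
```

In the real framework the inspector would attempt to re-derive or re-run the analysis from the stated plan; the stub above only checks that each step is stated at all.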

Key Insights

  • Reproducibility and Accuracy: Workflows with higher reproducibility also yield more accurate analyses.
    • Evidence: Benchmarked across 15 analyst-inspector LLM pairs and 1,032 tasks drawn from three public benchmarks.
    • Verifiable: Yes
  • Reproducibility-Enhancing Prompts: These prompts significantly improve both reproducibility and accuracy compared to standard prompting strategies.
    • Evidence: Comparison with standard prompting across multiple tasks.
    • Verifiable: Yes
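A reproducibility-enhancing prompt, as described above, wraps the task with explicit instructions to plan first and make the workflow re-runnable. The wording below is a hypothetical example of such a wrapper, not the template from the paper:

```python
# Illustrative reproducibility-enhancing prompt wrapper; the wording
# is an assumption, not the AIRepr paper's prompt template.

def with_reproducibility_prompt(task: str) -> str:
    """Prepend reproducibility instructions to a data-analysis task."""
    return (
        f"Task: {task}\n"
        "Before writing code, state an explicit step-by-step logical plan.\n"
        "Fix all random seeds, pin data sources and versions, and log\n"
        "intermediate outputs so another analyst can re-run and verify\n"
        "each step of the workflow."
    )

print(with_reproducibility_prompt("segment retail banking customers"))
```

The framework's finding is that sending prompts of this kind to the analyst model improved both reproducibility and accuracy relative to standard prompting.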

BFSI Relevance

  • Why Relevant: Reproducibility checks make AI-driven data analysis more reliable and transparent, which is crucial for auditable decision-making in financial services.
  • Primary Sector: Financial Services
  • Subsectors: Data Analysis, Risk Management
  • Actionable Implications:
    • Implement reproducibility-enhancing strategies in AI-driven data analysis.
    • Use AIRepr framework to evaluate and improve current AI workflows.
    • Enhance transparency in AI-driven decision-making processes.