AIRepr: An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science
Published 7 Nov 2025 · arXiv · Qiuhai Zeng
Overview
AIRepr is a framework designed to evaluate and enhance the reproducibility of data science workflows generated by large language models (LLMs). It aims to improve transparency and reliability in human-AI collaboration by focusing on the logical plans guiding code generation.
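The analyst-inspector idea can be sketched as follows: an analyst LLM produces a logical plan plus code, an inspector LLM independently re-implements the workflow from the plan alone, and the workflow counts as reproducible when both implementations agree. This is a minimal illustration of that loop, not the paper's actual implementation; all function names and the toy task are hypothetical, with hard-coded stand-ins in place of real model calls.

```python
# Minimal sketch of an analyst-inspector reproducibility check.
# All names below are illustrative; real analyst/inspector roles
# would be played by LLM calls, which are stubbed out here.

def analyst(task: str) -> tuple[str, str]:
    """Stand-in for the analyst LLM: returns a (logical_plan, code) pair."""
    plan = "1. Sum the integers 1 through 4. 2. Report the total."
    code = "result = sum(range(1, 5))"
    return plan, code

def inspector(plan: str) -> str:
    """Stand-in for the inspector LLM: re-implements the workflow
    from the analyst's plan alone, without seeing the analyst's code."""
    return "result = sum([1, 2, 3, 4])"

def run(code: str):
    """Execute a snippet and return the value bound to `result`."""
    ns: dict = {}
    exec(code, ns)
    return ns["result"]

def is_reproducible(task: str) -> bool:
    """Reproducible iff the inspector's independent implementation
    yields the same output as the analyst's."""
    plan, analyst_code = analyst(task)
    inspector_code = inspector(plan)
    return run(analyst_code) == run(inspector_code)

print(is_reproducible("sum the integers 1 through 4"))  # True
```

In practice the comparison would tolerate benign differences (e.g., float rounding, column order) rather than require exact equality.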
Key Insights
- Reproducibility and Accuracy: Workflows with higher reproducibility also yield more accurate analyses.
- Evidence: Benchmarked against 15 analyst-inspector LLM pairs and 1,032 tasks from three public benchmarks.
- Verifiable: Yes
- Reproducibility-Enhancing Prompts: These prompts significantly improve both reproducibility and accuracy compared to standard prompting strategies.
- Evidence: Comparison with standard prompting across multiple tasks.
- Verifiable: Yes
BFSI Relevance
- Why Relevant: Ensures reliable and transparent AI-driven data analysis, crucial for decision-making in financial services.
- Primary Sector: Financial Services
- Subsectors: Data Analysis, Risk Management
- Actionable Implications:
  - Implement reproducibility-enhancing prompting strategies in AI-driven data analysis.
  - Use the AIRepr framework to evaluate and improve current AI workflows.
  - Enhance transparency in AI-driven decision-making processes.
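As a concrete starting point for the first implication, a reproducibility-enhancing prompt can be prepended to each analysis task before it is sent to the analyst model. The wording below is a hypothetical sketch in the spirit of the paper's approach, not the actual prompt used in AIRepr.

```python
# Illustrative reproducibility-enhancing prompt prefix.
# The wording is an assumption for demonstration, not taken from the paper.
REPRO_PROMPT = (
    "Before writing any code, state a numbered logical plan that a second "
    "analyst could follow to reproduce your analysis exactly: name every "
    "dataset and column used, list each transformation in order, fix all "
    "parameter values (including random seeds), and specify the exact "
    "statistic to report. Then write code that follows the plan step by step."
)

def build_task_prompt(task: str) -> str:
    """Prepend the reproducibility instructions to a data-analysis task."""
    return f"{REPRO_PROMPT}\n\nTask: {task}"

print(build_task_prompt("Compute the mean loan default rate by region."))
```

The resulting prompt can then be A/B tested against a standard prompt on the same tasks to measure the reproducibility and accuracy gains reported in the paper.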
Tags: professional, peer-reviewed-paper, global