AIRepr: An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science
Published 7 Nov 2025 · arXiv · Qiuhai Zeng
Overview
AIRepr is a framework designed to evaluate and enhance the reproducibility of data science workflows generated by large language models (LLMs). It aims to improve transparency and reliability in human-AI collaboration by focusing on the logical plans guiding code generation.
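The analyst-inspector idea can be sketched as follows: an analyst LLM produces a logical plan plus code, an inspector LLM independently re-implements the workflow from the plan alone, and the workflow counts as reproducible when both implementations agree. This is a minimal illustration of that loop, not the paper's actual implementation; all function names and the toy task are hypothetical, with hard-coded stand-ins in place of real model calls.

```python
# Minimal sketch of an analyst-inspector reproducibility check.
# All names below are illustrative; real analyst/inspector roles
# would be played by LLM calls, which are stubbed out here.

def analyst(task: str) -> tuple[str, str]:
    """Stand-in for the analyst LLM: returns a (logical_plan, code) pair."""
    plan = "1. Sum the integers 1 through 4. 2. Report the total."
    code = "result = sum(range(1, 5))"
    return plan, code

def inspector(plan: str) -> str:
    """Stand-in for the inspector LLM: re-implements the workflow
    from the analyst's plan alone, without seeing the analyst's code."""
    return "result = sum([1, 2, 3, 4])"

def run(code: str):
    """Execute a snippet and return the value bound to `result`."""
    ns: dict = {}
    exec(code, ns)
    return ns["result"]

def is_reproducible(task: str) -> bool:
    """Reproducible iff the inspector's independent implementation
    yields the same output as the analyst's."""
    plan, analyst_code = analyst(task)
    inspector_code = inspector(plan)
    return run(analyst_code) == run(inspector_code)

print(is_reproducible("sum the integers 1 through 4"))  # True
```

In practice the comparison would tolerate benign differences (e.g., float rounding, column order) rather than require exact equality.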
Key Insights
- Reproducibility and Accuracy: Workflows with higher reproducibility also yield more accurate analyses.
- Evidence: Benchmarked against 15 analyst-inspector LLM pairs and 1,032 tasks from three public benchmarks.
- Verifiable: Yes
- Reproducibility-Enhancing Prompts: These prompts significantly improve both reproducibility and accuracy compared to standard prompting strategies.
- Evidence: Comparison with standard prompting across multiple tasks.
- Verifiable: Yes
BFSI Relevance
- Why Relevant: Ensures reliable and transparent AI-driven data analysis, crucial for decision-making in financial services.
- Primary Sector: Financial Services
- Subsectors: Data Analysis, Risk Management
- Actionable Implications:
  - Implement reproducibility-enhancing prompting strategies in AI-driven data analysis.
  - Use the AIRepr framework to evaluate and improve current AI workflows.
  - Enhance transparency in AI-driven decision-making processes.
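As a concrete starting point for the first implication, a reproducibility-enhancing prompt can be prepended to each analysis task before it is sent to the analyst model. The wording below is a hypothetical sketch in the spirit of the paper's approach, not the actual prompt used in AIRepr.

```python
# Illustrative reproducibility-enhancing prompt prefix.
# The wording is an assumption for demonstration, not taken from the paper.
REPRO_PROMPT = (
    "Before writing any code, state a numbered logical plan that a second "
    "analyst could follow to reproduce your analysis exactly: name every "
    "dataset and column used, list each transformation in order, fix all "
    "parameter values (including random seeds), and specify the exact "
    "statistic to report. Then write code that follows the plan step by step."
)

def build_task_prompt(task: str) -> str:
    """Prepend the reproducibility instructions to a data-analysis task."""
    return f"{REPRO_PROMPT}\n\nTask: {task}"

print(build_task_prompt("Compute the mean loan default rate by region."))
```

The resulting prompt can then be A/B tested against a standard prompt on the same tasks to measure the reproducibility and accuracy gains reported in the paper.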
Tags: professional, peer-reviewed-paper, global