LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows
Published 10 Nov 2025 · arXiv · Raffi Khatchadourian
Overview
This paper examines output drift in Large Language Models (LLMs) used in financial workflows, finding that smaller models can achieve higher output consistency than larger ones. It challenges the assumption that larger models are always better for production deployment.
Key Insights
- Smaller Models' Consistency: Granite-3-8B and Qwen2.5-7B achieve 100% output consistency at T=0.0, while GPT-OSS-120B shows only 12.5% consistency.
  - Evidence: Consistency measured across 480 runs on regulated financial tasks.
  - Verifiable: Yes, through the study's methodology.
- Deterministic Test Harness: A finance-calibrated test harness using greedy decoding and fixed seeds produces repeatable outputs.
  - Evidence: Methodology detailed in the paper.
  - Verifiable: Yes, through replication of the study.
- Task Sensitivity: Structured tasks such as SQL generation remain stable, while retrieval-augmented generation (RAG) tasks drift by 25–75%.
  - Evidence: Results from cross-provider validation.
  - Verifiable: Yes, through the study's data.
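The consistency figures above can be recomputed from raw run outputs with an exact-match metric. A minimal sketch, assuming exact string equality as the matching criterion (the paper's precise comparison rule is not reproduced here):

```python
from collections import Counter

def consistency_rate(outputs):
    """Fraction of runs whose output exactly matches the modal output.

    A rate of 1.0 means every run produced byte-identical text, the
    bar described above for fully consistent models.
    """
    if not outputs:
        raise ValueError("need at least one run")
    _, modal_count = Counter(outputs).most_common(1)[0]
    return modal_count / len(outputs)

# Example: 7 of 8 runs agree -> 87.5% consistency
runs = ["42"] * 7 + ["41"]
print(f"{consistency_rate(runs):.1%}")  # 87.5%
```

Drift is then simply `1 - consistency_rate(outputs)`, computed per task type to surface differences such as the SQL-vs-RAG gap noted above.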
BFSI Relevance
- Why Relevant: The findings support reliable AI deployment in financial services, where auditability and compliance are crucial.
- Primary Sector: Financial Services
- Subsectors: Regulatory Reporting, Client Communications
- Actionable Implications:
  - Evaluate model size and architecture for consistency on financial tasks.
  - Implement deterministic testing frameworks for AI deployments.
  - Consider smaller models for tasks requiring high consistency.
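The deterministic-testing implication above can be sketched as a small gate: pin greedy decoding parameters, run the same prompt several times, and fail if the outputs disagree. A minimal sketch, assuming OpenAI-style parameter names (`temperature`, `top_p`, `seed`) and a caller-supplied `generate` callable; both are illustrative, not the paper's actual harness:

```python
from collections import Counter

def deterministic_params(model, prompt, seed=1234):
    """Greedy, repeatable decoding settings (OpenAI-style field names;
    hypothetical -- adapt to your provider's API)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # greedy decoding
        "top_p": 1.0,        # no nucleus truncation
        "seed": seed,        # pin residual sampling randomness
    }

def consistency_gate(generate, prompt, runs=8, threshold=1.0):
    """Run the same prompt `runs` times; pass only if the share of runs
    matching the modal output meets `threshold` (1.0 = all identical)."""
    outputs = [generate(prompt) for _ in range(runs)]
    modal_count = Counter(outputs).most_common(1)[0][1]
    rate = modal_count / runs
    return rate >= threshold, rate

# Stand-in model: perfectly repeatable, so the gate passes.
ok, rate = consistency_gate(lambda p: p.strip().lower(), "Net income?", runs=4)
```

In a real deployment the lambda would be replaced by a provider call built from `deterministic_params`, and the gate could run in CI before any prompt or model change reaches production.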