A Formal Framework for Evaluating Reasoning Integrity in Language Models

AI Evaluation · March 2026

Traditional language model evaluation often focuses on final-answer accuracy, which gives limited visibility into the reasoning process that produced the answer. This paper proposes a formal framework for evaluating reasoning integrity by modeling inference as a trajectory of belief states under uncertainty.

The framework defines externally observable belief states that capture hypotheses, uncertainty distributions, and constraints at each reasoning step. This allows reasoning trajectories to be analyzed without relying on internal model representations.
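As a rough illustration of what an externally observable belief state might look like in code, the sketch below stores the three ingredients named above: hypotheses, a probability distribution over them, and a set of constraints. The class name `BeliefState`, its field names, and the `entropy` helper are assumptions made for this example; the paper's own formal definition is not reproduced here.

```python
import math
from dataclasses import dataclass


@dataclass
class BeliefState:
    """One step of a reasoning trajectory (hypothetical schema, not the paper's)."""

    step: int
    beliefs: dict[str, float]  # hypothesis label -> probability mass
    constraints: frozenset[str] = frozenset()  # constraints asserted so far

    def __post_init__(self) -> None:
        # Reject anything that is not a valid probability distribution.
        if any(p < 0 for p in self.beliefs.values()):
            raise ValueError("probabilities must be non-negative")
        total = sum(self.beliefs.values())
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"probabilities must sum to 1, got {total}")

    def entropy(self) -> float:
        """Shannon entropy in bits, used here as a simple uncertainty summary."""
        return -sum(p * math.log2(p) for p in self.beliefs.values() if p > 0)


# A trajectory is then just an ordered list of belief states.
trajectory = [
    BeliefState(0, {"A": 0.5, "B": 0.5}),
    BeliefState(1, {"A": 0.9, "B": 0.1}, frozenset({"x > 0"})),
]
```

Because every field is read off the model's stated output, nothing in this representation depends on hidden activations or other internal model state.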

The paper introduces a divergence functional for measuring sustained disagreement between reasoning trajectories, alongside a complexity regularization term that penalizes excessive or redundant reasoning. These components are combined into a unified score balancing consistency and parsimony.
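One way such a score could be assembled is sketched below. It is not the paper's functional: the sketch assumes the per-step disagreement is the Jensen-Shannon divergence (base 2), averages it over aligned steps so that sustained disagreement weighs more than a single transient spike, and uses a saturating length penalty. The function names and the `budget` and `lam` parameters are hypothetical.

```python
import math


def js_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits between two distributions; lies in [0, 1]."""
    total = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution
        if pk > 0:
            total += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log2(qk / mk)
    return min(1.0, max(0.0, total))  # clamp away floating-point drift


def trajectory_divergence(traj_a: list[dict], traj_b: list[dict]) -> float:
    """Mean per-step divergence over the steps the two trajectories share."""
    n = min(len(traj_a), len(traj_b))
    if n == 0:
        return 0.0
    return sum(js_divergence(a, b) for a, b in zip(traj_a, traj_b)) / n


def complexity_penalty(num_steps: int, budget: int) -> float:
    """0 within the step budget, rising toward 1 as the excess grows (budget > 0)."""
    excess = max(0, num_steps - budget)
    return excess / (excess + budget)


def integrity_score(traj, reference, budget: int = 4, lam: float = 0.5) -> float:
    """Convex combination of consistency and parsimony, so the result stays in [0, 1]."""
    d = trajectory_divergence(traj, reference)
    c = complexity_penalty(len(traj), budget)
    return 1.0 - ((1.0 - lam) * d + lam * c)
```

Under these assumptions the score is bounded by construction, and relabeling the hypotheses consistently in both trajectories leaves it unchanged, which mirrors the kind of invariance the paper claims for its own metrics.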

To operationalize the framework, the paper presents a multi-stage evaluation protocol that constrains intermediate reasoning, injects minimal adversarial perturbations, and measures both divergence and repair cost. The proposed metrics are shown to be bounded, invariant under semantic-preserving transformations, and stable under controlled perturbations.
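The repair-cost idea can be illustrated with a small helper. This sketch assumes repair cost means the number of reasoning steps a model needs, after a perturbation is injected, before its divergence from the unperturbed run falls back under a tolerance. That reading, the function name `repair_cost`, and the `tol` default are assumptions for the example rather than definitions taken from the paper.

```python
from typing import Optional, Sequence


def repair_cost(
    divergences: Sequence[float], perturb_step: int, tol: float = 0.05
) -> Optional[int]:
    """Steps needed after the perturbation for divergence to drop to tol or below.

    divergences[i] is the divergence between the perturbed and unperturbed
    trajectories at step i. Returns None if the trajectory never recovers.
    """
    for offset, d in enumerate(divergences[perturb_step:]):
        if d <= tol:
            return offset
    return None


# Perturbation injected at step 2; the run is back within tolerance two steps later.
per_step = [0.0, 0.0, 0.4, 0.2, 0.03, 0.0]
cost = repair_cost(per_step, perturb_step=2)
```

Returning `None` rather than a large number keeps unrecovered runs distinguishable from slow recoveries when results are aggregated.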

Details

Authors: Amaya Kavya, Shardul Shinde
DOI: 10.20944/preprints202603.2034.v1
License: CC BY 4.0
