A Deterministic Trajectory-Level Evaluation Framework for Learning-Based Agentic Systems
DOI: https://doi.org/10.32996/jefas.2026.8.3.3

Keywords: Agentic Systems, Trajectory-Level Evaluation, Deterministic Replay, Learning-Based Autonomous Agents, AI Governance and Reliability

Abstract
Learning-based agentic systems are increasingly deployed in complex decision environments where reliability, transparency, and compliance with governance requirements are critical. However, the dominant evaluation strategies remain outcome-centric, focusing on aggregate performance measures such as accuracy, reward, or task completion rates. These strategies provide little insight into the internal decision-making processes that produce the observable outcomes, especially for multi-step reasoning-based agentic systems. This paper proposes a Deterministic Trajectory-Level Evaluation Framework (DTLEF) for learning-based agentic systems. The proposed framework shifts the evaluation paradigm from traditional outcome-oriented metrics to an evaluation process centered on the trajectories of states and actions under controlled execution conditions. The DTLEF integrates standardized state initialization, comprehensive action-trace logging, deterministic replay validation, and behavior verification against governance constraints. By running the agentic system in controlled inference mode and comparing trajectory traces across runs, the evaluation process enables the identification of policy instabilities, reasoning drift, and non-deterministic behavior, while also verifying that the trajectories remain consistent with predefined constraints. Unlike traditional performance metrics, the proposed framework does not depend on empirical data or domain-specific metrics. Instead, it provides a methodology for evaluating agentic systems, including autonomous decision pipelines, tool-augmented language agents, and cyber-physical control systems, at the architecture level. The framework increases transparency, reproducibility, and compliance without modifying training procedures.
By formally defining trajectory-level determinism as a primary evaluation criterion, this research provides a scalable, domain-independent methodology for validating learning-based autonomous agents in environments where behavioral reliability is as important as functional performance.
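To make the evaluation loop described in the abstract concrete, the following is a minimal, hypothetical sketch of trajectory-level deterministic replay validation. All names (`run_episode`, `trace_digest`, `replay_consistent`, `satisfies_constraints`) and the toy seeded policy are illustrative assumptions, not artifacts of the paper's actual implementation.

```python
import hashlib
import json
import random

def run_episode(seed, steps=5):
    """Run a toy agent in controlled inference mode (fixed seed = standardized
    state) and log the full (state, action) trajectory trace."""
    rng = random.Random(seed)  # seeded RNG stands in for controlled execution
    state = 0
    trace = []
    for _ in range(steps):
        action = rng.choice(["left", "right", "noop"])  # stand-in policy
        trace.append({"state": state, "action": action})
        state += {"left": -1, "right": 1, "noop": 0}[action]
    return trace

def trace_digest(trace):
    """Canonical hash of a trajectory, used to compare traces across runs."""
    payload = json.dumps(trace, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def replay_consistent(seed, runs=3):
    """Deterministic replay validation: repeated runs under identical
    conditions must yield byte-identical traces (a single digest)."""
    digests = {trace_digest(run_episode(seed)) for _ in range(runs)}
    return len(digests) == 1

def satisfies_constraints(trace, max_abs_state=3):
    """Behavior verification: every visited state must respect a predefined
    governance constraint (here, a bound on state magnitude)."""
    return all(abs(step["state"]) <= max_abs_state for step in trace)
```

Under these assumptions, `replay_consistent(42)` returns `True` because the seeded policy is deterministic; a policy with hidden nondeterminism (unseeded sampling, wall-clock dependence) would produce diverging digests, which is exactly the instability signal the framework is designed to surface.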
Copyright (c) 2026. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
