The Loop Does Not Break. The Mental Model Does

I upgraded my AI report generator and found violations that my validation system missed. Here is what broke and why human oversight is version-specific.

I was reviewing attendance records when I noticed one employee had logged 35 days in a month. Every month, I review associates who have logged more than 30 days, checking for overtime patterns. Thirty-five days made no sense. I traced it further and found the employee had worked across three locations continuously, crossing 24 hours without a break and violating our operational cadence. My validation tests had passed. The report had flagged nothing.

I had just upgraded to Haiku 4.5.

Let me explain how I got there. I run payroll for more than 2,500 field staff deployed across multiple locations at Knighthood. In my workflow, labour rules get reinterpreted often enough that hard-coding them into traditional software is impractical. I am not a developer. So I built the payroll report on an AI workflow instead. Simple prompt edits when rules change. Faster to build, easier to maintain, and within my actual skill set.

I knew going in that LLM-based workflows can vary across runs and versions. A traditional software program gives the same output for the same input every time. An LLM does not. The model predicts the most plausible response, not a guaranteed one. I accepted this. The tradeoff was worth it: speed and flexibility over certainty.

So I built validation tests to manage the uncertainty. Six of them.

  • The first validated margins.
  • The second checked overtime calculations.
  • The third caught labour law violations.
  • The fourth confirmed all employees were paid correctly.
  • The fifth validated compliance payments.
  • The sixth ran a complete recheck before generating the final report.

I was the human in the loop, and I knew exactly where Haiku 4 tended to drift.

When Haiku 4.5 arrived, the improvement was immediate. The model no longer made the overtime calculation errors that had forced my workarounds. Three of my six tests became redundant. I merged the overlapping checks into fewer steps. Token consumption dropped from 50,000 to 20,000 per run.

My assumption going in was straightforward: upgrading the model means progress, and progress means fewer things to worry about. The new model would be better. Some old tests would become redundant. That was the expected outcome.

What I did not account for was that new models do not only fix old failures. They introduce new ones.

The missed 24-hour continuous work flag was one of them. In my experience across two model upgrades, some known failures disappeared and new ones appeared. The improvement is real. So is the tradeoff. Haiku 4.5 was better on every dimension I had tested it on. In one edge case I had not known to check for, it was not.

My tests did not catch it. I had written them for Haiku 4’s failure patterns. Haiku 4.5 had different ones.

I found the gap through manual review, not through the system I had built. A lucky catch. With more than 2,500 staff deployed, a full manual review of every record is not feasible. I was investigating something else entirely when the anomaly surfaced. I cannot rely on that happening again.

I added a new validation test for the edge case and rebuilt the affected steps. Final token consumption settled at 26,000 against the original 50,000 on Haiku 4. That is 52 per cent of the original cost. The specific overtime errors I had built workarounds for no longer appeared. The system works better now. But the gap between 20,000 tokens and 26,000 tokens is the cost of a failure I found by chance.
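As an illustration only, a check for this edge case could also be written as plain code rather than as a prompt step. The sketch below is not the test described above. It assumes a hypothetical `Shift` record with an employee ID, location, start and end time, and it treats any gap shorter than one hour as "no break". The 24-hour limit comes from the incident; the one-hour break threshold and all names are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Shift:
    # Hypothetical record shape; real attendance data will differ.
    employee_id: str
    location: str
    start: datetime
    end: datetime


def continuous_work_violations(shifts, max_continuous=timedelta(hours=24),
                               min_break=timedelta(hours=1)):
    """Return (employee_id, stretch_start, stretch_end) for every stretch of
    back-to-back shifts, across any locations, longer than max_continuous."""
    by_employee = {}
    for shift in shifts:
        by_employee.setdefault(shift.employee_id, []).append(shift)

    violations = []
    for employee_id, emp_shifts in by_employee.items():
        emp_shifts.sort(key=lambda s: s.start)
        stretch_start, stretch_end = emp_shifts[0].start, emp_shifts[0].end
        for shift in emp_shifts[1:]:
            if shift.start - stretch_end < min_break:
                # Gap too short to count as a break: the stretch continues.
                stretch_end = max(stretch_end, shift.end)
            else:
                if stretch_end - stretch_start > max_continuous:
                    violations.append((employee_id, stretch_start, stretch_end))
                stretch_start, stretch_end = shift.start, shift.end
        # Check the final stretch for this employee.
        if stretch_end - stretch_start > max_continuous:
            violations.append((employee_id, stretch_start, stretch_end))
    return violations


# Made-up example data: E1 works three sites back to back, E2 works normal days.
shifts = [
    Shift("E1", "Site A", datetime(2025, 1, 1, 6), datetime(2025, 1, 1, 18)),
    Shift("E1", "Site B", datetime(2025, 1, 1, 18), datetime(2025, 1, 2, 6)),
    Shift("E1", "Site C", datetime(2025, 1, 2, 6), datetime(2025, 1, 2, 14)),
    Shift("E2", "Site A", datetime(2025, 1, 1, 9), datetime(2025, 1, 1, 17)),
    Shift("E2", "Site A", datetime(2025, 1, 2, 9), datetime(2025, 1, 2, 17)),
]
violations = continuous_work_violations(shifts)
```

In this made-up data, only E1 is flagged, because the three shifts chain into one 32-hour stretch even though no single site recorded more than 12 hours. A check like this does not depend on which model version produced the report, which is the point of keeping at least some rules outside the model.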

Here is the cost of that upgrade that does not show up in the token count. A trained staff member reviewing the same report might not have found the violation. They would have been looking for what I had taught them: Haiku 4’s specific drift patterns. Those patterns were mostly gone. The new ones were invisible to anyone whose mental model was calibrated to the old version.

This is not a story about AI being unreliable. It is a story about what reliability actually means in a system where the foundation keeps improving.

I am building O9X, an operations tool for my team at Knighthood. I started in September 2025, working with Claude Sonnet 4.5. I ran into real problems with complex UI logic. Then Sonnet 4.6 arrived. Most of those problems resolved. My timeline accelerated. Sonnet 4.6 also introduced new UI errors that Sonnet 4.5 did not have. I spent extra time explaining the workflow more clearly before I could trust the output again.

Two Claude versions across the same build. Each one shifting what worked and what required new workarounds.

I had planned to go live in April 2026. I am still building. Operations module done. Payroll next. Then accounting. Four more months before the system is stable enough to call complete. Not because the models are bad. Because the foundation keeps improving, and every improvement requires me to revisit what I built on top of it.

This is the property that makes human-in-the-loop difficult right now. Reliability is version-specific. What the model gets right, what it gets wrong, where it drifts: all of this shifts with every release. Human oversight is only as useful as the mental model behind it is current.

There is a pattern I have noticed in practitioner communities that points in the same direction. In communities such as LocalLLaMA, Llama 3.1 8B is still discussed as a useful fine-tuning base well after release. The appeal is not necessarily frontier capability. It is practical usability: accessible fine-tuning, lower hardware demands, and good-enough performance for specialized tasks. That logic holds in my own experience. A model you know deeply outperforms a better model you have not yet learned to supervise.

The problem arises when companies upgrade the model and assume the oversight carries over automatically. In vendor-managed setups, the foundation may shift underneath the workflow regardless of what the operator decides. For those operators, every vendor update is a potential system change. Map what might have shifted. Run validation checks against the new version. Verify that the human reviewer’s mental model still matches what the system is actually producing.
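One way to make "map what might have shifted" concrete is to record, for each validation check, which model version it was written against. The sketch below is a minimal illustration under that assumption. The `Check` record, the check names, and the version labels such as "haiku-4.5" are hypothetical labels for this example, not real API model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    calibrated_against: str  # model version whose failure patterns this check targets


def stale_checks(checks, current_model):
    """Return the checks written for a different model version than the one running."""
    return [c for c in checks if c.calibrated_against != current_model]


# Hypothetical registry; names and version labels are illustrative only.
checks = [
    Check("overtime_calculation", "haiku-4"),
    Check("labour_law_violations", "haiku-4"),
    Check("continuous_work_24h", "haiku-4.5"),
]

for check in stale_checks(checks, current_model="haiku-4.5"):
    print(f"re-review: {check.name} (calibrated against {check.calibrated_against})")
```

A stale entry does not mean the check is wrong. It means nobody has yet confirmed that the failure it targets still exists, or that the reviewer relying on it knows what the new version gets wrong instead.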

Before I make any change to a system now, I stop and map how it is actually connected. A system produces patterns through information flows between nodes. Change one node and you change how information reaches every other node. The output pattern shifts. So before any change, I identify the core elements. Then I trace how information moves between them. Then I ask which patterns will be different on the other side.
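The tracing step can be sketched as a walk over a small graph. The example below assumes the workflow is written down as a dictionary mapping each node to the nodes it feeds. The node names are a hypothetical simplification of a report pipeline, not a description of the actual system.

```python
from collections import deque


def downstream_of(flows, changed):
    """Return every node that receives information, directly or indirectly,
    from the node that is about to change."""
    affected = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for target in flows.get(node, []):
            if target not in affected and target != changed:
                affected.add(target)
                queue.append(target)
    return affected


# Hypothetical map of information flows: each key feeds the nodes in its list.
flows = {
    "attendance_records": ["model"],
    "prompt": ["model"],
    "model": ["draft_report"],
    "draft_report": ["validation_tests"],
    "validation_tests": ["human_review"],
    "human_review": ["final_report"],
}

print(sorted(downstream_of(flows, "model")))
```

In this toy map, changing the model touches the draft report, the validation tests, the human review, and the final report, while the attendance records and the prompt sit upstream and are unaffected. The code only lists what is reachable. Deciding which patterns will actually differ at each affected node is still a judgment the operator has to make.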

For an AI workflow, that means asking:

  • Which human mental models were built around the current version’s behaviour?
  • Which validation steps check for failure patterns that may no longer exist?
  • Which new failure patterns might now be running through unchecked?

This is not a checklist. It is a way of seeing the system before you disturb it.

Human-in-the-loop is not a bad framework. It holds when the model is frozen and the human knows its exact failure patterns. Apply it to a foundation that keeps improving and you get a loop that looks intact from the outside. It fails quietly, in the places no one knows to look yet.

The question worth asking before the next upgrade is not whether your human is in the loop. It is whether their mental model is still in the same loop the model is running.