Agents at Work: Phase 4 Report
A behavioural audit of how AI judgement holds under repetition, constraint and cross-system evaluation.
This is the best place to start if you are interested in AI reliability, stability and real-world evaluation.
DESCRIPTION
This report presents Phase 4 of the Agents at Work research series, examining how an AI system behaves when assessing age-related bias in job adverts under repeated evaluation.
Building on earlier phases, which identified variation in AI judgement, Phase 4 examines how that behaviour holds at scale and under structured test conditions.
The report applies a behavioural audit approach to observe how judgement changes when the same task is repeated, when input is constrained, and when outputs are compared across systems.
Rather than evaluating individual outputs, the report focuses on observable system behaviour.
Key findings include:
- Judgement stability is conditional, not absolute
- Confidence remains stable even when judgements vary
- Explanations can differ while remaining plausible
- Reduced context produces more uniform, not more cautious, outputs
- Agreement and confidence do not align as reliability signals
Taken together, these findings highlight a distinction between what AI systems produce and how they behave over time.
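As an illustration of the second finding only, the following minimal sketch shows how stable confidence can sit alongside varying judgements for one advert. The run data, the label names and the `summarise` helper are all hypothetical and invented for this example; they are not taken from the report, whose methodology is not published here.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical repeated runs for a single job advert:
# (judgement label, self-reported confidence). Invented values, not report data.
runs = [
    ("biased", 0.90),
    ("not_biased", 0.90),
    ("biased", 0.88),
    ("not_biased", 0.91),
    ("biased", 0.89),
]

def summarise(runs):
    """Summarise judgement stability and confidence across repeated runs."""
    labels = [label for label, _ in runs]
    confidences = [conf for _, conf in runs]
    # Most frequent judgement and how often it occurred.
    modal_label, modal_count = Counter(labels).most_common(1)[0]
    return {
        "modal_label": modal_label,
        # Share of runs that agree with the most frequent judgement.
        "modal_agreement": modal_count / len(runs),
        "mean_confidence": mean(confidences),
        # Population standard deviation of the confidence scores.
        "confidence_spread": pstdev(confidences),
    }

summary = summarise(runs)
print(summary)
```

In this invented case only 60% of runs agree with the most common judgement, yet the confidence scores barely move. A reader looking at any single run would see a confident answer and no sign of the variation.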
WHAT THIS REPORT DOES
Phase 4 examines how AI judgement behaves under:
- repeated execution of the same task
- constrained or reduced input context
- cross-system comparison
- interaction between confidence, agreement and explanation signals
The focus is on behavioural patterns rather than single results.
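One way to picture the cross-system comparison is a chance-corrected agreement score between two systems' judgements on the same adverts. The sketch below uses Cohen's kappa as a stand-in; the report does not state that it uses this measure, and the labels, the system outputs and the `cohen_kappa` helper are assumptions made for illustration.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items where both systems gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each system's own label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both systems used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical judgements from two systems on the same four adverts.
system_a = ["biased", "biased", "not_biased", "not_biased"]
system_b = ["biased", "not_biased", "not_biased", "not_biased"]

kappa = cohen_kappa(system_a, system_b)
print(kappa)
```

A score like this describes agreement between systems, not correctness. Two systems can agree closely and both be unstable under repetition, which is why agreement is treated here as one behavioural signal among several.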
WHAT THIS REPORT DOES NOT DO
This report does not:
- assess real-world discrimination or hiring outcomes
- evaluate employer intent
- provide compliance or legal determinations
- measure model accuracy against ground truth
The analysis focuses on system behaviour under controlled conditions.
WHO THIS REPORT IS FOR
This report is intended for:
- researchers examining AI system behaviour
- audit, risk and assurance professionals
- policymakers and regulators
- practitioners working with AI decision-support systems
RESEARCH CONTEXT
This report forms Phase 4 of the Agents at Work series:
- Phase 1 — detection of age-adjacent language
- Phase 2 — interpretation of that language
- Phase 3 — behavioural variation under repetition
- Phase 4 — evaluation of how that behaviour holds under structured conditions
WHY THIS MATTERS
AI systems are often trusted based on individual outputs.
This report shows that reliability cannot be inferred from a single result.
A system may produce outputs that are clear, confident and well explained, while the underlying judgement does not remain stable.
LICENCE AND USAGE
© 2026 Imogen Hull – Beyond the Average
Licensed under Creative Commons CC BY-NC-ND 4.0.
The underlying methodology, agent design, prompts and analytical framework remain proprietary and are not licensed for reuse.