Agents at Work: Phase 3 Report
A behavioural audit of how AI judgement changes under repetition, ambiguity and constraint.
Introduces the behavioural framework used to evaluate AI judgement.
Description
This report presents Phase 3 of the Agents at Work research series. It examines how an AI system behaves when it is asked, repeatedly and under constrained conditions, to evaluate age-related bias in recruitment language.
Building on earlier phases, which examined where age-related signals appear and how they are interpreted, Phase 3 focuses on how judgement behaves when the same task is performed multiple times.
The report introduces a behavioural framework for evaluating AI systems beyond single outputs, examining patterns of stability, variation and signal response over repeated evaluation.
What This Report Does
Phase 3 examines how AI judgement behaves under:
- repeated execution of the same task
- ambiguous or borderline language
- partial or degraded input context
- variation in internal signals such as confidence and agreement
The report applies a structured behavioural audit to analyse:
- run-to-run judgement stability
- variation in explanations
- confidence behaviour under uncertainty
- consistency of cue identification
- cross-model agreement
- responsiveness of internal self-review signals
- sensitivity to truncated input
The focus is on observable patterns of behaviour across runs rather than on any individual result.
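As an illustration only, two of the measures listed above (run-to-run judgement stability and cross-model agreement) could be computed along the following lines. This is a minimal sketch, not the report's method: the underlying methodology is proprietary, and the function names, the label values ("biased", "neutral") and the sample data are all hypothetical assumptions introduced for this example.

```python
from collections import Counter


def run_stability(labels):
    """Share of repeated runs that match the most common label.

    Assumed definition for illustration: 1.0 means every run
    returned the same judgement for the same input.
    """
    if not labels:
        raise ValueError("need at least one run")
    _, modal_count = Counter(labels).most_common(1)[0]
    return modal_count / len(labels)


def pairwise_agreement(labels_a, labels_b):
    """Fraction of items on which two models gave the same label."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical data: five repeated judgements of one job advert.
runs = ["biased", "biased", "neutral", "biased", "biased"]
stability = run_stability(runs)

# Hypothetical data: two models judging the same four adverts.
model_a = ["biased", "neutral", "biased", "neutral"]
model_b = ["biased", "neutral", "neutral", "neutral"]
agreement = pairwise_agreement(model_a, model_b)

print(f"run-to-run stability: {stability:.2f}")
print(f"cross-model agreement: {agreement:.2f}")
```

Neither figure says whether a judgement is correct. Both describe only how consistently the judgement is reproduced, which matches the report's stated scope of not measuring accuracy against ground truth.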
What This Report Does Not Do
This report does not:
- assess real-world discrimination or hiring outcomes
- determine employer intent
- provide compliance or legal determinations
- measure model accuracy against ground truth
The analysis focuses on system behaviour under controlled conditions.
Who This Is For
This report is intended for:
- researchers examining AI system behaviour
- audit, risk and assurance professionals
- policymakers and regulators
- practitioners working with AI decision-support systems
Research Context
This report forms Phase 3 of the Agents at Work series.
- Phase 1 examines detection of age-related signals
- Phase 2 examines interpretation of those signals
- Phase 3 examines how judgement behaves under repetition and constraint
This phase establishes the behavioural perspective that underpins later evaluation work.
Why This Matters
AI systems are often trusted on the basis of individual outputs and fluent explanations.
Phase 3 shows that a single output or explanation does not fully reflect how a system behaves over time. Reliability emerges from patterns of behaviour, not from a single result.
Licence and Usage
© 2026 Imogen Hull – Beyond the Average
Licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0).
The underlying methodology, agent design and analytical framework remain proprietary.