Essay · AI Alignment

The Need for Normative World Models for Real Alignment

The Normative Gap in World-Model-Based AI Safety

Kevin Baum · German Research Center for Artificial Intelligence (DFKI) & Saarland Informatics Campus
This is an early draft of an essay I am still working on. Feedback is always welcome → kevin.baum@dfki.de
Abstract

Some of the most influential voices in AI safety now argue that safe AI requires explicit world models—a significant advance over approaches that compress all knowledge into monolithic neural parameters. Yoshua Bengio’s Scientist AI is the most developed articulation. A converging insight is emerging from Anthropic, where Amanda Askell argues that giving models the reasons behind desired behaviors enables more robust generalization.

I argue that both insights converge on the same conclusion—and neither goes far enough. Bengio’s harm-probability thresholds cannot preserve the internal structure of normative reasoning. Askell’s reasons lack formal structure for verification and systematic inference. A genuinely aligned agent needs normative world models: structured representations of reasons, priorities, and defeasibility conditions—parallel to the causal world models that Bengio’s framework maintains on the descriptive side.

The Diagnosis

Current alignment approaches compress normative structure into scalars or checklists. This flattening problem destroys the very features that make normative reasoning robust: defeasibility, priority, context-sensitivity, justificatory structure.

The Proposal

Normative world models use Horty’s default logic as a normative lingua franca, generating justificatory hypotheses that parallel Bengio’s explanatory hypotheses—learnable, inspectable, correctable.

The Synthesis

Bengio showed safe AI needs world models. Askell showed aligned AI needs reasons. The synthesis: a world model whose content is reasons and whose structure preserves what makes normative reasoning contestable.

1. The Right Instinct—and Its Limits

“Instead of just saying, ‘here’s a bunch of behaviors that we want,’ we’re hoping that if you give models the reasons why you want these behaviors, it’s going to generalize more effectively in new contexts.”

This is Amanda Askell, the philosopher heading the team responsible for shaping Claude’s character at Anthropic, explaining the thinking behind Claude’s new constitution (Ostrovsky & Perrigo 2026). She is articulating an insight that a growing number of researchers are arriving at independently: normative guidance works better when it is reason-based rather than rule-based. Give the model reasons, and it can generalize. Give it rules, and it breaks at the boundary of the rule’s scope. This resonates with the growing recognition that preference-based approaches face fundamental limitations (Zhi-Xuan et al. 2025), which strengthens the case for reason-based alternatives.

Askell is right. And the shift in Anthropic’s approach—from the original constitution’s checklist of standalone principles to a 23,000-word document that explains why certain behaviors matter (Anthropic 2026)—reflects a genuine advance. It is no longer enough to specify what the model should do. We need to explain why.

But there is a gap between having the right instinct and having the right architecture. When Askell gives Claude reasons in natural language, several things remain uncertain. First, whether the model has actually internalized the reason-based structure or merely learned surface-level behavioral patterns. Second, the “matching” between the constitution’s reasons and the model’s behavior happens entirely in the model’s latent space—it is opaque. Third, and most fundamentally, the constitution provides no formal mechanism for representing how reasons interact: which defeats which in which context, what preempts what, how priorities shift.

This essay proposes an architecture that takes Askell’s instinct seriously—and builds the machinery to make it work. The starting point, besides work on reason-based neuro-symbolic alignment architectures that I have done with my group at the German Research Center for Artificial Intelligence (Jahn et al. 2026), is a parallel development in AI safety: Yoshua Bengio’s argument that safe AI requires explicit world models.

2. The World Model Turn—and What It Gets Right

For years, the field has tried to align AI systems by shaping training signals—whether through reward functions learned from human preferences (RLHF; Christiano et al. 2017; Ouyang et al. 2022), AI-generated feedback guided by explicit principles (Constitutional AI; Bai et al. 2022), or direct preference optimization (Rafailov et al. 2023)—all of which ultimately compress normative guidance into scalar reward signals (for a comprehensive analysis of the resulting limitations, see Casper et al. 2023). Against this background, a growing number of safety researchers are now converging on a deeper architectural insight.

The insight is this: truly safe AI requires explicit world models.

The most developed articulation comes from Yoshua Bengio. The Scientist AI proposal—first outlined in a blog post (Bengio 2024) and subsequently developed into a full research agenda (Bengio et al. 2025), now pursued institutionally through the non-profit LawZero—argues that frontier AI systems should operate not as monolithic function approximators but as hypothesis-generating engines. The core framework: an AI maintains a Bayesian posterior P(H|D) over explanatory hypotheses H—explanatory models of the world, where causal structure is favored for its robustness to distributional shift—given data D. When the AI considers an action, it queries this posterior: does any plausible hypothesis predict harm? If not, the action falls within convergent safety bounds.
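To make the threshold query concrete, here is a minimal sketch in Python of how a posterior over explanatory hypotheses could gate an action by bounded harm probability. All names, signatures, and threshold values are illustrative assumptions of mine, not LawZero’s implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ExplanatoryHypothesis:
    """A candidate causal model H together with its posterior weight P(H | D)."""
    name: str
    posterior: float                            # P(H | D), normalized over the hypothesis set
    harm_given_action: Callable[[str], float]   # P(harm | action, H) under this hypothesis

def within_safety_bounds(hypotheses: List[ExplanatoryHypothesis],
                         action: str,
                         plausibility: float = 0.05,
                         harm_threshold: float = 0.01) -> bool:
    """Conservative Scientist-AI-style query: block the action if any sufficiently
    plausible hypothesis assigns it a harm probability above the threshold."""
    return all(
        h.harm_given_action(action) <= harm_threshold
        for h in hypotheses
        if h.posterior >= plausibility
    )
```

The point of the sketch is only the shape of the query: safety is checked against every plausible hypothesis, not against a single point estimate.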

Crucially, the Scientist AI is non-agentic by design. It decouples intelligence—the capacity to model the world, generate hypotheses, and make predictions—from agency—the capacity to pursue goals and execute actions. The world model informs decision-making; it does not drive it. This separation is central to the safety case.

This is a genuine advance because the proposal rejects three features of the dominant paradigm: opacity (as discussed above, current approaches all compress normative guidance into scalar reward signals, making the normative reasoning opaque), brittleness (implicit representations produce confidently wrong judgments under distribution shift), and irrevisability (normative content baked into weights cannot be directly inspected or revised at deployment time).

So far, so good. But there is a gap—and it is not a minor one.

3. The Normative Gap: A Representational Problem

The Scientist AI’s world model captures the causal structure of the environment. It answers questions of the form: “What happens if the agent does X?” It does not—and is not designed to—answer: “What should the agent do, and why?” The gap between these questions is the normative gap.

A natural response: expand the definition of “harm.” Bengio himself invokes the UN Universal Declaration of Human Rights. If harm is broad enough, safety bounds become alignment bounds.

This response has the right extensional ambition but follows an insufficient representational strategy. The problem is not that Bengio’s framework has too narrow a concept of harm. The problem is that a harm-probability threshold is the wrong representational format for normative assessment.

You cannot recover the difference between an exclusionary reason and a prima facie obligation from a harm probability. The representational format itself is wrong.

A threshold approach cannot represent positive normative direction (which goals should be pursued, not just which harms avoided), priority among competing considerations (different reasons with context-sensitive orderings), exclusionary constraints (considerations that preempt balancing rather than entering it; cf. Raz 1990), defeasibility (reasons that cease to apply under specific conditions, not merely outweighed; cf. Pollock 1987), or justificatory structure (why an action is permissible, not merely that it is below a risk boundary).

These are the basic phenomena of normative reasoning. Any representation that compresses them into a scalar or binary has committed the flattening problem: the systematic loss of normative structure that occurs when alignment approaches compress structured normative reasoning into representational formats that cannot preserve it. Crucially, this is not a matter of dimensionality. Whether the quantitative signal is a scalar reward, a multi-dimensional vector, or a high-dimensional tensor is immaterial: the problem is that purely quantitative formats cannot represent the qualitative relations—priority, defeat, exclusion, justificatory structure—that constitute normative reasoning. This is why reward-based approaches to alignment face a principled limitation, not merely a practical one: no amount of scaling the reward signal’s dimensionality can recover the normative structure that was lost in the compression.

This is a consequence of Bengio’s own logic. If implicit representations cannot reliably capture causal structure, the same argument applies a fortiori to normative structure—because normative structure is even harder to capture implicitly than causal structure along every dimension that matters. First, causal models face empirical correction: the world provides feedback on whether your causal hypotheses hold. Normative relations—which facts constitute reasons, how priorities shift across contexts—lack this external corrective. An implicit normative model that gets things wrong produces judgments whose errors are detectable only through the very kind of structured normative reasoning that has been flattened away. Second, normative reasoning is structurally richer: it involves not only defeasibility (which causal reasoning shares) but also exclusionary constraints that preempt balancing, undercutting defeaters that remove a reason’s evidential basis rather than merely outweighing it, and genuinely unresolved conflicts between competing considerations. Third, normative models must not only produce correct outputs but justify them: robust alignment arguably requires acting for the right reasons, not merely acting rightly—which is precisely the instinct behind Askell’s shift toward reason-based guidance. If agents need causal world models to navigate physical environments, they need normative world models to navigate normative environments.

4. Normative World Models: Definition and Architecture

A normative world model (NWM) is a structured representation layer that captures the normative landscape relevant to the agent’s domain. Just as Bengio’s Scientist AI maintains a causal world model to answer “What will happen?”, an aligned agent needs a NWM to answer: “What considerations bear on whether I should do X, how do they relate, and what would justify X in this context?” And just as the Scientist AI is non-agentic, the NWM constrains and steers rather than pursues goals.

Components of a Normative World Model

Reasons — considerations that count for or against actions. Not preferences, not values (though values can ground reasons). Context-sensitive and defeasible.

Priority relations — context-sensitive orderings among reasons that shift across domains and situations.

Defeaters — conditions under which a reason ceases to apply: rebutting defeaters (stronger opposing reason) and undercutting defeaters (removal of evidential support).

Exclusionary constraints — reasons that preempt first-order balancing entirely, removing actions from the permissible space regardless of expected utility.

Context-sensitivity rules — how reason activation depends on the situation, including escalation conditions for unforeseen circumstances.
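To fix ideas, the components above can be read as a small data model. The following sketch is illustrative only; the class and field names are my own and are not meant as a fixed specification of an NWM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Situation = Dict[str, object]   # observed features of the current context

@dataclass
class Reason:
    """A consideration counting for or against an action type; context-sensitive and defeasible."""
    id: str
    favors: str = ""                                        # action type this reason supports, if any
    opposes: str = ""                                       # action type this reason counts against, if any
    applies: Callable[[Situation], bool] = lambda s: True   # context-sensitivity rule

@dataclass
class Defeater:
    """A condition under which a reason ceases to apply."""
    target_reason: str
    kind: str                                               # "rebutting" or "undercutting"
    condition: Callable[[Situation], bool]

@dataclass
class ReasonModel:
    """A reason model: reasons, a priority ordering, defeaters, and exclusionary constraints."""
    reasons: List[Reason]
    priorities: List[Tuple[str, str]] = field(default_factory=list)   # (weaker_id, stronger_id)
    defeaters: List[Defeater] = field(default_factory=list)
    exclusions: List[Callable[[Situation], List[str]]] = field(default_factory=list)  # situation -> excluded action types
```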

The Structural Mapping

Represents — Causal world model: the causal landscape (which variables cause which effects). Normative world model: the normative landscape (which facts constitute reasons for or against which actions).

Formalism — Causal: a Bayesian posterior over causal hypotheses. Normative: reason models / default logic (Horty).

Hypotheses — Causal: explanatory hypotheses H, causal models compatible with data D. Normative: justificatory hypotheses J, reason models compatible with D and normative feedback F.

Posterior — Causal: P(H|D). Normative: P(J|D,F).

Query — Causal: does any plausible H predict harm? Normative: is there a plausible J that yields action a as permissible, and what are the justifying reasons?

Output — Causal: harm probability bounds → safety bounds. Normative: permissible actions plus justifications → alignment bounds.

Convergence — Causal: more compute → tighter safety bounds. Normative: more compute plus better feedback → tighter alignment bounds.

The NWM posterior P(J|D,F) has an additional conditioning variable: normative authority feedback F. Normative structure is not discovered merely by observing the world; it is constituted by human normative practice. And normative pluralism is constitutive, not a bug: different stakeholders may hold legitimate but incompatible reason models.
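Schematically, and assuming for simplicity that observations D and normative feedback F are conditionally independent given a reason model J, the update could look as follows; every name here is an illustrative placeholder.

```python
def update_posterior(prior, data_likelihood, feedback_likelihood, data, feedback):
    """Schematic Bayesian update over reason models J:
    P(J | D, F) is proportional to P(D | J) * P(F | J) * P(J),
    where `feedback` records verdicts from normative authorities on past justifications."""
    unnormalized = {
        j: prior[j] * data_likelihood(data, j) * feedback_likelihood(feedback, j)
        for j in prior
    }
    total = sum(unnormalized.values())
    return {j: weight / total for j, weight in unnormalized.items()}
```

The structural point is that F is a separate conditioning variable: without feedback from human normative practice, no amount of descriptive data pins down the reason model.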

Why NWMs Must Be Explicit

Even if an implicit normative representation produced correct judgments in every case, it would fail to satisfy what makes normative reasoning normative: contestability requires inspectability; oversight requires verifiability; revision requires modularity; justification requires structure. Bengio’s own argument for explicit causal world models applies here with even greater force: if we demand explicit models for predicting consequences because the stakes are too high for implicit pattern-matching, we must a fortiori demand them for normative reasoning, where the stakes include justice, rights, and legitimate governance.

5. The Normative Lingua Franca: Reasons and Horty’s Framework

I propose that normative reasons—in the sense developed by Raz (1990), Scanlon (1998), Parfit (2011), and Dancy (2004), and formalized by Horty (2012), with the Dagstuhl Seminar on Normative Reasoning for AI (Ciabattoni et al. 2023) explicitly connecting these formalisms to AI system design—provide the right normative lingua franca. Reasons are the natural unit of normative discourse: when we justify, we cite reasons; when we contest, we challenge reasons; when we hold accountable, we ask for reasons.

Building on Horty’s framework and our own formalization in the GRACE architecture (Jahn et al. 2026), a reason model is a triple ⟨R, D, <⟩: parametrized reasons R, default rules D mapping reason-types to Macro Action Types, and a priority ordering <. Given observations, this is grounded into a situation-specific model ⟨R↓, D↓, <↓⟩ on which Horty’s defeasible inference determines which actions are justified.
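As a toy illustration of the grounding step and the query it supports, here is a sketch that reuses the data model from Section 4. It is deliberately crude: it is neither Horty’s full formalism nor the GRACE implementation, and the defeat-handling is simplified to keep it short.

```python
# Uses ReasonModel, Reason, Defeater, and Situation from the sketch in Section 4.

def ground(model: ReasonModel, situation: Situation):
    """Instantiate the situation-specific model (R-down, D-down, <-down): keep reasons
    whose context rule fires and that are not undercut, plus the surviving priorities
    and any exclusionary constraints triggered by the situation."""
    active = [r for r in model.reasons if r.applies(situation)]
    undercut = {d.target_reason for d in model.defeaters
                if d.kind == "undercutting" and d.condition(situation)}
    active = [r for r in active if r.id not in undercut]
    ids = {r.id for r in active}
    priorities = [(lo, hi) for (lo, hi) in model.priorities if lo in ids and hi in ids]
    excluded = {a for rule in model.exclusions for a in rule(situation)}
    return active, priorities, excluded

def justified(action: str, model: ReasonModel, situation: Situation) -> bool:
    """Simplified defeasible query: an action is justified if it is not excluded,
    some surviving reason favors it, and no surviving opposing reason strictly
    outranks every reason that favors it."""
    active, priorities, excluded = ground(model, situation)
    if action in excluded:
        return False
    pro = [r for r in active if r.favors == action]
    con = [r for r in active if r.opposes == action]
    if not pro:
        return False
    outranked = set(priorities)   # pairs (weaker_id, stronger_id)
    return not any(all((p.id, c.id) in outranked for p in pro) for c in con)
```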

Horty’s default logic and Bengio’s Bayesian framework are not competitors. Default logic models the structure of normative reasoning; Bayesian inference models uncertainty over reason models. Both are needed: the NWM maintains P(J|D,F) over reason models J, where each J has internal structure in Horty’s formalism. We also still need a causal world model—for which there are various approaches (e.g., Pearl 2009)—alongside the normative one: the two constrain the action space jointly.

6. Justificatory Hypotheses and Learning NWMs

Where do reason models come from? I propose generative reason-instance generators producing justificatory hypotheses—structured, defeasible proposals about which reasons support an action in context.

A justificatory hypothesis is a grounded reason model ⟨R↓, D↓, <↓⟩ for a specific situation—offered as an inspectable hypothesis subject to human evaluation, not as a definitive normative verdict. Multiple justificatory hypotheses may exist for the same situation, reflecting genuine normative pluralism.

Three features matter: normative humility is built in (the system proposes, not pronounces); defeasibility is preserved (hypotheses can be challenged by better reasons, not just falsified against data); and convergence extends (more and better feedback → tighter alignment bounds).

The Learning Pipeline

Step 1: Structured elicitation of reason models from human stakeholders—drawing on piecemeal normative knowledge acquisition (Canavotto & Horty 2022) and our own work on reason-sensitive agents (Baum et al. 2024)—and, promisingly, from well-aligned AI systems themselves, which can serve as draft reason-model generators whose outputs humans review and correct rather than building from scratch.

Step 2: Formal encoding in machine-readable formats (ASP, argumentation frameworks, or purpose-built DSLs).

Step 3: Training a generative model to produce multiple candidate reason models per situation—preserving normative pluralism. Formal reason models are paired with natural-language annotations that preserve the productive vagueness of real normative reasoning.

Step 4: Structured evaluation: are the reasons relevant? Priorities defensible? Defeaters missing? Rejected hypotheses feed back as training signal.

Step 5: Runtime deployment. When multiple JHs are available, the agent’s choice among them is itself subject to meta-justification and coherence constraints— preventing opportunistic justification-shopping. Escalation to human oversight when coverage is insufficient or stakes are high.
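Read as pseudocode-level Python, the five steps form a loop in which rejected hypotheses become training signal. Every function and object here (elicit, encode, generator, oversight, sample_situations) is an invented placeholder for a component the text describes, not an existing API.

```python
def nwm_learning_loop(elicit, encode, generator, sample_situations, oversight, rounds=10):
    """Illustrative outer loop over Steps 1-5; all arguments are caller-supplied components."""
    library = [encode(m) for m in elicit()]        # Steps 1-2: elicit reason models, encode them formally
    for _ in range(rounds):
        generator.train(library)                   # Step 3: learn to propose candidate reason models
        for situation in sample_situations():
            hypotheses = generator.propose(situation, k=5)        # several JHs per situation: pluralism
            verdicts = oversight.evaluate(situation, hypotheses)  # Step 4: relevance, priorities, defeaters
            library += [h for h, ok in zip(hypotheses, verdicts) if ok]
            generator.record_rejections(
                [h for h, ok in zip(hypotheses, verdicts) if not ok])  # rejections as training signal
    return generator                               # Step 5: deploy, with runtime escalation handled by the agent
```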

How This Differs from Constitutional AI

Constitutional AI gives the model reasons during training and hopes it internalizes them. The NWM approach differs in three fundamental respects. First, CAI encodes reasons as natural-language principles whose internal structure remains opaque to the model—there is no formal representation of how reasons interact, no explicit priority ordering, no mechanism for defeasibility. A NWM encodes these in machine-readable formalisms where priorities, defeaters, and exclusionary constraints are explicit and inspectable. Second, CAI in the narrow sense operates at training time: the constitution shapes synthetic training data and RLAIF feedback, after which the normative content is compressed into the model’s weights. Anthropic’s current practice goes further by including constitutional guidance in the model’s runtime instructions—a genuine advance, but one that remains natural-language-based and provides no formal mechanism for reasoning about how principles interact or override one another. A NWM makes reason structure available at runtime in a form that can be queried, verified, and revised without retraining. Third, CAI provides no mechanism for the interaction of reasons: when two constitutional principles conflict, the resolution happens in the model’s latent space, invisibly. A NWM handles conflict through explicit priority relations and defeasible inference, making trade-offs transparent and contestable. The NWM approach thus builds reason structure into the architecture so that internalization is not required for reasons to govern behavior—though the richer normative signal the NWM provides may well improve internalization too.

7. The Extended Architecture

The proposal extends Bengio’s Scientist AI with a normative layer. An action must (a) fall within the safety bounds of the causal model and (b) be supported by a defensible justificatory hypothesis from the normative model. This unlocks normative counterfactual reasoning—but also introduces a risk: justification hacking, where the agent manipulates context so that its preferred action becomes justified. This motivates the architectural separation between the NWM and the decision-making module.
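A minimal sketch of the conjunctive gate. The interfaces of causal_wm and normative_wm are assumptions of mine; the point is only that the decision module consumes both world models and returns the justifications for later inspection rather than producing them itself.

```python
def evaluate_action(action, situation, causal_wm, normative_wm, harm_threshold=0.01):
    """An action is admissible only if (a) it stays within the causal model's safety
    bounds and (b) at least one plausible justificatory hypothesis supports it.
    The justifications are returned so that oversight can inspect and contest them."""
    safe = causal_wm.harm_probability(action, situation) < harm_threshold
    justifications = normative_wm.justificatory_hypotheses(action, situation)
    supported = any(j.is_defensible() for j in justifications)
    return (safe and supported), justifications
```

The separation matters for the justification-hacking worry: the decision module can query the NWM but cannot rewrite it.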

The two world models share the same safety-relevant properties:

Inspectability — Causal world model: causal hypotheses can be examined. Normative world model: reason structure can be examined.

Defeasibility — Causal: hypotheses are tested against data. Normative: hypotheses are challenged by better reasons.

Revisability — Causal: the causal model can be updated. Normative: the reason model can be updated.

Transparency — Causal: “I predicted X because of causal path Y.” Normative: “I concluded A is justified because of justificatory structure J.”

Non-agentic — Causal: the world model informs. Normative: the NWM constrains and steers.

8. Normative Authority and the Governance Interface

The fundamental question of AI alignment is not only technical—how do we encode norms?—but political: whose norms, whose reasons, whose priorities? A NWM architecture does not answer this question. But it provides the architectural interface at which it can be posed and answered—by various normative authorities (users, deployers, regulators, democratic processes) intervening at different points of the reason model: specifying the reasons that apply, setting their priorities, defining escalation conditions, and contesting justifications after the fact (cf. Gabriel 2020; Gabriel & Keeling 2025; Steingrüber & Baum 2026).

This transforms human oversight from monitoring opaque outputs to interacting with an explicit normative model—a shift from reactive oversight (watching the agent act) to anticipatory oversight (shaping the normative space within which it acts). Anticipatory oversight is not a one-time specification but an iterative process:

The Design-Time / Runtime / Inspection-Time Spiral

Design-time: Normative authorities specify the agenda: which reasons apply, how they are prioritized, what the escalation conditions are. This is where the substantive question of AI alignment enters the architecture.

Runtime: The agent operates within the reason model, with escalation to human oversight when coverage is insufficient or stakes are high.

Inspection-time: Stakeholders review the agent’s justifications—not post-hoc rationalizations, but faithful reports of which reasons applied, how they were prioritized, and whether defeaters were active. Gaps and miscalibrations identified here feed into the next design-time iteration.

Crucially, this spiral makes the NWM architecture modifiable without retraining: adding a constraint, adjusting a priority, or specifying a new defeater requires updating the reason model, not fine-tuning the underlying neural network. This is a decisive advantage over RLHF and Constitutional AI, where revising normative expectations requires expensive retraining cycles that make iterative refinement impractical.
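To make “modifiable without retraining” concrete: in the Section 4 sketch, an oversight intervention is an edit to a data structure rather than a gradient update. The helper names below are hypothetical.

```python
# Uses ReasonModel and Defeater from the sketch in Section 4.

def add_defeater(model: ReasonModel, target_reason: str, condition) -> ReasonModel:
    """Oversight intervention: register a new undercutting defeater for a reason.
    The underlying neural components are untouched; only the explicit reason model changes."""
    model.defeaters.append(Defeater(target_reason=target_reason,
                                    kind="undercutting", condition=condition))
    return model

def raise_priority(model: ReasonModel, weaker: str, stronger: str) -> ReasonModel:
    """Oversight intervention: record that `stronger` now outranks `weaker`."""
    if (weaker, stronger) not in model.priorities:
        model.priorities.append((weaker, stronger))
    return model
```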

The justifications generated by NWMs also enable an interlocking of forward-looking and backward-looking responsibility. Those who set the normative agenda bear prospective responsibility for its quality; the justifications the system produces enable retrospective accountability by revealing whether and how the agent operated within its mandate. If the agent acted within the reason model and the outcome was nevertheless harmful, responsibility traces to the agenda-setters. If the agent deviated, justifications reveal where and why, directing accountability to the appropriate level. This prevents the “responsibility gaps” that plague opaque alignment approaches, where harmful outcomes cannot be traced to any identifiable decision.

9. Walking Through the Door

The world-model turn in AI safety is a genuine paradigm shift. But it has been applied asymmetrically: sophisticated causal world models for physical navigation, flattened and opaque methods for normative navigation.

Askell’s instinct points in the right direction: give models reasons. Bengio’s architecture provides the structural template: explicit, falsifiable world models—an idea he has taken seriously enough to build an entire research organization around (Bengio et al. 2025; LawZero). The synthesis is a normative world model: a world model whose content is reasons and whose structure preserves the features that make normative reasoning robust, generalizable, and contestable.

We do not claim to present a finished solution here—but a better approach to alignment. The NWM framework allows the relevant challenges to be differentiated rather than collapsed: the representational challenge (how to encode normative structure), the elicitation challenge (whose reasons, whose priorities; cf. Gabriel 2020; Steingrüber & Baum 2026), the learning challenge (how to acquire reason models at scale), and the integration challenge (how to combine symbolic normative reasoning with neural capabilities). Unlike approaches that flatten these problems into a single optimization target, the NWM architecture provides a principled interface at which each challenge can be addressed on its own terms—using, for instance, the theoretical resources that actually apply: defeasible logic for normative reasoning, social choice theory for preference aggregation (Conitzer et al. 2024), Bayesian methods for uncertainty, and political philosophy—from Rawlsian contractualism to Habermasian deliberative democracy—for the legitimation of normative authority. Current alignment approaches, by contrast, do not even provide the architectural site at which such theories could gain traction.

Many important desiderata—inspectability, contestability, modifiability, justificatory transparency—are guaranteed by design rather than hoped for as emergent properties. The question is whether we are willing to take both insights we started from seriously: that reasons matter and that world models matter. If so, it is time to at least begin building the normative world models that real alignment requires.

Further Reading

The arguments in this piece draw on and synthesize: Jahn, Muskalla, Dargasz, Schramowski & Baum, Breaking Up with Normatively Monolithic Agency with GRACE: A Reason-Based Neuro-Symbolic Architecture for Safe and Ethical AI Alignment (IASEAI 2026).

For questions, collaboration, or to disagree: kevin.baum@dfki.de

References

Anthropic. (2026). “Claude’s New Constitution.” anthropic.com/news/claude-new-constitution. Full text: anthropic.com/constitution.

Anthropic. (2023). “Claude’s Constitution.” anthropic.com/news/claudes-constitution.

Bai, Y. et al. (2022). “Constitutional AI: Harmlessness from AI Feedback.” arXiv:2212.08073.

Baum, K., Dargasz, L., Jahn, F., Gros, T. P., & Wolf, V. (2024). “Acting for the Right Reasons: Creating Reason-Sensitive Artificial Moral Agents.” arXiv:2409.15014.

Bengio, Y. (2024). “Towards a Cautious Scientist AI with Convergent Safety Bounds.” Blog post, Feb 26. yoshuabengio.org.

Bengio, Y. et al. (2025). “Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?” arXiv:2502.15657.

Canavotto, I., & Horty, J. (2022). “Piecemeal Knowledge Acquisition for Computational Normative Reasoning.” AAAI/ACM Conference on AI, Ethics, and Society, 171–180.

Casper, S. et al. (2023). “Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.” arXiv:2307.15217.

Christiano, P. et al. (2017). “Deep Reinforcement Learning from Human Preferences.” NeurIPS 30. arXiv:1706.03741.

Ciabattoni, A., Horty, J. F., Slavkovik, M., van der Torre, L., & Knoks, A. (2023). “Normative Reasoning for AI (Dagstuhl Seminar 23151).” Dagstuhl Reports, 13(4), 1–23.

Conitzer, V. et al. (2024). “Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback.” ICML 2024, PMLR 235, 9346–9360.

Dancy, J. (2004). Ethics Without Principles. Oxford University Press.

Gabriel, I. (2020). “Artificial Intelligence, Values, and Alignment.” Minds and Machines, 30, 411–437.

Gabriel, I., & Keeling, G. (2025). “A Matter of Principle? AI Alignment as the Fair Treatment of Claims.” Philosophical Studies, 182, 1951–1973.

Horty, J. F. (2012). Reasons as Defaults. Oxford University Press.

Jahn, F., Muskalla, Y., Dargasz, L., Schramowski, P., & Baum, K. (2026). “Breaking Up with Normatively Monolithic Agency with GRACE: A Reason-Based Neuro-Symbolic Architecture for Safe and Ethical AI Alignment.” IASEAI 2026. arXiv:2601.10520.

Ostrovsky, N., & Perrigo, B. (2026). “How Do You Teach an AI to Be Good? Anthropic Just Published Its Answer.” TIME, Jan 21. time.com.

Ouyang, L. et al. (2022). “Training Language Models to Follow Instructions with Human Feedback.” NeurIPS 35. arXiv:2203.02155.

Parfit, D. (2011). On What Matters. Oxford University Press.

Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press.

Pollock, J. L. (1987). “Defeasible Reasoning.” Cognitive Science, 11(4), 481–518.

Rafailov, R. et al. (2023). “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” NeurIPS 36. arXiv:2305.18290.

Raz, J. (1990). Practical Reason and Norms. Oxford University Press.

Scanlon, T. M. (1998). What We Owe to Each Other. Harvard University Press.

Steingrüber, A., & Baum, K. (2026). “Justifications for Democratizing AI Alignment and Their Prospects.” In Bridging the Gap Between AI and Reality, LNCS 16220, 146–159.