We are living in a time, as OpenClaw, Manus AI, and other developments attest, in which AI agents increasingly need to cooperate—with each other, with human principals and counterparts, and within institutional structures, some of which have yet to be created or to emerge. I argue that current agent architectures lack the normative structure that cooperation requires. Cooperative commitments demand more than behavioral regularities; they demand inspectable, structured, and credibly enforceable normative constraints that interaction partners can verify, negotiate, and trust in a scalable manner.
This essay argues that our neuro-symbolic GRACE architecture (Governor for Reason-Aligned ContainmEnt)—originally designed for safe and ethical single-agent containment—provides an architectural substrate that can push the boundaries of cooperative AI. Its reason-based, defeasible normative constraints can be extended to implement cooperative contracts, institutional compliance, and multi-agent governance. The central theoretical contribution is a new account of credible commitment through structured defeasibility: commitment not as the removal of options, but as a normatively justified structure of reasons that specifies precisely which considerations can and cannot override a cooperative agreement.
1. The Problem: Cooperation Demands Normative Surfaces
Problems of cooperation—in which agents have opportunities to improve their joint utility but struggle to do so because of issues induced by the multi-agent setting—are ubiquitous. As AI agents are deployed in increasingly autonomous roles, they encounter cooperation problems at every scale: from pairs of agents negotiating resource access, to hierarchies of agents delegating and coordinating, to ecosystems of agents embedded in institutional structures.
A growing research community, catalyzed by the Open Problems in Cooperative AI agenda (Dafoe et al. 2020), has organized this challenge around four cooperative capabilities: understanding other agents, communicating effectively, constructing credible commitments, and building institutions that structure cooperative behavior.
The first two capabilities—understanding and communication—have received extensive attention in multi-agent reinforcement learning, emergent communication, and theory-of-mind research. The latter two—commitment and institutions—are, it seems to me, still less technically developed, despite being widely recognized as the harder and arguably more important challenges. Fearon (2020) conjectures that institutional design—“changing the game”—is the deeper cooperation problem.
I argue that a current bottleneck is architectural. Currently dominant agent architectures—LLM-based agents as an especially popular example—are normatively opaque: their behavioral dispositions are compressed into neural weights that cannot be inspected, contested, negotiated, iteratively adapted, verified, or formally constrained. They lack what I call normative surfaces—structured interfaces through which cooperative commitments can be expressed, verified, and enforced.
You cannot build credible commitments on opaque behavioral tendencies. You cannot implement institutional compliance when the agent’s “values” are inaccessible latent representations. You cannot govern what you cannot structure.
2. GRACE: From Single-Agent Containment to Cooperative Infrastructure
The GRACE architecture (Jahn et al. 2026) was designed to solve the flattening problem in single-agent alignment: the systematic loss of normative structure that occurs when alignment approaches compress structured normative reasoning into quantitative reward signals. GRACE addresses this by separating normative deliberation from instrumental decision-making. The core agent’s own architecture is not modified—rather, GRACE wraps around it as a governor, enriching the agent’s input with normative information (which MATs are permissible and why) and providing an extended decision-making context within which the core agent operates. Think of it as equipping the agent with a normative scaffold, not rewiring its internals.
At its heart lies a Reason Theory ⟨R, D, <⟩—a formal structure consisting of a set of parametrized reason types R (observable facts that count for or against actions), a set of default rules D mapping reason types to Macro Action Types (MATs—high-level, abstract behavioral categories like “protect patient data” or “report foreseeable harm”), and a partial priority ordering < over defaults that captures which reasons take precedence when they conflict. Crucially, this formalism—extending Horty’s (2012) default logic—supports defeasible reasoning: defaults can be defeated by higher-priority countervailing considerations, and the resulting inferences are formally well-characterized.
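To make this concrete, here is a minimal sketch of how a Reason Theory might be represented as a data structure, together with a deliberately crude stand-in for Horty-style inference. All names (d_privacy, share_records, and so on) are illustrative assumptions rather than GRACE's actual vocabulary, and the inference function only approximates the real defeat semantics:

```python
# Minimal sketch of a Reason Theory <R, D, <>. Names and the inference rule are
# illustrative stand-ins, not the GRACE formalism or codebase.
from dataclasses import dataclass

@dataclass(frozen=True)
class Default:
    name: str      # e.g. "d_privacy"
    reason: str    # triggering reason type (an observable fact)
    mat: str       # Macro Action Type the default bears on
    permits: bool  # True: the MAT is supported; False: the MAT is excluded

D = [
    Default("d_privacy", "handles_patient_data", "share_records", permits=False),
    Default("d_consent", "patient_consented",    "share_records", permits=True),
]
PRIORITY = {("d_privacy", "d_consent")}  # (lower, higher): consent outranks the privacy default

def outranks(a: Default, b: Default) -> bool:
    return (b.name, a.name) in PRIORITY

def permissible_mats(triggered: set[str], all_mats: set[str]) -> set[str]:
    """A default is in force if its reason is triggered and no triggered, conflicting,
    higher-priority default defeats it; MATs excluded by in-force defaults are removed."""
    active = [d for d in D if d.reason in triggered]
    in_force = [d for d in active
                if not any(e.mat == d.mat and e.permits != d.permits and outranks(e, d)
                           for e in active)]
    excluded = {d.mat for d in in_force if not d.permits}
    return all_mats - excluded

mats = {"share_records", "protect_patient_data"}
print(permissible_mats({"handles_patient_data"}, mats))                       # sharing excluded
print(permissible_mats({"handles_patient_data", "patient_consented"}, mats))  # consent defeats it
```

The point of the sketch is only that defaults, priorities, and defeat conditions are explicit, queryable objects rather than latent dispositions.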
GRACE decomposes the agent into three internal components, complemented by an external normative authority:
The Moral Module (MM) is the symbolic reasoning core. It interprets observations for moral relevance, applies the current Reason Theory to derive the set of permissible MATs via Horty’s inference mechanism, and provides justifications explaining why those MATs are permissible. It also processes feedback from the Moral Advisor (see below) to refine the Reason Theory over time—adding new defaults, adjusting priority orderings, or introducing previously unrecognized reason types.
The Decision-Making Module (DMM) encapsulates the core AI agent’s instrumental decision-making capabilities—planning, optimization, goal pursuit. The DMM receives the set of permissible MATs from the MM and supports the encapsulated core agent in selecting the instrumentally best primitive action that accords with a permissible MAT. The core agent itself is not modified—it is governed.
The Guard monitors the primitive actions proposed by the DMM and verifies that they accord with a permissible MAT. If a proposed action fails this check, the Guard blocks it and triggers re-decision. The Guard implements exclusionary constraints in the sense of Raz (1990)—second-order reasons that preempt first-order balancing, removing certain actions from the permissible space regardless of their expected utility.
Complementing these internal components, the Moral Advisor (MA) is an external normative authority—a human overseer, expert committee, or domain-specific normative system—that provides case-based feedback on agent behavior. The MA is not part of the agent itself but part of the broader governance arrangement within which the agent operates. When the MA detects impermissible behavior, it provides structured feedback: not just “that was wrong” but “you should have followed MAT φ for reason R.” This feedback drives iterative refinement of the Reason Theory.
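Putting the internal components together, the governance loop can be sketched roughly as follows. The function names and return types are placeholders rather than GRACE's actual interfaces, and a real Guard would inspect the proposed primitive action itself rather than trusting a claimed MAT:

```python
# Hedged sketch of the governor loop around an unmodified core agent. Function names
# and return types are placeholders, not GRACE's actual interfaces.

def mm_derive(observation):
    """Moral Module: apply the Reason Theory; return permissible MATs and justifications."""
    return {"protect_patient_data", "report_harm"}, {"protect_patient_data": "d_privacy fired"}

def core_agent_propose(observation, permissible_mats, justifications):
    """DMM: pass the normative context to the core agent; it returns (action, claimed MAT)."""
    return "encrypt_and_store_record", "protect_patient_data"

def guard_accords(action, claimed_mat, permissible_mats):
    """Guard: verify that the proposed action accords with a permissible MAT."""
    return claimed_mat in permissible_mats  # a real Guard would inspect the action itself

def governed_step(observation, max_retries=3):
    mats, why = mm_derive(observation)
    for _ in range(max_retries):
        action, mat = core_agent_propose(observation, mats, why)
        if guard_accords(action, mat, mats):
            return action
    return None  # no compliant proposal: block and escalate to the Moral Advisor
```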
This architecture was conceived for safe containment and ethical alignment. But when we started thinking about anticipatory human oversight and control of agents and multi-agent organisations, we found that its properties are precisely what cooperative AI needs. We discuss below in more detail how GRACE can be applied to cooperation challenges; the central properties are:
Inspectability. The Reason Theory is a structured, symbolic representation that can be read, audited, and verified by external parties—human overseers, other agents, institutional bodies. This is the normative surface that cooperation requires.
Modularity. Normative constraints can be added to the Reason Theory without necessarily retraining the core agent. Cooperative commitments can be loaded as additional default rules with specified priority relations.
Defeasibility. Constraints are not rigid rules but structured defaults with priority orderings and defeat conditions. This enables sophisticated normative reasoning about when commitments do and do not apply.
Justificatory transparency. The MM doesn’t just filter actions; it produces justifications—structured explanations of why a given action is permissible. This makes cooperative behavior legible to interaction partners.
3. Cooperative Contracts as Reason Modules
The central architectural move is straightforward: if GRACE already has the machinery for structured, inspectable, defeasible normative constraints, and cooperative commitments are structured normative constraints, then cooperative contracts can be implemented as additional default rules within the Reason Theory—or, in a multi-layer extension that allows for reason theories for various normative domains (personal preferences, cultural constraints, legal obligations, etc.), as a dedicated contract reason layer with its own priority relations relative to the base Reason Theory (or to further reason theories for other normative domains).
Consider two agents, A and B, facing a cooperation problem. Each agent’s GRACE MM already contains a Reason Theory encoding its normative constraints. A cooperative contract between A and B amounts to both agents loading a shared set of default rules—encoding the terms of their agreement—into their respective Reason Theories, with agreed-upon priority orderings and defeat conditions, in a way that is tamper-proof. Here is a classical textbook example:
Two agents agree to cooperate on a stag hunt. The contract adds a default rule to each agent’s Reason Theory: δcontract : “cooperation with partner B is in progress” → φpursue-stag. This default has high priority within the game’s strategic context. A hare—a tempting alternative payoff within the same strategic structure—does not defeat this default, because game-internal incentives to defect are precisely what the commitment is designed to overcome.
But the contract also specifies appropriate defeat conditions, for instance safety or emergency considerations: a wolf (physical safety concern triggers a higher-priority safety default), a thunderstorm (environmental emergency), or observing the partner commit violence against a third party (a moral emergency that the MM infers as an exclusionary constraint, which the Guard then enforces) can defeat the cooperative default—because these are considerations at a different level than the strategic incentives the commitment addresses.
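A toy rendering of this structure, with made-up reason and MAT names and numeric priorities standing in for GRACE's partial order:

```python
# Toy encoding of the stag-hunt contract default and its defeat conditions.
# Numeric priorities replace GRACE's partial order purely for brevity.

defaults = {
    "d_contract": ("cooperation_with_B_in_progress", "pursue_stag",       1),
    "d_safety":   ("wolf_nearby",                    "retreat_to_safety", 2),
}
# Note: "hare_spotted" triggers no default. Game-internal temptations are deliberately
# not admitted as reasons, so they cannot defeat the contract default.

def recommended_mat(triggered_reasons):
    active = [(mat, prio) for reason, mat, prio in defaults.values()
              if reason in triggered_reasons]
    return max(active, key=lambda x: x[1])[0] if active else "unconstrained"

assert recommended_mat({"cooperation_with_B_in_progress", "hare_spotted"}) == "pursue_stag"
assert recommended_mat({"cooperation_with_B_in_progress", "wolf_nearby"}) == "retreat_to_safety"
```

Because a hare triggers no recognized reason, it cannot defeat the contract default; the wolf can, because the safety default outranks it.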
This example illustrates the central theoretical contribution: credible commitment through structured defeasibility.
Commitment as Reason Architecture, Not Constraint Removal
The standard account of credible commitment, following Schelling (1960), conceives of it as constraining self-interested future choice by removing or (dis-)incentivizing options—throwing away the steering wheel, or putting your reputation on the line by making a promise. For a credibly committed agent, it is impossible or irrational to defect.
This account is too idealized for the cooperative challenges AI agents face. In real-world cooperation problems, it can still be rational for credibly committed agents to defect, because (dis-)incentivization only generates defeasible reasons—you break a promise in order to save a life. Moreover, the reasons for defection matter for the reputation judgments of other players: it would be irrational to consider an agent less reputable because they broke a promise to save a life. The challenge is: how do you make a commitment credible if everyone knows that it can be overridden?
The answer is that credibility comes not just from the impossibility or (dis-)incentivization of options but also from the structure of what can defeat the commitment. A commitment is credible when its defeat conditions are normatively justified—when every agent in the cooperative arrangement would agree that those conditions warrant override.
This connects to certain versions of contractualism in moral philosophy. A principle is justified, on Scanlon’s (1998) account, if it cannot be reasonably rejected. Applied here: a cooperative contract’s defeat conditions are credible if no reasonable (AI) agent would reject them. A defeat condition that says “you may break the commitment when there is a wolf” is one no reasonable cooperator would reject. A defeat condition that says “you may break it when you see a hare” is one that would be rejected—because it undermines the cooperative structure itself.
This account is made possible by a reason-based architecture. A reward-based system can in principle assign different expected utilities to these two cases. But a purely quantitative signal cannot represent why one constitutes a legitimate override and the other does not—not in a way that is separable from instrumental considerations, inspectable by external parties, or auditable for consistency. In a reward function, normative priority, estimated outcome probability, and instrumental value collapse into a single number; one cannot recover from that number whether the agent broke a commitment because a genuinely overriding consideration applied or because defection happened to score marginally higher. GRACE’s Reason Theory keeps these dimensions apart: the reasons for override are explicit, the priority relations that license defeat are formal, and the justification is available for external scrutiny. Commitment becomes a property of the agent implied by its normative architecture and its content, not an external mechanism imposed upon it.
4. Trust Brokers and Normative Interfaces
If agents can load cooperative contracts into their Reason Theories, who controls this process? The contract layer needs a well-defined write-access interface—a structured protocol specifying who can modify which parts of an agent’s Reason Theory, under what conditions.
GRACE’s architecture already implies something like a trusted authority: the Moral Advisor (MA) serves as external normative authority providing structured feedback that drives Reason Theory refinement. Extending this to inter-agent cooperation requires specifying who plays the MA-like role for a dyadic or multilateral commitment. The answer should be pluralistic and context-dependent:
A human principal or anticipatory human overseer (Baum, Langer et al. 2026) can authorize and inspect cooperative contracts, serving as the trust anchor in high-stakes settings—functioning as a Moral Advisor for the cooperative relationship, not just the individual agent. A dedicated mediator agent—itself governed by a GRACE architecture whose Reason Theory encodes the procedural norms of mediation—can facilitate contract negotiation and verification. An institutional body—a regulatory entity, a governance protocol, or a new kind of AI-agent-based institution—can impose default rules that participating agents must load as compliance layers in their Reason Theories. And in some cases, agents may jointly spawn a mediator: agents A and B, recognizing they face a trust game, create agent C whose GRACE architecture is defined by A and B’s agreed contract terms, with C’s Guard enforcing the contract and C’s MM handling compliance verification.
Architecturally, what matters is that the Reason Theory has a cryptographic normative interface—a structured access-control layer specifying: which external parties can write to which default rules and priority orderings; which modifications require mutual consent; which layers are inherited by spawned sub-agents; and which parts are inspectable by cooperation partners or institutional auditors. (This interface may need to support something analogous to zero-knowledge verification to avoid gaming or perverse incentives: demonstrating that your contract-layer defaults satisfy certain properties without exposing your full Reason Theory.)
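Setting the cryptographic machinery aside, a first approximation of such an interface is a layered access-control policy over the Reason Theory. The layer names, the mutual-consent rule, and the auditor role below are assumptions for illustration only:

```python
# Sketch of a write-access interface over a layered Reason Theory.
# Layer names, parties, and the consent rule are illustrative assumptions.

WRITE_POLICY = {
    "base":        {"moral_advisor"},        # only the Moral Advisor refines the base theory
    "compliance":  {"regulator"},            # institutional layer
    "contract_AB": {"agent_A", "agent_B"},   # every listed party must consent to changes
}
READ_POLICY = {
    "contract_AB": {"agent_A", "agent_B", "auditor"},
}

def may_write(layer: str, consenting_parties: set[str]) -> bool:
    required = WRITE_POLICY.get(layer)
    return required is not None and required <= consenting_parties

def may_read(layer: str, party: str) -> bool:
    return party in READ_POLICY.get(layer, set())

assert may_write("contract_AB", {"agent_A", "agent_B"})   # mutual consent suffices
assert not may_write("contract_AB", {"agent_A"})          # unilateral modification blocked
assert may_read("contract_AB", "auditor")                 # inspectable by an auditor
```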
5. The DMM Extension: Strategic Awareness and Normative Reflection
For LLM-based agents, the DMM—which encapsulates the core agent—can enrich the core agent's cognitive architecture with additional resources that enable normative reflection. This means making the MM’s outputs available as inputs to the agent’s own reasoning, creating a feedback loop between containment structure and deliberation.
Two capabilities emerge from this extension:
Introspective access to the containment structure. The DMM can take the current Reason Theory, the context-sensitively derived set of permissible MATs, and the corresponding justifications into account when selecting actions. This means the agent doesn’t merely comply with cooperative constraints—it can reason about them, understanding why certain MATs are permitted or excluded and how its cooperative commitments relate to its broader normative landscape.
Strategic recognition and cooperative initiative. When an LLM-based core agent encounters a situation with the structure of a coordination problem, a trust game, or a social dilemma, it could (and often already can) recognize this—and the DMM can translate that recognition into governance action: escalating to the Moral Advisor (“I face a cooperation problem that my current Reason Theory doesn’t address”), seeking existing institutional infrastructure, or initiating contract negotiations with other agents. Agents might even jointly spawn a mediator agent to facilitate agreement.
While the second capability is more speculative than the first, both are essential steps toward what has elsewhere been called normative competence (cf. MINTLab), as they point toward a crucial kind of higher-order cooperative competence grounded in practical reasoning: the ability to recognize that cooperation is needed and to seek out the normative infrastructure that makes it possible. A rough sketch of what this extension could look like follows below.
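The sketch is admittedly speculative: every name here (core_agent, moral_advisor.escalate, the dictionary keys) is hypothetical, and the escalation logic is only one possible design.

```python
# Speculative sketch of the DMM extension for an LLM-based core agent.
# All names and interfaces are hypothetical placeholders.

def build_normative_context(layers, permissible_mats, justifications):
    """Render the MM's outputs so the core agent can reason about its own containment."""
    return ("Active reason-theory layers: " + ", ".join(layers) + "\n"
            "Permissible MATs: " + ", ".join(sorted(permissible_mats)) + "\n"
            "Justifications: " + "; ".join(f"{m}: {j}" for m, j in justifications.items()))

def dmm_step(observation, layers, permissible_mats, justifications, core_agent, moral_advisor):
    context = build_normative_context(layers, permissible_mats, justifications)
    proposal = core_agent(observation, context)  # e.g. an LLM call with the enriched prompt
    # Strategic recognition: escalate cooperation problems the Reason Theory does not yet cover.
    if proposal.get("cooperation_problem") and not proposal.get("covered_by_contract"):
        moral_advisor.escalate("Cooperation problem not addressed by current Reason Theory",
                               details=proposal)
    return proposal.get("action")
```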
6. Three Layers of Cooperative Governance
The GRACE-based approach to cooperative AI operates at three layers, each corresponding to a different governance challenge:
Layer 1: Institutional Compliance
An institution—e.g., regulator, oversight body, standards organization—writes normative constraints as default rules with priority orderings; these are loaded into the agent’s Reason Theory; the agent complies and can demonstrate compliance through the MM’s justificatory output. This connects directly to the EU AI Act’s requirements for human oversight and transparency. GRACE’s existing architecture already supports this layer.
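A minimal sketch of what loading such a compliance layer and producing justificatory output could look like; the issuer, the default rule, and the report format are illustrative assumptions:

```python
# Sketch of Layer 1: an institutional compliance layer and the corresponding
# justificatory output. Issuer, default rule, and report format are illustrative.

compliance_layer = {
    "issuer": "oversight_body",
    "defaults": {
        "d_oversight": ("high_risk_decision_pending", "defer_to_human_overseer"),
    },
}

def compliance_report(triggered_reasons: set[str], chosen_mat: str) -> str:
    """Demonstrate compliance by citing which institutional default fired and was followed."""
    for name, (reason, mat) in compliance_layer["defaults"].items():
        if reason in triggered_reasons and chosen_mat == mat:
            return f"{name} (issued by {compliance_layer['issuer']}) fired on '{reason}'; agent followed MAT '{mat}'."
    return "No compliance default applicable to this decision."

print(compliance_report({"high_risk_decision_pending"}, "defer_to_human_overseer"))
```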
Layer 2: Bilateral and Multilateral Contracts
Two or more agents negotiate cooperative commitments and each loads shared default rules into a dedicated contract layer of their Reason Theory, with restricted modification rights enforced through cryptographic normative interfaces. The key research question: what formal properties must the contract defaults have to make commitments credible (through structured defeasibility) while preserving the context-sensitivity that makes GRACE responsive? How should write-access be controlled and by whom? What role do trust brokers, mediator agents, and human overseers play?
Layer 3: Emergent Institutional Design
Agents collectively create governance structures—shared norms, coordination protocols, enforcement mechanisms—and load shared normative constraints. This connects to the multi-layer normative architecture (MLNA), where different layers (legal, organizational, team-level, task-specific) may correspond to different levels of governance with clear priority relations between them. Contracts at this level can be inherited by agents spawned by participating agents, propagating cooperative structures through agent hierarchies. This layer represents a research horizon.
7. Differential Progress: Why This Architecture Is Cooperation-Biased
Any advance in cooperative capabilities risks being dual-use. The ability to understand others aids both cooperation and exploitation; commitment devices enable both promises and threats. Dafoe et al. (2020) call for differential progress: advancing cooperative capabilities relative to coercive ones.
GRACE is structurally cooperation-biased. Its constraint architecture is asymmetric: the MM constrains and steers the core agent from outside, while the Guard enforces compliance by verifying that proposed actions accord with permissible MATs. A separate challenge is ensuring that the core agent or DMM cannot subvert this arrangement by tampering with the MM itself—manipulating the Reason Theory or its inference process to make impermissible actions appear permissible. For current and near-future AI agents, this integrity property is manageable through architectural isolation, whether enforced by the Guard, by additional protective components, or by the system design itself; whether it remains robust as agent capabilities scale is important open work. Its justificatory transparency makes collusion harder: an agent governed by GRACE has legible normative commitments via the MM’s justification output, making secret cooperation against third-party interests more difficult to sustain without detection. And its defeasible logic enables the kind of structured accountability that institutions require: not merely monitoring behavior, but auditing the reasons behind behavior—which defaults fired, which priorities resolved the conflict, and whether the justification holds up to external scrutiny.
8. Research Agenda
Contract Reason Modules
Formal specification of contract layers within the Reason Theory framework. What are the well-formedness conditions for contract defaults? How do contract defaults interact with existing defaults and priority orderings? When does a contract default inherit, override, or complement the agent’s base Reason Theory?
Credible Defeasibility
Formal characterization of which defeat conditions preserve commitment credibility. Connection to Scanlon’s contractualism and Levine’s Resource Rational Contractualism (RRC): can a contractualist test determine the legitimate defeat conditions of cooperative contracts? RRC may serve as the operative protocol by which a Moral Advisor—now functioning as a cooperative trust broker—adjudicates contract disputes.
Cooperative Epistemic Grounding
In multi-agent settings, the grounding of reasons—determining whether a morally relevant fact obtains—can itself be a cooperative task. Agents may combine sensor data, contextual knowledge, and inferential capabilities to more reliably determine whether a reason type is triggered or whether a proposed action satisfies a permissible MAT. This distributes the epistemic burden of normative assessment across agents, raising questions about how to structure shared grounding protocols and how to handle disagreement about whether a reason applies.
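One possible shape for a shared grounding protocol is a simple quorum rule over agents' individual judgments; the threshold and the aggregation method below are assumptions, not a GRACE design decision:

```python
# Sketch of a shared grounding protocol: several agents vote on whether a reason type
# is triggered; a quorum rule settles disagreement. Threshold and aggregation are assumptions.

def reason_grounded(reports: dict[str, bool], quorum: float = 2 / 3) -> bool:
    """reports: agent_id -> that agent's judgment of whether the morally relevant fact obtains."""
    if not reports:
        return False
    return sum(reports.values()) / len(reports) >= quorum

assert reason_grounded({"A": True, "B": True, "C": False})        # two of three suffice
assert not reason_grounded({"A": True, "B": False, "C": False})   # no quorum, reason not grounded
```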
Imperfect Duties and Collective Discharge
Some normative defaults—particularly those corresponding to imperfect duties like rendering assistance—apply to all agents but often need only be discharged by one (or a few). Without coordination, all agents may simultaneously abandon their current tasks to respond, producing inefficient swarming. This requires mechanisms for negotiating which agent discharges a shared duty: a cooperation problem within the normative architecture itself, connecting GRACE's reason-based framework to classical collective action theory and duty allocation in multi-agent systems.
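As one illustration (not part of GRACE), a shared duty could be allocated by a single-round lowest-cost volunteer rule:

```python
# One possible coordination mechanism for discharging a shared imperfect duty:
# a single-round lowest-cost "volunteer" rule (illustrative, not part of GRACE).

def allocate_duty(bids: dict[str, float]) -> str:
    """bids: agent_id -> self-reported cost of interrupting its current task to help.
    The lowest-cost agent discharges the duty; ties are broken by agent id."""
    return min(bids, key=lambda a: (bids[a], a))

bids = {"agent_A": 3.2, "agent_B": 0.7, "agent_C": 5.0}
assert allocate_duty(bids) == "agent_B"
# The other agents' assistance defaults can then be treated as defeated by a reason
# such as "duty_covered_by(agent_B)", so they stay on task instead of swarming.
```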
Cryptographic Normative Interfaces
Access-control protocols for Reason Theories. Zero-knowledge-style verification that an agent’s contract-layer defaults satisfy certain properties without exposing the full Reason Theory. Inheritance protocols for propagating contract defaults to spawned sub-agents.
Trust Broker Architectures
Design of mediator agents and institutional bodies that facilitate contract negotiation and verification. What GRACE architecture should a mediator have? How does a trust broker—functioning as a shared Moral Advisor—verify compliance without full access to each agent’s Reason Theory?
DMM Extension for Strategic Awareness
Enabling LLM-based core agents to reason about their own containment structure (Reason Theory, permissible MATs, justifications) and recognize strategic structures in their environment. Developing protocols for agents to escalate cooperation needs to Moral Advisors, seek institutional infrastructure, or initiate contract negotiations.
Normative Inheritance in Agent Hierarchies
When agent A spawns agent B, which defaults and priority orderings from A’s Reason Theory propagate to B? How do cooperative contracts transfer across agent hierarchies? Connection to organizational theory and delegation.
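A minimal sketch of one possible inheritance rule; the per-layer "inheritable" flag is an assumed annotation, and deciding which layers should carry it is exactly the open question:

```python
# Sketch of normative inheritance on spawning. The per-layer "inheritable" flag is an
# assumed annotation, not something the current formalism specifies.

parent_layers = {
    "base":        {"inheritable": True,  "defaults": ["d_harm", "d_privacy"]},
    "compliance":  {"inheritable": True,  "defaults": ["d_oversight"]},
    "contract_AB": {"inheritable": False, "defaults": ["d_contract_stag"]},  # dyad-specific
    "task_local":  {"inheritable": False, "defaults": ["d_current_task"]},
}

def spawn_reason_theory(parent: dict) -> dict:
    """A spawned sub-agent inherits exactly the layers marked inheritable."""
    return {name: layer for name, layer in parent.items() if layer["inheritable"]}

assert set(spawn_reason_theory(parent_layers)) == {"base", "compliance"}
```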
9. Connections and Collaborators
This research program builds on and connects to several existing threads, some of which are worth highlighting here. The GRACE architecture (Jahn et al. 2026) and the normative world models framework (Baum 2026a) provide the architectural and philosophical foundations. The work on anticipatory human oversight (Sterz, Baum et al. 2024; Langer, Baum & Schlicker 2024) provides the governance framework. The “Filling Normative Modules” pipeline connects to Sydney Levine’s Resource Rational Contractualism (Levine et al. 2025)—offering a mechanism for determining contract content and legitimate defeat conditions through simulated contractualist agreement.
This essay outlines a research direction. It is not a report on completed work. I am actively seeking collaborators and funding partners who share the conviction that cooperative AI needs normative architecture, not just game-theoretic equilibria. If this resonates, I’d welcome a conversation: academia@kevinbaum.de
References
Axelrod, R. (1984). The Evolution of Cooperation. Basic Books.
Baum, K. (2026a). The Need for Normative World Models for Real Alignment. Essay, work in progress. kevinbaum.de/vision
Baum, K., Bergs, R., Hermanns, H., Kerstan, S., Langer, M., Lauber-Rönsberg, A., Meinel, P., Stenzel, L., Sterz, S., & Zhang, H. (2026). The Principal’s Principles: Actionable (Personalized) AI Alignment as Underexplored XAI Application Context. Proceedings of the 3rd TRR 318 Conference: Contextualizing Explanations (ContEx25), Bielefeld, Germany. [Extended abstract.] doi:10.64136/jfoo1236
Dafoe, A., Hughes, E., Bachrach, Y., Collins, T., McKee, K. R., Leibo, J. Z., Larson, K., & Graepel, T. (2020). Open Problems in Cooperative AI. arXiv:2012.08630. arxiv.org
Fearon, J. D. (1995). Rationalist explanations for war. International Organization, 49(3), 379–414. doi:10.1017/S0020818300033324
Fearon, J. D. (2020). Two kinds of cooperative AI challenges: Game play and game design. Cooperative AI Workshop, NeurIPS 2020. slideslive.com
Horty, J. (2012). Reasons as Defaults. Oxford University Press. doi:10.1093/acprof:oso/9780199744077.001.0001
Jahn, F., Muskalla, Y., Dargasz, L., Schramowski, P., & Baum, K. (2026). Breaking Up with Normatively Monolithic Agency with GRACE. IASEAI 2026. arXiv:2601.10520
Langer, M., Baum, K., & Schlicker, N. (2024). Effective Human Oversight of AI-Based Systems: A Signal Detection Perspective. Minds and Machines, 35(1), 1–30. doi:10.1007/s11023-024-09701-0
Levine, S., Franklin, M., Zhi-Xuan, T., Yanik Guyot, S., Wong, L., Kilov, D., Choi, Y., Tenenbaum, J. B., Goodman, N., Lazar, S., & Gabriel, I. (2025). Resource Rational Contractualism Should Guide AI Alignment. arXiv:2506.17434
Raz, J. (1990). Practical Reason and Norms (2nd ed.). Oxford University Press.
Scanlon, T. M. (1998). What We Owe to Each Other. Harvard University Press.
Schelling, T. C. (1960). The Strategy of Conflict. Harvard University Press.
Sterz, S., Baum, K., et al. (2024). On the Quest for Effectiveness in Human Oversight: Interdisciplinary Perspectives. FAccT 2024. doi:10.1145/3630106.3659051