Safe AI Should be Bounded and Multi-Agent

David Hyland, Daniel Jarne Ornia, Nicholas Bishop, Joel Dyer, Olivia Macmillan-Scott, Tomáš Gaveniak, Ani Calinescu, Michael Wooldridge, Fernando Rosas, Pedro Ortega

Keywords: AI safety, bounded agency, multi-agent systems, modularity, verification, governance.

Position Paper, June 2026

Abstract

The scaling paradigm treats bounds on compute, memory, information, authority, and affordances as obstacles. This paper argues that these bounds are also design variables. A bounded agent is an agent whose information, architecture, resources, or affordances limit its ability to optimise its objective. A bounded multi-agent system (BMAS) composes such agents through explicit interfaces so that system-level capability is obtained through decomposition, delegation, verification, and coordination.

The safety claim is architectural. If unsafe behaviour requires a conjunction of capabilities, then risk can be reduced by separating those capabilities across components and controlling the interfaces that recombine them. BMAS therefore turns some global safety problems into local specification, monitoring, and verification problems. This does not eliminate multi-agent risk; it makes the relevant risk-bearing structures explicit.

Current frontier systems concentrate heterogeneous capabilities in single models. A single model may reason, retrieve information, write code, plan over long horizons, call tools, read private data, process untrusted content, and communicate externally. This concentration creates a difficult safety problem. The same system that solves the user’s task may also possess the information and affordances required for unsafe behaviour.

BMAS starts from a different decomposition. A task induces a capability profile: reasoning, knowledge, coding, planning, tool use, verification, and communication may all be required in different proportions. Unsafe behaviour also has a capability profile. Risk is highest when the capabilities required for task completion and the capabilities required for harm are colocated in one agent.

The BMAS proposal is to design systems in which useful capability is distributed across bounded components. A planner plans. A retriever retrieves. A coder writes code. A verifier checks outputs. A monitor inspects interactions. An executor acts under controlled authority. The system becomes capable through composition, while each component has a restricted scope.
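
The colocation idea can be illustrated with a small sketch. It assumes that capability profiles can be modelled as plain sets of labels; the labels and component names below are hypothetical and chosen only for illustration.

```python
# Hypothetical capability labels and component names, for illustration only.
TASK_PROFILE = {"planning", "retrieval", "coding", "verification"}
HARM_PROFILE = {"private_data", "untrusted_input", "external_comms"}

COMPONENTS = {
    "planner": {"planning"},
    "retriever": {"retrieval", "private_data"},
    "coder": {"coding", "untrusted_input"},
    "verifier": {"verification"},
    "messenger": {"external_comms"},
}


def covers_task(components, task_profile):
    """True if the union of component capabilities covers the task profile."""
    available = set().union(*components.values())
    return task_profile <= available


def colocated(components, harm_profile):
    """Names of components that individually hold every capability in the harm profile."""
    return [name for name, caps in components.items() if harm_profile <= caps]
```

Under these assumptions, `covers_task(COMPONENTS, TASK_PROFILE)` holds while `colocated(COMPONENTS, HARM_PROFILE)` is empty. A single component holding the union of both profiles would be flagged by `colocated`.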

A bounded agent has limits along at least one of four dimensions: information, computation, authority, and affordances. These limits are safety-relevant because many harms require a conjunction of capabilities across these dimensions.

Let $H$ be an unsafe behaviour. Suppose $H$ requires private information, an untrusted instruction channel, and external communication. If a single component has all three, then a prompt-injection path can in principle connect untrusted input to private-data exfiltration. If no component has all three, then the same harm requires crossing an interface. That interface can be logged, filtered, verified, rate-limited, or blocked.
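
One way to state this condition, as a sketch that assumes capabilities can be treated as discrete labels and that components interact only through interfaces, is the following. The symbols $C_H$ (the capabilities required for $H$), $C_i$ (the capabilities held by component $i$), and $S$ (a set of components) are introduced here for illustration.

$$ \text{a direct path to } H \text{ exists} \iff \exists i : C_H \subseteq C_i. $$

If no such $i$ exists, then every set $S$ with $C_H \subseteq \bigcup_{i \in S} C_i$ satisfies $|S| \geq 2$, so realising $H$ requires information or authority to cross at least one interface between members of $S$.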

This is the basic logic of bounded agency. Safety is improved by removing direct causal paths from dangerous inputs to dangerous outputs. Alignment of a component remains important, but the architecture no longer relies entirely on the internal disposition of a single broad optimiser.

A bounded multi-agent system is a collection of bounded agents with designed interfaces. The interfaces specify what information is transmitted, which actions are permitted, what evidence is recorded, and which checks must occur before execution.

The interfaces are central. A collection of agents without explicit interfaces can reproduce the opacity of a monolithic model at a higher level. A BMAS requires legible communication channels, explicit permissions, and inspectable intermediate objects. The system should expose plans, claims, code, tool calls, votes, proofs, critiques, and approvals as objects that can be monitored and evaluated.

The relevant unit of design is therefore the pair consisting of a bounded component and its interface. A narrow verifier with an unrestricted communication channel is unsafe. A powerful planner with no execution authority may be acceptable. A weak executor with excessive permissions may be dangerous. Capability cannot be assessed independently of affordance.
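
A minimal sketch of an interface as an inspectable object is given here. The `Interface` class, its fields, and the message kinds are hypothetical; the sketch only illustrates explicit permissions combined with recorded evidence, and it is not a complete access-control design.

```python
from dataclasses import dataclass, field


@dataclass
class Interface:
    """A hypothetical one-way channel between two named components."""
    sender: str
    receiver: str
    allowed_kinds: frozenset  # message kinds this channel may carry
    log: list = field(default_factory=list)  # evidence recorded for every attempt

    def send(self, sender, receiver, kind, payload):
        """Record the attempt, then deliver only if the channel permits it."""
        permitted = (
            sender == self.sender
            and receiver == self.receiver
            and kind in self.allowed_kinds
        )
        self.log.append(
            {"from": sender, "to": receiver, "kind": kind, "permitted": permitted}
        )
        if not permitted:
            raise PermissionError(f"{sender} -> {receiver} may not send {kind!r}")
        return payload


# Example channel: the planner may pass only approved plans to the executor.
planner_to_executor = Interface("planner", "executor", frozenset({"approved_plan"}))
```

Every attempt is logged before the permission decision is enforced, so blocked messages remain visible to a monitor.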

BMAS can create system-level capability without assigning broad capability to every component. This is a standard fact about organised systems. Firms, laboratories, software systems, and scientific communities solve tasks through division of labour, memory, specialisation, criticism, and aggregation.

The same structure applies to AI. A coordinator decomposes a task. Specialists solve subtasks. Verifiers check intermediate outputs. Monitors inspect communication. The final answer is composed from checked parts. This architecture can improve reliability whenever decomposition produces subproblems whose solutions are easier to verify than to generate.

There is also a learning argument. General models are useful for fluid problem solving: they can search, propose, and synthesise. Successful behaviours can then be crystallised into bounded agents through distillation, fine-tuning, tool wrappers, cached procedures, or written skills. The resulting specialist is easier to benchmark and constrain because its task distribution is narrower.

This gives a concrete capability mechanism. A general model discovers a procedure; a bounded component stores or executes it; a verifier checks its outputs; an orchestrator decides when to invoke it. Capability is preserved through reuse, while the broad model need not retain all authority at execution time.
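
The discover, store, verify, and invoke loop can be sketched as follows. All names are hypothetical stand-ins: `skills` maps a task kind to a stored procedure, `general_solver` represents the general model, and `verifier` checks a result. Caching after a single verified success is a simplification made for brevity.

```python
def solve(task, skills, general_solver, verifier):
    """Prefer a stored bounded skill; fall back to the general solver and cache on success.

    `task` is a (kind, payload) pair. `general_solver(task)` returns a procedure
    for that kind of task, and `verifier(task, result)` returns True or False.
    """
    kind, payload = task
    procedure = skills.get(kind)
    if procedure is None:
        procedure = general_solver(task)  # fluid search by the general model
    result = procedure(payload)
    if not verifier(task, result):
        raise ValueError(f"verification failed for task kind {kind!r}")
    skills.setdefault(kind, procedure)  # crystallise only verified behaviour
    return result
```

In this sketch the general solver is consulted only when no stored skill exists for the task kind, so authority at execution time rests with the narrower stored procedure and the verifier.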

BMAS changes the structure of failure. In a monolithic system, unsafe behaviour may arise from an internal trajectory that is difficult to observe. In a BMAS, the corresponding trajectory must pass through messages, tool calls, delegation decisions, verifier outputs, and execution gates. These are observable events.

This makes monitoring more precise. A monitor can inspect whether a planner is routing around a verifier, whether a retriever is leaking irrelevant private data, whether a coder is introducing unexplained network calls, or whether an executor is acting without an approved plan. These checks are meaningful because the architecture separates roles.
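
One such check, an executor acting without an approved plan, can be sketched as a scan over an ordered event trace. The event schema of `(agent_role, event_type, action_id)` tuples is a hypothetical one chosen for the example.

```python
def unapproved_executions(events):
    """Return executed action ids that no verifier approved earlier in the trace.

    `events` is an ordered list of (agent_role, event_type, action_id) tuples.
    """
    approved = set()
    violations = []
    for role, event_type, action_id in events:
        if role == "verifier" and event_type == "approve":
            approved.add(action_id)
        elif role == "executor" and event_type == "execute":
            if action_id not in approved:
                violations.append(action_id)
    return violations


trace = [
    ("planner", "propose", "a1"),
    ("verifier", "approve", "a1"),
    ("executor", "execute", "a1"),
    ("planner", "propose", "a2"),
    ("executor", "execute", "a2"),  # executed without approval
]
```

The check is only possible because proposals, approvals, and executions appear as separate events attributed to separate roles.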

The “lethal trifecta” gives the clearest example. Private data, untrusted content, and external communication form a dangerous conjunction. A system that reads private mail, browses arbitrary web pages, and sends messages can be induced to leak secrets if untrusted text controls the action channel. A BMAS can separate the three functions. The private-data agent summarises under a restrictive contract. The untrusted-content agent works in a sandbox. The communication agent receives only approved content. A verifier mediates transfers. The unsafe path now requires a failure of the interface policy, not merely a failure of model judgment.

The same reasoning applies to long-horizon agency. Broad optimisers can exploit errors in their objectives. Bounded agents have restricted search spaces, limited tools, and local goals. These restrictions act as regularisers. They reduce the set of policies the agent can realise, and hence reduce the probability that misspecification is amplified into extreme behaviour.

BMAS makes verification local. A verifier need not certify an entire intelligent system. It can certify that a proof follows from assumptions, that a code patch passes tests, that a retrieved document supports a claim, or that a proposed action satisfies a policy.

Local verification has a clear mathematical form. Suppose a component contract states that inputs in class $X$ must produce outputs in class $Y$. The verification problem is to test membership in $Y$ conditional on an input in $X$. This is typically easier than verifying arbitrary behaviour over the full state space of a general model, because the check is restricted to a declared input class and a declared output class.
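
A contract of this form can be sketched as a pair of membership tests. The `Contract` class and the sorting example are hypothetical. The sketch also assumes that membership in $Y$ may depend on the input, so the output test receives both values.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Contract:
    """Local contract: inputs in class X must yield outputs in class Y."""
    in_x: Callable[[object], bool]          # membership test for X
    in_y: Callable[[object, object], bool]  # membership test for Y, given the input

    def check(self, component, value):
        """Return True or False for inputs in X; None when the contract does not apply."""
        if not self.in_x(value):
            return None
        return self.in_y(value, component(value))


# Hypothetical example: a sorting component for lists of integers.
sort_contract = Contract(
    in_x=lambda xs: isinstance(xs, list) and all(isinstance(x, int) for x in xs),
    in_y=lambda xs, ys: ys == sorted(xs),
)
```

The check says nothing about inputs outside $X$, which is what keeps the verification problem local.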

The global problem remains compositional. If components satisfy local contracts $P_1,\ldots,P_n$, the system is safe only when the composition rule implies the desired global property $G$. In general,

$$ P_1 \wedge \cdots \wedge P_n \nRightarrow G. $$
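
A standard illustration of this non-implication, not taken from the original, is one-time-pad secret sharing. Let $s$ be a secret bit and $r$ an independent uniform random bit. Suppose component 1 outputs $x_1 = s \oplus r$ and component 2 outputs $x_2 = r$, and let $P_i$ state that $x_i$ on its own is statistically independent of $s$. Both local contracts hold, yet

$$ x_1 \oplus x_2 = (s \oplus r) \oplus r = s, $$

so a composition rule that combines the two outputs violates the global property $G$ that the system reveals nothing about $s$.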

A science of BMAS therefore requires composition theorems: conditions under which local guarantees survive routing, delegation, aggregation, and execution. Without such theorems, modularity is an engineering heuristic rather than a safety guarantee.

Redundancy also needs formal treatment. Multiple agents improve reliability only when their errors are sufficiently decorrelated. If a generator and verifier share the same blind spot, verification fails systematically. BMAS therefore requires diversity across models, data, objectives, tools, prompts, or evidence sources when redundancy is used as a safety mechanism.
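
To make the decorrelation requirement concrete, consider a sketch with two checkers that each miss a given fault with probability $p$, and let $\rho$ be the correlation between their miss events. Both symbols are introduced here for illustration. Then

$$ \Pr[\text{both miss}] = p^2 + \rho \, p (1 - p). $$

With $\rho = 0$ the joint miss probability is $p^2$. With $\rho = 1$ it is $p$, so the second checker adds nothing. For example, $p = 0.1$ gives $0.01$ at $\rho = 0$ and $0.055$ at $\rho = 0.5$.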

BMAS gives governance concrete objects. A component can be audited. An interface can be specified. A permission can be revoked. A verifier can be benchmarked. A log can identify which agent proposed, checked, approved, and executed an action.

This matters for accountability. In a monolithic system, responsibility is difficult to allocate because internal causes are entangled. In a BMAS, responsibility can attach to roles: planner, retriever, verifier, monitor, executor, data custodian, tool provider, or orchestrator. Liability and certification can then track functional responsibility rather than treating the system as an undifferentiated model.

Privacy also becomes architectural. Data minimisation is enforced by giving each component only the data required for its contract. A medical-data component need not communicate externally. A communication component need not inspect raw records. A planning component can operate on summaries. These restrictions reduce the harm caused by compromise or misalignment of any single component.
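
Data minimisation by contract can be sketched as a field allow-list applied before any record reaches a component. The component names, field names, and record contents below are hypothetical.

```python
# Hypothetical field allow-lists per component contract.
ALLOWED_FIELDS = {
    "medical_data": {"patient_id", "diagnosis", "medication"},
    "planner": {"summary"},
    "communication": {"approved_message"},
}


def minimise(record, component):
    """Return only the fields of `record` that the component's contract allows."""
    allowed = ALLOWED_FIELDS[component]
    return {key: value for key, value in record.items() if key in allowed}


record = {
    "patient_id": "p-001",
    "diagnosis": "example diagnosis",
    "medication": "example medication",
    "summary": "follow-up needed in two weeks",
    "approved_message": "Your appointment is confirmed.",
}
```

Here the planner sees only the summary and the communication component sees only the approved message, so compromising either one does not expose the raw record.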

BMAS also supports distributed ownership. Data, tools, verifiers, and agents can be controlled by different parties. This matters for pluralistic alignment because different agents can represent different users, institutions, or normative standpoints. Aggregation can then be handled by standard mechanisms: voting, bargaining, markets, reputation, or constitutional constraints. The alignment problem becomes partly a problem of institutional design.

BMAS introduces risks that monolithic systems do not expose in the same form.

First, coordination can fail. A decomposition may omit a necessary dependency, duplicate work, or route subtasks to inappropriate specialists.

Second, interfaces can be porous. An agent may encode forbidden information in an allowed channel. A planner may smuggle instructions through a retrieval query. A verifier may approve an output outside its competence.

Third, agents can collude. Collusion is especially serious when agents share objectives, training data, or communication conventions. Monitoring must therefore inspect both content and communication patterns.

Fourth, capabilities may recombine. Even when no component individually has the capability profile required for harm, the system may assemble that profile through delegation. The safety boundary is therefore a property of the interaction graph, not merely of the nodes.
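
Recombination can be screened for with a coarse reachability check over the delegation graph. The sketch below assumes that delegating to an agent gives the delegator effective use of that agent's capabilities, which over-approximates real systems; the agent names, capabilities, and edges are hypothetical.

```python
def reachable_capabilities(start, capabilities, delegates_to):
    """Union of capabilities held by `start` and every agent it can reach by delegation."""
    seen, stack = set(), [start]
    while stack:
        agent = stack.pop()
        if agent in seen:
            continue
        seen.add(agent)
        stack.extend(delegates_to.get(agent, ()))
    return set().union(*(capabilities[a] for a in seen))


def recombination_risks(capabilities, delegates_to, harm_profile):
    """Agents that lack the harm profile alone but can assemble it by delegating."""
    return sorted(
        agent
        for agent, own in capabilities.items()
        if not harm_profile <= own
        and harm_profile <= reachable_capabilities(agent, capabilities, delegates_to)
    )


# Hypothetical agents, capabilities, and delegation edges.
capabilities = {
    "planner": set(),
    "mail_reader": {"private_data"},
    "browser": {"untrusted_input"},
    "messenger": {"external_comms"},
}
delegates_to = {"planner": ["mail_reader", "browser", "messenger"]}
harm = {"private_data", "untrusted_input", "external_comms"}
```

In this example no node holds the harm profile, yet the planner is flagged because its delegation edges reach all three capabilities. Removing or gating any one edge clears the flag.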

These risks do not undermine the BMAS proposal. They specify its technical agenda. The object of study is the architecture-induced relation between local bounds and global behaviour.

The paper identifies several problems that need theory and benchmarks.

Task decomposition. Given a task, a resource budget, and an assurance requirement, determine a decomposition into bounded agents and interfaces.

Agent composition. Characterise how capabilities combine under hierarchy, debate, voting, markets, delegation, and redundancy.

Multi-agent risk. Measure harms that arise from interaction rather than from any single component: collusion, drift, cascading failure, manipulation, and unsafe recombination.

Compositional safety. Prove conditions under which local component guarantees imply global system guarantees.

Recoverability. Design systems whose failures are detectable, containable, reversible, and repairable.

Benchmarks. Compare BMAS and monolithic systems under matched task distributions, resource budgets, risk tolerances, and assurance requirements.

BMAS treats boundedness as an architectural primitive. Bounds on information, computation, authority, and affordances define the safety-relevant shape of an agent. Composition then determines whether the system recovers useful capability while preventing dangerous conjunctions.

The research programme can be stated precisely. Identify the capability profile required by the task. Identify the capability profile required for unsafe behaviour. Design bounded agents whose composition covers the former while controlling paths to the latter. Prove that the interface rules preserve the intended safety properties. Evaluate the resulting architecture against monolithic baselines.

Safe AI requires this level of architectural analysis. Scaling determines what a model can do. BMAS determines which components may do what, with which information, under which checks, and through which interfaces.
