The question that arrives after an organisation decides to build multi-agent systems with a pro-code approach is always the same: which framework should we use? AutoGen, LangGraph, CrewAI, the Claude Agent SDK — the landscape has four serious contenders, each with different design philosophies, different strengths, and vocal advocates who insist their choice is the right one. The framework comparison articles are abundant. Most of them are wrong — not about the frameworks, but about what matters.

Framework choice is a second-order decision. It determines the programming model and the abstractions you work with. It does not determine whether your multi-agent system creates enterprise value. Five architectural decisions determine that: how your agents communicate and coordinate (orchestration design), how they accumulate and share knowledge (memory architecture), how they are governed and constrained (governance layer), how they select and route between models (model routing), and how you monitor and debug them in production (observability). Get these five right, and any serious framework will execute. Get them wrong, and no framework will save you.

This is not a claim that frameworks are interchangeable. They are not — and this article covers the meaningful differences. But the architecture sits above the framework, and the architecture is where multi-agent systems succeed or fail at enterprise scale.

Orchestration design: how agents coordinate

Orchestration design is the decision about how agents communicate, share state, and resolve conflicts. Three patterns dominate enterprise multi-agent systems, and the right choice depends on the workflow being automated, not on framework preference.

Hub-and-spoke orchestration puts a central orchestrator agent in front of every request: it receives the request, interprets intent, and routes to specialised sub-agents. The orchestrator maintains the master context, and sub-agents report their results back to it. This is the pattern that Copilot Studio implements natively, and it is the right pattern for customer service routing (one orchestrator, five to ten specialised agents), document processing pipelines (one router, specialised extractors), and any workflow where the coordination logic is relatively simple and the specialisation lies in the execution, not the coordination.

The limitation of hub-and-spoke is that the orchestrator becomes a bottleneck — every interaction passes through it, it needs to understand the capabilities of every sub-agent, and its context window must accommodate the full conversation history. As the number of sub-agents grows beyond fifteen to twenty, the orchestrator's ability to route accurately degrades. More importantly, sub-agents cannot communicate directly with each other, which prevents the collaborative patterns that create the most value.
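
As a rough illustration of the hub-and-spoke pattern, the sketch below uses naive keyword matching as a stand-in for the orchestrator's intent classification. The agent functions, the `Orchestrator` class, and the routing keys are all hypothetical names chosen for this example, not part of any framework.

```python
from typing import Callable, Dict, List

# Hypothetical sub-agents: each takes the request text and returns a result.
def billing_agent(request: str) -> str:
    return f"billing handled: {request}"

def tech_support_agent(request: str) -> str:
    return f"support handled: {request}"

def fallback_agent(request: str) -> str:
    return f"escalated to human: {request}"

class Orchestrator:
    """Central hub: owns the master context and routes every request."""

    def __init__(self, routes: Dict[str, Callable[[str], str]],
                 default: Callable[[str], str]) -> None:
        self.routes = routes
        self.default = default
        self.history: List[dict] = []  # master context lives only in the hub

    def classify_intent(self, request: str) -> str:
        # Stand-in for an LLM call: naive keyword matching.
        text = request.lower()
        if "invoice" in text or "refund" in text:
            return "billing"
        if "error" in text or "crash" in text:
            return "support"
        return "unknown"

    def handle(self, request: str) -> str:
        intent = self.classify_intent(request)
        agent = self.routes.get(intent, self.default)
        result = agent(request)  # sub-agents report back to the hub only
        self.history.append({"request": request, "intent": intent, "result": result})
        return result

hub = Orchestrator({"billing": billing_agent, "support": tech_support_agent},
                   fallback_agent)
```

Every request and every result passes through `handle`, which is what makes the pattern easy to audit and also what makes the hub the bottleneck described above.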

Peer-to-peer orchestration allows agents to communicate directly, without a central coordinator. Each agent knows which other agents it can delegate to or consult, and the communication graph can be dynamic — agents can discover and invoke other agents based on the task requirements. This pattern enables the iterative reasoning loops — research-validate-refine cycles, multi-agent debate, consensus-building — that define advanced multi-agent systems.

Peer-to-peer is the right pattern when agents need to challenge, refine, and build on each other's outputs. A due diligence system where a research agent, a financial analysis agent, and a legal review agent each contribute findings and cross-validate is inherently peer-to-peer — no single orchestrator can manage the multi-directional flow of information and the iterative refinement that produces reliable results.

The challenge of peer-to-peer orchestration is governance. Without a central coordinator, determining who decided what — and why — requires explicit logging and audit trails at every agent-to-agent communication point. This is the primary reason most enterprise implementations start with hub-and-spoke and evolve toward peer-to-peer as their governance infrastructure matures.
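
One way to picture the audit requirement is a message bus that refuses to deliver anything it has not logged. The sketch below is an assumption-laden illustration: `PeerNetwork`, `AuditEntry`, and the three agent names are hypothetical, and delivery is reduced to appending to an in-memory inbox.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class AuditEntry:
    sender: str
    receiver: str
    message: str
    reason: str  # why the sender contacted the receiver

@dataclass
class PeerNetwork:
    """Hypothetical bus: agents talk directly, and every hop is logged."""
    allowed: Dict[str, Set[str]]  # which agents each agent may contact
    audit_log: List[AuditEntry] = field(default_factory=list)
    inbox: Dict[str, List[str]] = field(default_factory=dict)

    def send(self, sender: str, receiver: str, message: str, reason: str) -> None:
        if receiver not in self.allowed.get(sender, set()):
            raise PermissionError(f"{sender} may not contact {receiver}")
        # Log before delivery so the trail exists even if the receiver fails.
        self.audit_log.append(AuditEntry(sender, receiver, message, reason))
        self.inbox.setdefault(receiver, []).append(message)

    def trail_for(self, agent: str) -> List[AuditEntry]:
        """Everything an agent sent or received, in chronological order."""
        return [e for e in self.audit_log if agent in (e.sender, e.receiver)]

network = PeerNetwork(allowed={
    "research": {"financial", "legal"},
    "financial": {"research", "legal"},
    "legal": {"research", "financial"},
})
network.send("research", "financial", "revenue figures look inconsistent",
             "request validation")
network.send("financial", "legal", "contingent liability found",
             "request legal review")
```

Calling `network.trail_for("financial")` then answers the "who decided what, and why" question for that agent without any central coordinator.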

Hierarchical orchestration combines elements of both. A top-level orchestrator delegates to team leads, each of which coordinates a group of specialised agents. The financial analysis team has a team lead that coordinates analyst, researcher, and risk assessment agents. The legal review team has a team lead that coordinates contract review, regulatory compliance, and IP assessment agents. The top-level orchestrator coordinates between teams, not between individual agents.

This pattern scales better than flat hub-and-spoke because each team lead only needs to understand its own agents, and the top-level orchestrator only needs to understand teams, not individual capabilities. It is the pattern that most closely mirrors how large organisations actually coordinate work — departments with managers who coordinate specialists, reporting to executives who coordinate departments.
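
The two-level structure can be sketched in a few lines. The class names and the string results below are hypothetical stand-ins for real agent calls; the point is only that the top level addresses teams and never individual specialists.

```python
from typing import Dict, List

class TeamLead:
    """Coordinates only its own specialists."""

    def __init__(self, name: str, specialists: List[str]) -> None:
        self.name = name
        self.specialists = specialists

    def run(self, task: str) -> Dict[str, str]:
        # Stand-in for real agent calls: each specialist returns a labelled finding.
        return {s: f"{s} finding for '{task}'" for s in self.specialists}

class TopLevelOrchestrator:
    """Knows teams, not individual agents."""

    def __init__(self, teams: Dict[str, TeamLead]) -> None:
        self.teams = teams

    def run(self, task: str, team_names: List[str]) -> Dict[str, Dict[str, str]]:
        return {name: self.teams[name].run(task) for name in team_names}

teams = {
    "financial": TeamLead("financial", ["analyst", "researcher", "risk_assessment"]),
    "legal": TeamLead("legal", ["contract_review", "regulatory_compliance",
                                "ip_assessment"]),
}
top = TopLevelOrchestrator(teams)
report = top.run("acquisition of target company", ["financial", "legal"])
```

Adding a specialist changes one `TeamLead` and nothing in `TopLevelOrchestrator`, which is the scaling property the pattern relies on.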

Memory architecture: how agents learn

Memory architecture is the decision that most directly determines whether a multi-agent system produces linear or compounding value. It is also the decision that most organisations under-invest in because it has no immediate visible impact — a system without shared memory works fine for the first hundred interactions. It fails to compound value across the first thousand.

Isolated memory means each agent maintains its own context and history. When the sales agent learns that a customer has a specific pain point, that knowledge stays with the sales agent. The support agent that handles the same customer's next interaction starts without that context. Isolated memory is simple to implement, has no coordination overhead, and is the default in most frameworks. It is also the architecture that produces the Level 1 tool-level value that McKinsey, BCG, and Bain consistently identify as insufficient for meaningful EBIT impact.

Shared memory means agents read from and write to a common knowledge base. The sales agent's discovery about the customer's pain point is written to shared memory. When the support agent handles the next interaction, it reads that context and responds accordingly. When the product team's agent analyses feature requests, it can query the shared memory for patterns across all customer interactions, not just the ones it participated in.

Shared memory is where knowledge compounds. But it introduces hard design problems: concurrency (what happens when two agents write conflicting information simultaneously), relevance (how does an agent retrieve only the memory that matters for its current task, not the entire history), decay (how does the system handle knowledge that becomes outdated), and privacy (which agents should have access to which memories, particularly when the knowledge includes customer data subject to data protection requirements).
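
Three of these four problems can be made concrete in a small in-memory store. The sketch below assumes optimistic versioning for concurrency, a single time-to-live for decay, and a per-record reader list for privacy; relevance-based retrieval is deliberately left out. `SharedMemory`, `Record`, and the key format are hypothetical.

```python
import threading
import time
from dataclasses import dataclass
from typing import Dict, Optional, Set

@dataclass
class Record:
    value: str
    version: int
    written_at: float
    readers: Set[str]  # agents allowed to read this record

class SharedMemory:
    def __init__(self, ttl_seconds: float) -> None:
        self._lock = threading.Lock()
        self._data: Dict[str, Record] = {}
        self._ttl = ttl_seconds

    def write(self, key: str, value: str, readers: Set[str],
              expected_version: int = 0, now: Optional[float] = None) -> bool:
        """Succeeds only if the caller saw the latest version (0 means new key)."""
        now = time.time() if now is None else now
        with self._lock:
            current = self._data.get(key)
            current_version = current.version if current else 0
            if expected_version != current_version:
                return False  # another agent wrote first; re-read and retry
            self._data[key] = Record(value, current_version + 1, now, set(readers))
            return True

    def read(self, key: str, agent: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        with self._lock:
            rec = self._data.get(key)
        if rec is None or agent not in rec.readers:
            return None  # privacy: missing and forbidden keys look the same
        if now - rec.written_at > self._ttl:
            return None  # decay: stale knowledge is not served
        return rec.value

mem = SharedMemory(ttl_seconds=3600)
ok_first = mem.write("customer:42:pain_point", "slow onboarding",
                     {"sales", "support"}, now=0.0)
# A second agent that never saw version 1 tries to overwrite and is rejected.
ok_stale = mem.write("customer:42:pain_point", "pricing", {"sales"},
                     expected_version=0, now=1.0)
```

The `now` parameter exists only so the behaviour can be exercised deterministically; production code would rely on the clock and on a real datastore with its own concurrency guarantees.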

Persistence strategy determines whether memory survives beyond a single session. Session-scoped memory (the default in most frameworks) means all context is lost when the conversation ends. Persistent memory means knowledge accumulated during one interaction is available in the next, across days, weeks, and months. The financial multi-agent system described in the architecture decision article — where agents maintain a kanban-style board of ongoing findings, proposals, and tracked issues — requires persistent shared memory. Without it, the system cannot compound knowledge over time, and the core value proposition collapses.

The practical implementation pattern that works at enterprise scale is a tiered approach: short-term memory (the current conversation context), medium-term memory (the current session or task), and long-term memory (the persistent knowledge base that survives across all interactions). Each tier has different retrieval strategies, different storage requirements, and different governance rules.
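
A minimal sketch of the three tiers, assuming plain dictionaries for storage and an explicit `keep` set as the governance hook that decides what gets persisted. The class and method names are hypothetical.

```python
from typing import Dict, Optional, Set

class TieredMemory:
    """Lookup falls through conversation -> session -> long-term."""

    def __init__(self) -> None:
        self.short_term: Dict[str, str] = {}   # current conversation context
        self.medium_term: Dict[str, str] = {}  # current session or task
        self.long_term: Dict[str, str] = {}    # persistent knowledge base

    def recall(self, key: str) -> Optional[str]:
        for tier in (self.short_term, self.medium_term, self.long_term):
            if key in tier:
                return tier[key]
        return None

    def end_conversation(self) -> None:
        # Promote conversation context into the session tier, then clear it.
        self.medium_term.update(self.short_term)
        self.short_term.clear()

    def end_session(self, keep: Set[str]) -> None:
        # Only explicitly selected keys are persisted across sessions.
        for key in keep:
            if key in self.medium_term:
                self.long_term[key] = self.medium_term[key]
        self.medium_term.clear()

memory = TieredMemory()
memory.short_term["pain_point"] = "slow onboarding"
memory.short_term["mood"] = "frustrated"
memory.end_conversation()
memory.end_session(keep={"pain_point"})
```

After the session ends, `memory.recall("pain_point")` still returns the stored value while `memory.recall("mood")` returns `None`, which shows each tier applying its own retention rule.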

Governance layer: who decides what

The AI decision architecture that applies to individual AI systems becomes considerably more complex in multi-agent environments, because decisions emerge from interactions between agents rather than from a single system. A well-governed multi-agent system requires three types of rules, implemented as system constraints rather than policy documents.

Delegation rules define what each agent can decide independently, what requires confirmation from another agent, and what requires human approval. A procurement agent can approve purchases below €5,000 without human involvement. Between €5,000 and €50,000, the compliance agent must confirm that the purchase meets policy requirements. Above €50,000, a human decision-maker must approve. These thresholds are not suggestions — they are hard constraints enforced in code.
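
Enforced in code, the thresholds above might look like the following sketch. The function and enum names are hypothetical, and one detail is an assumption: the text does not say which tier an amount of exactly €5,000 or €50,000 falls into, so this version sends both to the compliance tier.

```python
from decimal import Decimal
from enum import Enum

class Approval(Enum):
    AUTONOMOUS = "procurement agent may approve alone"
    COMPLIANCE_AGENT = "compliance agent must confirm"
    HUMAN = "human decision-maker must approve"

AUTONOMOUS_LIMIT = Decimal("5000")
COMPLIANCE_LIMIT = Decimal("50000")

def required_approval(amount_eur: Decimal) -> Approval:
    """Map a purchase amount to the approval it requires."""
    if amount_eur < 0:
        raise ValueError("amount must be non-negative")
    if amount_eur < AUTONOMOUS_LIMIT:
        return Approval.AUTONOMOUS
    # Assumption: both boundary values belong to the compliance tier.
    if amount_eur <= COMPLIANCE_LIMIT:
        return Approval.COMPLIANCE_AGENT
    return Approval.HUMAN
```

Using `Decimal` rather than floats keeps the comparison exact at the boundaries, which matters when a threshold is a hard constraint.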

Escalation rules define what happens when agents encounter situations outside their operational boundaries. An autonomous monitoring agent that detects an anomaly it was not designed to handle must escalate — not attempt to resolve it. The escalation path must be explicit: which agent receives the escalation, what information is passed, and what happens if the escalation is not acknowledged within a defined time window. Implicit escalation — where agents attempt to handle situations beyond their capability because no explicit boundary was set — is the most common failure mode in multi-agent systems and the most expensive to recover from.
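
An explicit escalation path with an acknowledgement window can be sketched as below. The chain members, the 300-second window, and the class names are assumptions made for illustration; time is passed in explicitly so the logic can be checked without waiting.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Escalation:
    anomaly: str
    context: dict  # the information passed along with the escalation
    raised_at: float
    acknowledged_by: Optional[str] = None

class EscalationPolicy:
    def __init__(self, chain: List[str], ack_timeout_s: float) -> None:
        self.chain = chain
        self.ack_timeout_s = ack_timeout_s

    def current_owner(self, esc: Escalation, now: float) -> str:
        """Who is responsible for this escalation at time `now`."""
        if esc.acknowledged_by is not None:
            return esc.acknowledged_by
        # Each missed acknowledgement window moves the escalation one step up.
        missed = int((now - esc.raised_at) // self.ack_timeout_s)
        return self.chain[min(missed, len(self.chain) - 1)]

policy = EscalationPolicy(
    chain=["incident_agent", "on_call_engineer", "duty_manager"],
    ack_timeout_s=300,
)
esc = Escalation("unrecognised anomaly", {"source": "monitoring_agent"},
                 raised_at=0.0)
```

The last entry in the chain never times out, so an escalation always has an owner even if nobody acknowledges it.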

Boundary rules define what agents are explicitly prohibited from doing. A customer-facing agent cannot make pricing commitments that override the pricing engine. A supply chain agent cannot commit to delivery dates that the logistics system cannot fulfil. A compliance agent can flag risks but cannot approve its own risk assessments — a different agent or a human must confirm. Boundary rules prevent the failure mode where locally optimal agent decisions produce globally harmful outcomes.
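
The prohibitions above can be expressed as a check that runs before any action executes. This is a sketch under stated assumptions: the agent and action identifiers are hypothetical, and the deny-list approach means anything not listed is permitted, which a stricter system might invert into an allow-list.

```python
from typing import Optional

class BoundaryViolation(Exception):
    """Raised when an agent attempts a prohibited action."""

# Hypothetical deny-list of (agent, action) pairs, mirroring the examples above.
PROHIBITED = {
    ("customer_agent", "override_pricing_engine"),
    ("supply_chain_agent", "commit_unconfirmed_delivery_date"),
}

def is_allowed(agent: str, action: str, author: Optional[str] = None) -> bool:
    if (agent, action) in PROHIBITED:
        return False
    # Separation of duties: the author of a risk assessment may not approve it.
    if action == "approve_risk_assessment" and author == agent:
        return False
    return True

def enforce(agent: str, action: str, author: Optional[str] = None) -> None:
    """Call before executing an action; raises instead of silently proceeding."""
    if not is_allowed(agent, action, author):
        raise BoundaryViolation(f"{agent} may not perform {action}")
```

For example, `enforce("compliance_agent", "approve_risk_assessment", author="compliance_agent")` raises `BoundaryViolation`, while the same call from a different reviewing agent passes.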

The governance frameworks for mid-market companies apply directly to multi-agent systems, but with an additional layer of complexity: governance must cover not only individual agent decisions but the emergent behaviour of the system as a whole. A multi-agent system where each individual agent operates within its governance boundaries can still produce ungoverned outcomes if the interaction between agents is not explicitly governed. This is the governance gap that McKinsey identifies as the primary barrier to agentic AI scaling — nearly two-thirds of organisations lack the governance frameworks for autonomous agent systems.

Model routing: matching capability to cost

Model routing is the decision about which AI model handles which task within the multi-agent system. It is an economic decision as much as a technical one, and getting it right can reduce inference costs by 60 to 80 percent while maintaining or improving output quality.

The principle is straightforward: use the cheapest model that meets the quality requirement for each specific task. A triage agent that classifies incoming requests into five categories does not need a frontier model — a small, fast model like Claude Haiku or GPT-4o mini handles classification at a fraction of the cost with comparable accuracy. A reasoning agent that analyses complex financial scenarios and generates recommendations needs a frontier model — Claude Opus or GPT-4 — because the reasoning quality directly determines the output value. A code generation agent benefits from a model specifically trained for code.
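
The "cheapest model that meets the requirement" rule reduces to a filter followed by a minimum. In the sketch below the model names, capability tiers, and cost figures are placeholders invented for illustration, not real models or prices.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class ModelOption:
    name: str
    capability: int            # hypothetical tier: 1 small, 2 mid, 3 frontier
    cost_per_1k_tokens: float  # placeholder figures, not real prices

CATALOGUE: List[ModelOption] = [
    ModelOption("small-fast", 1, 0.1),
    ModelOption("mid-tier", 2, 1.0),
    ModelOption("frontier", 3, 10.0),
]

# Minimum capability each task type needs (an assumption for this example).
TASK_REQUIREMENTS: Dict[str, int] = {
    "triage": 1,
    "routine": 2,
    "complex_reasoning": 3,
}

def route(task_type: str) -> ModelOption:
    """Return the cheapest model whose capability meets the task requirement."""
    required = TASK_REQUIREMENTS[task_type]
    eligible = [m for m in CATALOGUE if m.capability >= required]
    if not eligible:
        raise LookupError(f"no model meets requirement for {task_type}")
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)
```

In practice the capability tier per task would come from evaluation results rather than a hand-written table, but the selection logic stays the same.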

In practice, model routing introduces three sources of complexity. Latency varies between models, and switching models mid-task introduces variable response times that affect the user experience and the coordination timing between agents. Prompt formats and system prompts may need adjustment when switching between model families — a prompt optimised for Claude may underperform on GPT and vice versa. And fallback logic is necessary — when a model API is unavailable or rate-limited, the system must gracefully route to an alternative without losing context or producing inconsistent results.
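
Fallback logic, the third of these, can be sketched as an ordered list of clients tried in turn. `ModelUnavailable` and the two client functions are hypothetical stand-ins for real API wrappers, and the per-model prompt adjustment mentioned above is omitted for brevity.

```python
from typing import Callable, List, Sequence, Tuple

class ModelUnavailable(Exception):
    """Raised by a client when its API is down or rate-limited (hypothetical)."""

def call_with_fallback(
    prompt: str,
    clients: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each client in order; return (model_name, response) from the first that works."""
    errors: List[str] = []
    for name, client in clients:
        try:
            # The same prompt is passed to every client, so no context is lost.
            return name, client(prompt)
        except ModelUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all models unavailable: " + "; ".join(errors))

def primary(prompt: str) -> str:
    raise ModelUnavailable("rate limited")

def secondary(prompt: str) -> str:
    return f"answer to: {prompt}"

used, answer = call_with_fallback(
    "classify this ticket",
    [("primary", primary), ("secondary", secondary)],
)
```

Returning the model name alongside the response lets downstream agents and the observability layer see that a fallback occurred.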

The economic case for model routing is compelling. An enterprise multi-agent system processing ten thousand interactions per day where every agent uses a frontier model will spend five to ten times more on inference than an identical system where triage uses a small model, routine tasks use a mid-tier model, and only complex reasoning tasks use a frontier model. The inference cost analysis covers the economics in detail. In multi-agent systems, the savings multiply because the number of LLM calls per user interaction is higher — each agent in the chain makes its own calls, and a five-agent pipeline means five times the opportunity for cost optimisation through model routing.
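
The arithmetic behind such a comparison is simple enough to sketch. Every number below is a placeholder in relative cost units, chosen only to show the calculation; neither the per-call costs nor the call mix come from the sources cited in this article.

```python
from typing import Dict

# Placeholder relative cost per LLM call for each tier (not real prices).
COST_PER_CALL: Dict[str, int] = {"small": 1, "mid": 5, "frontier": 50}

# Hypothetical mix: out of every 10 calls, how many each tier can handle.
CALL_MIX: Dict[str, int] = {"small": 5, "mid": 4, "frontier": 1}

def all_frontier_cost(mix: Dict[str, int]) -> int:
    """Cost when every call goes to the frontier model."""
    return sum(mix.values()) * COST_PER_CALL["frontier"]

def routed_cost(mix: Dict[str, int]) -> int:
    """Cost when each call goes to the tier it actually needs."""
    return sum(n * COST_PER_CALL[tier] for tier, n in mix.items())

ratio = all_frontier_cost(CALL_MIX) / routed_cost(CALL_MIX)
```

With these placeholder values the unrouted system costs 500 units per ten calls against 75 for the routed one, a ratio of roughly 6.7. The ratio can never exceed the inverse of the share of calls that genuinely need a frontier model, so the achievable saving depends heavily on how much of the workload is routine.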

Observability: monitoring what you cannot see

Monitoring AI in production is challenging for single-model systems. For multi-agent systems, it is an order of magnitude more complex because failures can occur at any point in the agent chain, and the cause of a bad outcome may be several agents removed from the symptom.

Decision tracing means recording not just the final output but the reasoning chain that produced it. When the procurement agent approves a purchase order that later turns out to violate policy, you need to trace back: which agent flagged the order as compliant, what data did it base that assessment on, which other agents were consulted, and where in the chain did the error occur. Without decision tracing, debugging multi-agent systems is effectively guesswork.
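
A decision trace can be modelled as steps that each point to the step they built on. The sketch below assumes a single parent per step for simplicity, whereas a system in which several agents are consulted on one decision would need a graph. All class, agent, and field names are hypothetical.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class TraceStep:
    step_id: str
    agent: str
    decision: str
    inputs: Dict[str, str]  # the data the decision was based on
    parent_id: Optional[str]

class DecisionTrace:
    def __init__(self) -> None:
        self.steps: Dict[str, TraceStep] = {}

    def record(self, agent: str, decision: str, inputs: Dict[str, str],
               parent_id: Optional[str] = None) -> str:
        step_id = uuid.uuid4().hex
        self.steps[step_id] = TraceStep(step_id, agent, decision, inputs, parent_id)
        return step_id

    def lineage(self, step_id: str) -> List[TraceStep]:
        """Walk from a decision back to its root; returned root first."""
        chain: List[TraceStep] = []
        current: Optional[str] = step_id
        while current is not None:
            step = self.steps[current]
            chain.append(step)
            current = step.parent_id
        return list(reversed(chain))

trace = DecisionTrace()
a = trace.record("intake_agent", "parsed purchase order", {"document": "po.pdf"})
b = trace.record("compliance_agent", "flagged as compliant",
                 {"policy_version": "v3"}, parent_id=a)
c = trace.record("procurement_agent", "approved", {"amount_eur": "12000"},
                 parent_id=b)
```

Calling `trace.lineage(c)` returns the three steps in order, including the inputs each agent relied on, which is the information needed to locate where in the chain an error was introduced.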

Confidence monitoring means tracking the confidence scores of each agent's outputs and alerting when confidence drops below defined thresholds. An agent that suddenly starts producing low-confidence outputs may indicate model degradation, data quality issues, or changes in the input distribution that the agent was not designed to handle. In a multi-agent system, a low-confidence output from one agent propagates through the chain — the downstream agents make decisions based on uncertain inputs, and the uncertainty compounds.
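
A rolling-average alert and a simple model of compounding uncertainty can be sketched as follows. The threshold, window size, and agent name are illustrative assumptions, and `chain_confidence` assumes each agent's confidence is independent of the others, which is a simplification.

```python
import math
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable

class ConfidenceMonitor:
    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        self.scores: Dict[str, Deque[float]] = defaultdict(
            lambda: deque(maxlen=window)
        )

    def observe(self, agent: str, confidence: float) -> bool:
        """Record a score; return True when the rolling mean is below the threshold."""
        recent = self.scores[agent]
        recent.append(confidence)
        return sum(recent) / len(recent) < self.threshold

def chain_confidence(scores: Iterable[float]) -> float:
    """Confidence of a whole chain, assuming independent per-agent scores."""
    return math.prod(scores)

monitor = ConfidenceMonitor(threshold=0.7, window=3)
alerts = [monitor.observe("risk_agent", s) for s in (0.9, 0.8, 0.5, 0.4, 0.3)]
```

The rolling window avoids alerting on a single low score, and `chain_confidence([0.9, 0.9, 0.9])` comes out at about 0.73, which shows how three individually confident agents can still produce a noticeably less certain result.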

Performance monitoring means tracking latency, throughput, and error rates per agent and per agent chain. A multi-agent pipeline that processes a customer request through five agents may take anywhere from two seconds to thirty seconds depending on which agents are invoked, which models they use, and whether any external API calls are involved. Understanding where time is spent — and where it spikes — is essential for maintaining acceptable response times as the system scales.
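
Per-agent latency and error rate can be tracked with very little machinery. The sketch below uses a nearest-rank 95th percentile and hypothetical names; throughput and per-chain aggregation are left out, and the latencies are fed in by hand rather than measured.

```python
from collections import defaultdict
from typing import DefaultDict, List

class AgentMetrics:
    def __init__(self) -> None:
        self.latencies: DefaultDict[str, List[float]] = defaultdict(list)
        self.errors: DefaultDict[str, int] = defaultdict(int)

    def record(self, agent: str, latency_s: float, ok: bool = True) -> None:
        self.latencies[agent].append(latency_s)
        if not ok:
            self.errors[agent] += 1

    def p95_latency(self, agent: str) -> float:
        """Nearest-rank 95th percentile of recorded latencies."""
        data = sorted(self.latencies[agent])
        rank = (95 * len(data) + 99) // 100  # integer ceiling of 0.95 * n
        return data[rank - 1]

    def error_rate(self, agent: str) -> float:
        return self.errors[agent] / len(self.latencies[agent])

metrics = AgentMetrics()
for i in range(1, 21):
    # Twenty calls taking 1 to 20 seconds; only the slowest one fails.
    metrics.record("extractor", latency_s=float(i), ok=(i != 20))
```

Tracking a high percentile rather than the mean is a deliberate choice here, since the spikes are what users and downstream agents actually experience.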

Cost monitoring means tracking inference costs per agent, per chain, and per use case. Without cost monitoring, inference expenses grow invisibly until they appear as a line item in the quarterly cloud bill. Model routing optimises costs at design time. Cost monitoring ensures those optimisations hold as usage patterns evolve.
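
A cost ledger that aggregates along the three dimensions named above might look like this sketch. The prices, token counts, budget, and all identifiers are placeholder assumptions.

```python
from collections import defaultdict
from decimal import Decimal
from typing import DefaultDict, Dict, List

class CostLedger:
    def __init__(self) -> None:
        self.by_agent: DefaultDict[str, Decimal] = defaultdict(Decimal)
        self.by_chain: DefaultDict[str, Decimal] = defaultdict(Decimal)
        self.by_use_case: DefaultDict[str, Decimal] = defaultdict(Decimal)

    def record(self, agent: str, chain: str, use_case: str,
               tokens: int, price_per_1k: Decimal) -> None:
        cost = Decimal(tokens) / 1000 * price_per_1k
        self.by_agent[agent] += cost
        self.by_chain[chain] += cost
        self.by_use_case[use_case] += cost

    def over_budget(self, budgets: Dict[str, Decimal]) -> List[str]:
        """Use cases whose accumulated cost exceeds their budget."""
        return sorted(uc for uc, limit in budgets.items()
                      if self.by_use_case[uc] > limit)

ledger = CostLedger()
# Placeholder prices per 1,000 tokens, not real figures.
ledger.record("triage", "support_chain", "customer_service", 2000, Decimal("0.10"))
ledger.record("reasoner", "support_chain", "customer_service", 4000, Decimal("5.00"))
flagged = ledger.over_budget({"customer_service": Decimal("10")})
```

Checking `over_budget` on a schedule turns cost from a quarterly surprise into an alert, and the per-agent totals show whether the routing decisions made at design time still hold.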

The framework landscape — positioned correctly

With the five architectural decisions defined, the framework choice becomes a practical question: which framework provides the best abstractions for your specific orchestration pattern, memory architecture, governance requirements, model routing strategy, and observability needs? The four leading frameworks each have genuine strengths and genuine limitations.

AutoGen (Microsoft Research) excels at tool use, code execution, and structured multi-agent coordination. The v1.0 release in 2026 introduced an event-driven architecture that improves production reliability. Enterprise teams typically work across three layers: AgentChat for high-level conversational patterns and rapid prototyping, Core for low-level primitives when AgentChat abstractions leak, and Studio for visual interfaces and stakeholder demos. AutoGen is the natural choice for Microsoft-adjacent enterprises that want a pro-code framework while staying within the Microsoft ecosystem. Its limitation is that the conversational model can make debugging difficult — agents make decisions dynamically, and reproducing failures requires reconstructing the exact conversation state that led to the error.

LangGraph (LangChain) models agents as nodes in a directed graph with conditional edges and shared state. It excels at stateful workflows, cyclical processes, and long-running operations. The checkpointing system enables human-in-the-loop interruptions and automatic retries without losing progress, which makes production deployment more predictable. LangGraph is the strongest choice for complex workflows with branching logic, conditional paths, and state persistence requirements. Its limitation is complexity — the graph-based model is powerful but has a steeper learning curve than conversation-based frameworks, and simple use cases feel over-engineered.

CrewAI uses role-based agent teams where each agent has a defined role, goal, and backstory. Tasks are assigned to agents based on their roles, and the framework handles delegation and coordination. CrewAI is the fastest path from concept to working multi-agent system for business process automation — define the roles, define the tasks, and the framework orchestrates. Its limitation is architectural depth — the role-based model is intuitive but provides less fine-grained control over orchestration patterns, memory management, and model routing than LangGraph or AutoGen.

Claude Agent SDK (Anthropic) provides native tool use, extended thinking for complex reasoning, and computer use capabilities. It has gained significant traction in enterprise deployments, particularly for use cases where reasoning depth determines output quality. Its strength is deep reasoning tasks and autonomous workflows where the quality of each agent's output is critical — the extended thinking capability allows agents to reason through complex problems before producing outputs. Its limitation is ecosystem — it is optimised for Claude models and does not provide the multi-model routing flexibility that AutoGen and LangGraph offer natively.

The honest assessment: no framework is best for everything. AutoGen fits Microsoft-centric enterprises building moderate-complexity agent systems. LangGraph fits engineering teams building complex, stateful workflows that need production-grade persistence and human-in-the-loop. CrewAI fits teams that need to prototype and ship role-based agent systems fast. Claude Agent SDK fits teams prioritising reasoning quality and autonomous capability over multi-model flexibility. Most enterprise multi-agent architectures will eventually use more than one framework — different frameworks for different subsystems, connected through well-defined APIs.

What determines success

The organisations that Bain reports as achieving 10 to 25 percent EBITDA improvement from workflow-level agent deployment did not get there by choosing the right framework. They got there by getting the architectural decisions right: orchestration patterns that match the actual workflow, memory systems that compound knowledge, governance layers that enable autonomy within boundaries, model routing that balances cost and quality, and observability that makes the system debuggable and auditable.

The framework is the tool. The architecture is the leverage. The workflow redesign is the value.

A Fit Call assesses your multi-agent architecture requirements — orchestration patterns, memory needs, governance boundaries, and model routing strategy — and maps them to the framework and deployment approach that matches your organisation's workflows, data landscape, and strategic ambition.

References: AutoGen v1.0 GA documentation, Microsoft Research, 2026; LangGraph v0.4 documentation, LangChain, 2026; CrewAI documentation, 2026; Claude Agent SDK documentation, Anthropic, 2026; Bain & Company, "Technology Report 2025," 2025; McKinsey & Company, "The State of AI: How Organizations Are Rewiring to Capture Value," 2025; BCG, "AI Radar 2025: From Potential to Profit," 2025.