Multi agent systems for complex tasks
Fundamentals
Agentic systems reveal hard boundaries in current LLM architectures. During pre-training, models are exposed to a distribution of sequence lengths, with the vast majority of examples being substantially shorter than the maximum supported context length. As a result, relatively few optimization updates occur on extremely long sequences, limiting the model's ability to learn long-range dependencies. Moreover, positional representations remain a weak point in long-context model architectures because they must generalize to increasingly distant token positions, where the model's ability to distinguish relative or absolute positions degrade.
Although long-context capabilities have improved dramatically since early transformer models such as BERT (which was trained on sequences of up to 512 tokens), this all means that context management is still fundamental to agent use. Therefore, we need to continually monitor the context length used by our agents and attempt to frequently clear context where possible.
When we talk about agents and subagents, we’re talking about how we manufacture input sequences that we feed to a stateless LLM. For a subagent, that input sequence usually narrows the task and the available tool set. The model can then generate a structured request indicating that a tool should be used, much like writing a function call. On the backend, the subagent request can also carry additional controls such as model choice, token budget, output schema, and specific permissions. This result is read, the tool is executed, and the trace is fed back into the LLM, which decides how to proceed (creating another turn, invoking another tool, or ending the conversation).
Context management
But how should we provide context to an agent in the first place? For the reasons stated above, we prefer to provide a “encyclopedia” or readily available lookup table, rather than injecting hundreds of thousands of tokens every time, or a summary that provides partial context, where a full context lookup would have allowed an agent to succeed at a task.
Similarly, rather than requiring subagents to insert context into the primary agent, we implement artifacts. Subagents call tools to store their work as artifacts, then pass lightweight references back to the main agent.
Delegation
Discretising a task into a series of sub-tasks is one of the best ways to delegate work, and it’s particularly applicable to AI agents for multiple reasons. Firstly, when working in smaller steps, agents make fewer mistakes, and they tend to be less catastrophic. Breaking work into discrete tasks also reduces ambiguity, improving the alignment between model behaviour and user intent.
This naturally leads to multi-agent systems, where specialised subagents tackle different aspects of a problem. Subagents are excellent for exploring a large problem space because they operate within their own context windows, exploring different sections in parallel. As they have received relatively sparse input instructions, they can rapidly accumulate context, by operating multiple read tools in parallel. This allows subagents to work substantially faster, exploring tens of thousands of tokens, but only returning a condensed artifact of a few thousand tokens to be picked up by the main thread.
The flip side of subagents not receiving a complete system prompt, is that they tend to get derailed easily. Hence, most multi-agent setups are limited to read-only subagents, like web search subagents. It’s difficult to utilise multiple write-agents in parallel, as they are prone to failure individually, and it becomes difficult to consolidate the divergent changes into a single useful output. Hence, in most multi-agent systems, writes remain synchronous, whilst read operations do not.
Every time a subagent is spawned, the orchestrator must determine which model it should use, which tools it should have access to, whether it should execute synchronously or asynchronously, how it should compress its findings, and how the primary agent will synthesise results from multiple subagents. We can formalise these roles, such as “researcher, reviewer or editor” by setting carefully scoped tools and permissions.
These systems are still susceptible to top-level problems, such as the difficulty in determining how to scale effort and evaluate how complex a task is going to be at each step, but the multi-agent framework fundamentally helps to limit the blast-radius for system writes and the size of the main thread.
Compaction
As context grows, compaction becomes inevitable. To preserve the intent of the task, as well as key steps and information, requires an intelligent compression and memory mechanism. For maximum performance, the most effective compaction strategies will treat compaction as an explicit runtime state transition, rather than a user-facing summarisation step.
The system should continuously track context pressure using token counts, context-window limits, model-specific thresholds, retry state, and resume state. This tells the runtime whether a turn can continue normally, should emit a warning to the user, or compact before the next model call.
The system should run a pre-compaction control layer. That layer records why compaction is happening, which session and turn are being changed, which model is active, and whether the operation should continue. This matters because compaction changes what the agent can see next, and can derail a long-running process.
Large tool outputs, logs, artifacts and other recoverable payloads should be cleared and replaced with read-back markers. If the data can be fetched later, it should not be paid for in the active context unless essential. The balance is in keeping a meaningful reasoning state while removing payloads that are expensive but recoverable.
A compact replacement history should include the minimal state required to continue: the current user goal, important decisions, unresolved questions, recent high-signal turns, references to recoverable artifacts, and any required system or tool context. The active conversation should then continue from this replacement state. A human-readable summary can help users understand what happened, but the runtime should depend on structured replacement data. If a user forks or rewinds a conversation, any compaction that summarizes removed turns should be invalidated or rebuilt, because it no longer describes the active history.
Memory goes into how we optimise our navigable index of information for a task. There are general best principles, but making memory task-oriented yields the best results. There are a variety of approaches, including knowledge graphs and ontology, but memory in itself is not a silver bullet, and is part of a delicate system as described above.
We continue design agents and monitor their decision patterns. There are lots of frameworks to do so, but there is as of yet no consensus on what’s best, particularly because every model behaves differently, so we continue to design feedback loops and control systems specific to our specific domain objectives (in our case, legal document drafting).
Sources
- Extending the Context of Pretrained LLMs by Dropping Their Positional EmbeddingsYoav Gelberg, Koshi Eguchi, Takuya Akiba and Edoardo Cetin. arXiv, 2025.
- Context engineering with toolsAnthropic Claude Platform cookbook.
- Unrolling the Codex agent loopOpenAI.
- Effective context engineering for AI agentsAnthropic Engineering.