Long-running autonomous agents for knowledge work
In this article we’ll break down the latest case studies from Cursor and Anthropic on long-running agent tasks, as well as our own learnings at Lexifina from using agents for knowledge work.
Today's agents work well for focused tasks, but are slow for complex projects. The natural next step is to run multiple agents in parallel, but figuring out how to coordinate them is challenging.
In theory, we can scale to an unlimited number of agents, all running in parallel to solve problems. In practice, this is limited by the need to synchronise context and to prevent agents from interfering with the user or with each other.
For knowledge work, tasks delegated to and completed autonomously by an AI agent need to be easy to verify, and the cognitive effort required to interact with the results must fit into the wider workflow. A key advantage of AI is the ability to scale up work, but not all work scales well.
In the agent prompt, I tell Claude what problem to solve and ask it to approach the problem by breaking it into small pieces, tracking what it’s working on, figuring out what to work on next, and to effectively keep going until it’s perfect.
Anthropic - Building a C Compiler
Discretising a task into a series of sub-tasks is one of the best ways to delegate work, and it’s particularly applicable to AI agents for two reasons. Firstly, when working in smaller steps, agents make fewer mistakes, and the mistakes they do make tend to be less catastrophic. Secondly, there is less ambiguity, which improves the alignment of model behaviour with task intent.
Running multiple Claude agents allows for specialization. While a few agents are tasked to solve the actual problem at hand, other specialized agents can be invoked to (for example) maintain documentation, keep an eye on code quality, or solve specialized sub-tasks.
Anthropic - Building a C Compiler
It’s easy to deploy agents with many tools at once using Model Context Protocol (MCP). However, agents given a large toolset struggle to select and use the right tool. By specialising agents, and providing each with a much smaller subset of tools relevant to its task, we largely avoid that problem.
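A minimal sketch of this scoping idea, assuming a hypothetical in-process tool registry rather than any real MCP client API. The tool names, `AgentSpec` and `tools_for` are all invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical tool registry: names mapped to placeholder callables.
# None of these names come from MCP or any real SDK.
ALL_TOOLS = {
    "read_docx": lambda path: f"contents of {path}",
    "check_fonts": lambda path: [],
    "search_playbook": lambda query: [],
    "send_email": lambda to, body: None,
    "run_sql": lambda query: [],
}


@dataclass(frozen=True)
class AgentSpec:
    """A specialised agent and the only tools it is allowed to see."""
    name: str
    allowed_tools: frozenset


def tools_for(spec: AgentSpec) -> dict:
    """Return the subset of the registry exposed to this agent."""
    unknown = spec.allowed_tools.difference(ALL_TOOLS)
    if unknown:
        raise KeyError(f"unknown tools for {spec.name}: {sorted(unknown)}")
    return {name: ALL_TOOLS[name] for name in sorted(spec.allowed_tools)}


formatting_agent = AgentSpec("formatting", frozenset({"read_docx", "check_fonts"}))
risk_agent = AgentSpec("risk", frozenset({"read_docx", "search_playbook"}))
```

The point of the design is that an agent never sees tools outside its allow-list, so it cannot pick the wrong one from a long menu.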
In legal work, for example, we might use an agent specialised in checking a DOCX file for font and formatting issues. That agent might use an agent skill to extract and evaluate the raw OOXML values encoded in the file. We might employ another agent to flag clauses for risk against a firm playbook, drawing on the playbook’s guidance and framing the task with live user information.
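As a rough illustration of what such a formatting check could look like, the sketch below reads `word/document.xml` from a DOCX archive and counts the fonts set explicitly through `w:rFonts` elements. It assumes only direct formatting matters; fonts inherited from styles or themes are ignored, and the function names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML main namespace used in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def fonts_in_docx(docx_bytes: bytes) -> Counter:
    """Count explicit w:rFonts/@w:ascii values in the document body."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    counts = Counter()
    for r_fonts in root.iter(f"{{{W_NS}}}rFonts"):
        font = r_fonts.get(f"{{{W_NS}}}ascii")
        if font:
            counts[font] += 1
    return counts


def unexpected_fonts(docx_bytes: bytes, house_font: str) -> list:
    """Return fonts other than the house font, sorted by name."""
    return sorted(f for f in fonts_in_docx(docx_bytes) if f != house_font)
```

An agent with this single capability only has to report the list that `unexpected_fonts` returns, which keeps its output easy to verify.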
This approach substantially improves the probability of an agent working successfully, for a simple reason: it reduces the number of failure modes available to the agent.
When stuck on a bug, Claude will often maintain a running doc of failed approaches and remaining tasks. In the git repository of the project, you can read through the history and watch it take out locks on various tasks.
Anthropic - Building a C Compiler
All new agents start off as “stateless”. That is, they have the knowledge of the language model, but need to assemble the context required to complete your task.
To improve an agent’s ability to retrieve relevant information, we can let it automatically map key relationships, such as the relationship between clients and documents, where these aren’t already provided in a structured way. This enables agents to navigate the knowledge base rapidly. There is a great deal of nuance in how this structural information can be auto-generated, but ultimately the output is a series of markdown files.
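A simplified sketch of the final step, writing relationships out as markdown, might look like the following. It assumes the client-to-document mapping has already been extracted into plain dictionaries; the record fields (`client`, `title`, `path`) and function names are illustrative, not a description of any particular system.

```python
from collections import defaultdict
from pathlib import Path


def build_client_index(documents: list) -> dict:
    """Group document records by client and render one markdown page per client."""
    by_client = defaultdict(list)
    for doc in documents:
        by_client[doc["client"]].append(doc)
    pages = {}
    for client, docs in sorted(by_client.items()):
        lines = [f"# {client}", ""]
        for doc in sorted(docs, key=lambda d: d["path"]):
            lines.append(f"- [{doc['title']}]({doc['path']})")
        filename = f"{client.lower().replace(' ', '-')}.md"
        pages[filename] = "\n".join(lines) + "\n"
    return pages


def write_index(pages: dict, out_dir: Path) -> None:
    """Write each rendered page to out_dir, creating the directory if needed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for filename, body in pages.items():
        (out_dir / filename).write_text(body, encoding="utf-8")
```

A fresh agent can then read one small page per client instead of scanning the whole knowledge base.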
More interestingly, we can also maintain a task-specific record of agent performance and agent failure modes. This allows other agents to synthesise novel approaches and waste less time on approaches that have already failed. Time sensitivity is another value we need to “bake in” to agents.
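One way to keep such a record is an append-only JSON Lines file that any agent can read before starting work. The sketch below is an assumption about how this could be structured; the `Attempt` fields and helper names are hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class Attempt:
    task: str
    approach: str
    succeeded: bool
    note: str = ""


def record_attempt(log_path: Path, attempt: Attempt) -> None:
    """Append one attempt to the log as a single JSON line."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(attempt)) + "\n")


def failed_approaches(log_path: Path, task: str) -> list:
    """List approaches already tried without success for a task, oldest first."""
    if not log_path.exists():
        return []
    failures = []
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            entry = json.loads(line)
            if entry["task"] == task and not entry["succeeded"]:
                failures.append(entry["approach"])
    return failures
```

Passing the result of `failed_approaches` into the next agent's prompt is a cheap way to stop it from repeating a dead end.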
Context window pollution: The test harness should not print thousands of useless bytes. At most, it should print a few lines of output and log all important information to a file so Claude can find it when needed. Logfiles should be easy to process automatically: if there are errors, Claude should write ERROR and put the reason on the same line so grep will find it. It helps to pre-compute aggregate summary statistics so Claude doesn't have to recompute them.
Anthropic - Building a C Compiler
It’s surprisingly easy to accumulate a large volume of low-value information in agent context, which degrades performance. There is no silver bullet here, but best practices include expressing changes to documents as specific patches or insertions, and providing only the most relevant information for a task (such as stripping formatting when generating text-only changes).
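To make the patch idea concrete, here is a small sketch using Python's standard `difflib` module to express an edit as a unified diff rather than resending the full text. The `clause_patch` name and the `label` parameter are illustrative choices.

```python
import difflib


def clause_patch(before: str, after: str, label: str = "clause") -> str:
    """Express an edit as a unified diff instead of resending the whole text."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"{label} (before)",
        tofile=f"{label} (after)",
        n=1,  # one line of context keeps the patch small
    )
    return "".join(diff)
```

For a long document with a one-line change, the diff holds only the changed line and its immediate neighbours, so far less text enters the agent's context.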
Parallelism also enables specialization. LLM-written code frequently re-implements existing functionality, so I tasked one agent with coalescing any duplicate code it found. I put another in charge of improving the performance of the compiler itself, and a third I made responsible for outputting efficient compiled code. I asked another agent to critique the design of the project from the perspective of a Rust developer, and make structural changes to the project to improve the overall code quality, and another to work on documentation.
Anthropic - Building a C Compiler
Tightly scoped agents are best practice. However, because each agent sees only its own slice of the task, they can duplicate work and produce a highly non-uniform document. Getting another agent to work at a higher level of abstraction is a useful way to manage that complexity.
For example, an agent can standardise how clauses are referenced within a document, and the terminology used within clauses themselves. This lowers the overall complexity of the document for both human users and agents, and prevents further divergence.
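As a toy illustration of the clause-reference case, the sketch below rewrites a few common variants into a single form. It is a deliberately narrow assumption about what "standardising" means: the regular expression only covers "clause" and "cl." prefixes, and the target style "Clause N" is an arbitrary choice.

```python
import re

# Matches variants such as "cl. 4.2", "Cl 4.2" or "clause  7(a)".
CLAUSE_REF = re.compile(
    r"\b(?:clause|cl\.?)\s*(\d+(?:\.\d+)*(?:\([a-z]\))?)",
    re.IGNORECASE,
)


def standardise_clause_refs(text: str) -> str:
    """Rewrite every matched reference as 'Clause <number>'."""
    return CLAUSE_REF.sub(lambda m: f"Clause {m.group(1)}", text)
```

In practice a reviewing agent would handle far more variation than a regular expression can, but the output contract is the same: one reference style throughout the document.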
In conclusion, most tasks still require constant, iterative changes to process, decision-making and execution. But agents can already augment and improve verification, styling and review when they are combined with increasingly sophisticated strategies and tools.