I’ve been using Steve Yegge’s Beads[1] lightweight issue tracker for this type of multi-agent context tracking.
I only run a couple of agents at a time, but with Beads you can create issues, then agents can assign them to themselves, etc. Agents or the human driver can also add context in epics, and I think you can have perpetual issues which contain context too. Or you could create them as a type of issue yourself; it’s a very flexible system.
The worktree approach is interesting - keeps the filesystem separation clean. The parallelism tradeoff makes sense if the tasks are truly independent, which in practice is most of the time anyway.
What does your spec file look like when you kick off a new agent? Curious if you start from scratch each time or carry over context from previous sessions on the same project.
I describe this in the article - I mostly kick off a new agent per spec both for Planners and Workers. I do tend to run /fd-explore before I start work on a given spec to give the agent context of the codebase and recent previous work
I've been building agent-doc [1] to solve exactly this. Each parallel Claude Code session gets its own markdown document as the interface (e.g., tasks/plan.md, tasks/auth.md). The agent reads/writes to the document, and a snapshot-based diff system means each submit only processes what changed — comments are stripped, so you can annotate without triggering responses.
The routing layer uses tmux: `agent-doc claim`, `route`, `focus`, `layout` commands manage which pane owns which document, scoped to tmux windows. A JetBrains plugin lets you submit from the IDE with a hotkey — it finds the right pane and sends the skill command.
For context sync across agents, the key insight was: don't sync. Each agent owns one document with its own conversation history. The orchestration doc (plan.md) references feature docs but doesn't duplicate their content. When an agent finishes a feature, its key decisions get extracted into SPEC.md. The documents ARE the shared context — any agent can read any document.
It's been working well for running 4-6 parallel sessions across corky (email client), agent-doc itself, and a JetBrains plugin — all from one tmux window with window-scoped routing.
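To make the snapshot-based diff idea concrete, here is a minimal sketch. It is not agent-doc's actual implementation: the HTML-style comment syntax and the function names `strip_comments` and `changed_lines` are assumptions made for illustration.

```python
import difflib
import re

# Assumption: annotations are HTML-style comments; the real tool may use another syntax.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(text: str) -> str:
    """Remove comment spans so annotations never count as changes."""
    return COMMENT_RE.sub("", text)


def changed_lines(snapshot: str, current: str) -> list[str]:
    """Return non-blank lines added since the snapshot, ignoring comments."""
    old = strip_comments(snapshot).splitlines()
    new = strip_comments(current).splitlines()
    diff = difflib.ndiff(old, new)
    # ndiff prefixes added lines with "+ "; skip blank lines left by stripped comments.
    return [line[2:] for line in diff if line.startswith("+ ") and line[2:].strip()]


snapshot = "# Plan\nAdd login page\n"
current = "# Plan\nAdd login page\n<!-- note to self -->\nAdd logout button\n"
print(changed_lines(snapshot, current))  # ['Add logout button']
```

Only the new instruction line is reported; the annotation is stripped before comparison, so it would not trigger a response.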
The "don't sync, own" model makes a lot of sense. We were thinking about it wrong - trying to push state out to a shared file, when the cleaner move is to pull it in on demand.
The SPEC.md as the extraction target after a feature is done is a nice touch. In our case the tricky part is that browser automation state is partly external - you have sessions, cookies, proxy assignments that live outside the codebase. So the "ground truth" we needed wasn't just about code decisions but about runtime state too. Ended up logging that separately.
Checking out agent-doc, the snapshot-based diff to avoid re-triggering on comments is clever. Does it handle cases where two agents edit the same doc around the same time, or is the ownership model strict enough that this doesn't come up?
I’ve been experimenting with a similar pattern but wrapping it in a “factory mode” abstraction (we’re building this at CAS[1]) where you define the spec once, after careful planning using a supervisor agent, and then let it go and spin up parallel workers against it automatically. It handles task decomposition + orchestration so you’re not manually juggling tmux panes.
I don't think number of parallel agents is the right productivity metric, or at least you need to account for agent efficiency.
Imagine a superhuman agent who does not need to run in endless loops. It could generate 100k line code-base in a few minutes or solve smaller features in seconds.
In a way, the inefficiency is what leads people to parallelism. There is only room for it because the agents are slow, perhaps the more inefficient and slower the individual agents are, the more parallel we can be.
A few experiments, like Gas Town, the compiler from Anthropic, or the browser from Cursor, managed to reach the Rocket stage, though in their reports the jagged intelligence of the LLMs was eerily apparent. Do you think we also need better models?
I do. The reason the current generation of agents is good at coding is that the labs have had sufficient time and compute to generate synthetic chain-of-thought data and feed that data through RL before using it to train the LLMs. This distillation takes time, and the clock starts at the release of the previous generation of models.
So we are only now getting agents that can reliably loop by themselves on medium-sized tasks. This generation opens a new door to agent-managing-agents chain-of-thought data. I think we will only get highly reliable multi-agent systems sometime between mid and late 2026, assuming no major geopolitical disruption.
Even on Claude Max x1, if you run 2 agents with Opus in parallel you're going to hit limits. You can balance models by use case though, but I wouldn't expect it to work on any $20 plan, even if you use Kimi Code.
No. I run a similar setup and with $200 subscription, I usually hit weekly quota by around day 3-4. My approach is 4-5 hours of extreme human in the loop spec sessions with opus and codex:
1. We discuss every question with opus, and we ask for second opinion from codex (just a skill that teaches claude how to call codex) where even I'm not sure what's the right approach
2. When context window reaches ~120k tokens, I ask opus to update the relevant spec files.
3. Repeat until all 3 of us (me, opus and codex) are happy or are starting to discuss nitpicks and YAGNIs, whichever comes first.
Then it's fully autonomous until all agents are happy.
Which is why I'm exploring optimization strategies. Based on the analysis of where most of the tokens are spent for my workflow, roughly 40% of it is thinking tokens with "hmm not sure, maybe..", 30% is code files.
So two approaches:
1. Have a cheap supervisor agent that detects that claude is unsure about something (which means spec gap) and alerts me so that I can step in
2. "Oracle" agent that keeps relevant parts of codebase in context and can answer questions from builder agents.
And also delegating some work to cheaper models like GLM where top performance isn't necessary.
You'll notice that as soon as you reach a setup you like that actually works, $200 subscription quotas will become a limiting factor.
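As a rough illustration of the first approach (spotting when the builder is unsure), here is a plain keyword heuristic standing in for the cheap supervisor model. The phrase list, the 0.3 threshold, and the function names are all assumptions; a real setup would tune them against its own transcripts or use a small model instead.

```python
import re

# Assumption: illustrative hedging phrases, not derived from real transcripts.
HEDGE_PATTERNS = [
    r"\bnot sure\b",
    r"\bmaybe\b",
    r"\bunclear\b",
    r"\bI(?:'m| am) guessing\b",
]
HEDGE_RE = re.compile("|".join(HEDGE_PATTERNS), re.IGNORECASE)


def hedge_score(thinking: str) -> float:
    """Fraction of non-empty lines that contain at least one hedging phrase."""
    lines = [ln for ln in thinking.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    hedged = sum(1 for ln in lines if HEDGE_RE.search(ln))
    return hedged / len(lines)


def needs_human(thinking: str, threshold: float = 0.3) -> bool:
    """Flag a likely spec gap when the hedging fraction reaches the threshold."""
    return hedge_score(thinking) >= threshold


sample = (
    "Hmm, not sure which table to use.\n"
    "Maybe the spec means users.\n"
    "Writing the migration now."
)
print(needs_human(sample))  # True: 2 of 3 lines hedge
```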
That does seem to argue for the checkpointing strategy of having the agent explain their plan and then work on it incrementally. When you run out of tokens you either switch projects until your quota recovers or you proceed by hand until the quota recovers.
I also kinda expect that one of the saner parts of agentic development is the skills system, that skills can be completely deterministic, and that after the Trough of Disillusionment people will be using skills a lot more and AI a lot less.
Yes on both counts. Implementation plan is a second layer after the spec is written, at which point, spec can't be changed by agents. I then launch a planner agent that writes a phased plan file and each builder can only work on a single phase from that file.
So it's spec (human in the loop) > plan > build. Then it cycles autonomously in plan > build until spec goals are achieved. This orchestration is all managed by a simple shell script.
But even with the implementation plan file, a new agent has to orient itself, load files it may later decide were irrelevant, the plan may have not been completely correct, there could have been gaps, initial assumptions may not hold, etc. It then starts eating tokens.
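The spec > plan > build cycle described above could be sketched like this. The commenter drives it from a shell script; this Python version is only an illustration, with the planner, builder, and goal check passed in as hypothetical callables rather than real agent invocations.

```python
from typing import Callable


def run_cycles(
    plan: Callable[[str], list[str]],
    build: Callable[[str], None],
    goals_met: Callable[[], bool],
    spec: str,
    max_cycles: int = 5,
) -> int:
    """Cycle plan > build until the spec goals are met; return the cycles used.

    The spec is only read here; nothing writes it back, mirroring the
    rule that agents cannot change the spec once it is written.
    """
    for cycle in range(1, max_cycles + 1):
        phases = plan(spec)      # planner produces a phased plan
        for phase in phases:     # each builder works on exactly one phase
            build(phase)
        if goals_met():
            return cycle
    raise RuntimeError(f"spec goals not met after {max_cycles} cycles")


# Stub agents: the planner hands out at most two unfinished phases per cycle.
done: list[str] = []
cycles = run_cycles(
    plan=lambda spec: [p for p in ("schema", "api", "ui") if p not in done][:2],
    build=done.append,
    goals_met=lambda: len(done) == 3,
    spec="(frozen spec text)",
)
print(cycles, done)  # 2 ['schema', 'api', 'ui']
```

The `max_cycles` cap is an added safeguard so a spec gap cannot burn quota indefinitely.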
I have /fd-verify which I execute with the Worker after it's done implementing. I didn’t feel the need to have a separate window / agent for reviewing. The same Worker can review its own code. What would be the benefits of having a separate Reviewer?
ok -- I am currently quite impressed with a dedicated verifier that has a large degree of freedom (very simple prompt). At least when it comes to backend work.
Is there a place where people like you go to share ideas around these new ways of working, other than HN? I'm very curious how these new ways of working will develop. In my system, I use voice memos to capture thoughts and they become more or less what you have as feature designs. I notice I have a lot of ideas throughout the day (Claude chews through them some time later, and when they are worked out I review its plans in Notion; I use Notion because I can upload memos into it from my phone so it's more or less what you call the index). But ideas... I can only capture them as they come, otherwise they are lost, and I don't want to spend time typing them out.
I'd love to see what is being achieved by these massive parallel agent approaches. If it's so much more productive, where is all the great software that's being built with it? What is the OP building?
Most of what I'm seeing is AI influencers promoting their shovels.
Even if somebody shows you what they've built with it, you're none the wiser. All you'll know is that it seemingly works well enough for a greenfield project.
The jury is still very far out on how agentic development affects mid/long term speed and quality. Those feedback cycles are measured in years, not weeks. If we bother to measure at all.
People in our field generally don't do what they know works, because by and large, nobody really knows, beyond personal experiences, and I guess a critical mass doesn't even really care. We do what we believe works. Programming is a pop culture.
Does good design up front matter as much if an AI can refactor in a few hours something that would take a good developer a month? Refactoring is one of those tasks that's tedious, and too non-trivial for automation, but seems perfect for an AI. Especially if you already have all the tests.
Upgrades, API compatibility, and cross version communication are really important in some domains. A bad design can cause huge pain downstream when you need to make a change.
I'm using Claude Code (loving it) and haven't dipped into the agentic parallel worker stuff yet.
Where does one get started?
How do you manage multiple agents working in parallel on a single project? Surely not the same working directory tree, right? Copies? Different branches / PRs?
You can't use your Claude Code login and have to pay API prices, right? How expensive does it get?
Check out Claude Code Team Orchestration [1].
Set an env var and ask to create a team. If you're running in tmux it will take over the session and spawn multiple agents, all coordinated through a "manager" agent. I recommend running it sandboxed with --dangerously-skip-permissions; otherwise it's endless approvals.
Churns through tokens extremely quickly, so be mindful of your plan/budget.
1. https://code.claude.com/docs/en/agent-teams
git checkout four copies of your repo (repo, repo_2, repo_3, repo_4)
within each one open claude code
Works pretty well! With the $100 subscription I usually don't get limited in a day. A lot of thinking needs to go into giving it the right context (markdown specs in repo works for us)
Obv, work on things that don't affect each other, otherwise you'll be asking them to look across PRs and that's messy.
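The comment above uses full copies of the repo. A lighter-weight variant, swapped in here as an alternative rather than what the commenter does, is `git worktree add`, which gives each agent its own directory and branch while sharing one object store. The `agent/<task>` branch naming and the sibling-directory layout are assumptions for the sketch.

```python
import subprocess
from pathlib import Path


def worktree_commands(repo: Path, tasks: list[str]) -> list[list[str]]:
    """Build one `git worktree add` command per task, each on its own branch."""
    repo = repo.resolve()  # absolute paths, so -C does not change their meaning
    commands = []
    for task in tasks:
        path = repo.parent / f"{repo.name}_{task}"  # sibling dir, e.g. repo_auth
        branch = f"agent/{task}"                    # hypothetical naming scheme
        commands.append(
            ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]
        )
    return commands


def create_worktrees(repo: Path, tasks: list[str]) -> None:
    """Run the commands; raises CalledProcessError if git rejects one."""
    for cmd in worktree_commands(repo, tasks):
        subprocess.run(cmd, check=True)


# Print the commands instead of running them, so this is safe to try anywhere.
for cmd in worktree_commands(Path("/work/repo"), ["auth", "billing"]):
    print(" ".join(cmd))
```

Each agent then runs in its own directory, and the branches are merged through ordinary PRs, which is where the advice about picking tasks that don't affect each other still applies.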
I am now releasing software for projects that have spent years on the back-burner. From my perspective, agent loops have been a success. It makes the impractical pipe-dream doable.
Yeah, I have a never-ending list of things I could easily make myself if I could set aside 7-10 hours to plan, develop, and troubleshoot them, but they are also low priority enough that they sit on the back burner perpetually.
Now these things are being made. I can justify spending 5-10 minutes on something without being upset if AI can't solve the problem yet.
And if not, I'll try again in 6 months. These aren't time sensitive problems to begin with or they wouldn't be rotting on the back burner in the first place.
That’s completely ignoring the point of the person you are responding to. They weren’t talking about small greenfield projects.
I just avoided $1.8 million/year in review time w/ parallel agents for a code review workflow.
We have 500+ custom rules that are context sensitive because I work on a large and performance sensitive C++ codebase with cooperative multitasking. Many things that are good are non-intuitive and commercial code review tools don't get 100% coverage of the rules. This took a lot of senior engineering time to review.
Anyways, I set up a massive parallel agent infrastructure in CI that chunks the review guidelines into tickets, adds them to a queue, and has agents spit out GitHub code review comments. Then a manager agent validates the comments/suggestions using scripts and posts the review. Since these are coding agents they can autonomously gather context or run code to validate their suggestions.
Instantly reduced mean time to merge by 20% in an A/B test. Assuming 50% of time on review, my org would've needed 285 more review hours a week for the same effect. Super high signal as well, it catches far more than any human can and never gets tired.
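A quick sanity check on how the figures in this comment relate. It assumes a 52-week year and a 40-hour week, and that the $1.8 million figure corresponds to the 285 hours a week; none of those three things is stated in the comment.

```python
HOURS_PER_WEEK = 285          # from the comment
SAVINGS_PER_YEAR = 1_800_000  # from the comment, in dollars
WEEKS_PER_YEAR = 52           # assumption: no adjustment for holidays

hours_per_year = HOURS_PER_WEEK * WEEKS_PER_YEAR
implied_rate = SAVINGS_PER_YEAR / hours_per_year
full_time_equivalents = HOURS_PER_WEEK / 40  # assumption: 40-hour week

print(hours_per_year)          # 14820
print(round(implied_rate, 2))  # 121.46
print(full_time_equivalents)   # 7.125
```

Under those assumptions the claim works out to roughly seven full-time reviewers at about $121 per hour.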
Likewise, we can scale this to any arbitrary review task, so I'm looking at adding benchmarking and performance tuning suggestions for menial profiling tasks like "what data structure should I use".
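The chunk, queue, review, validate flow described in this comment might look roughly like the sketch below. Everything here is a stand-in: the "agent" is a substring match, the "manager" only de-duplicates and range-checks, and all names are hypothetical. It shows the shape of the pipeline, not the commenter's infrastructure.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    rule: str
    line: int
    text: str


def chunk(rules: list[str], size: int) -> list[list[str]]:
    """Split the rule list into fixed-size tickets."""
    return [rules[i:i + size] for i in range(0, len(rules), size)]


def review_ticket(ticket: list[str], diff: list[str]) -> list[Comment]:
    """Stand-in for a coding agent: flag diff lines that mention a rule keyword."""
    return [
        Comment(rule, n, f"check rule '{rule}'")
        for rule in ticket
        for n, line in enumerate(diff, start=1)
        if rule in line
    ]


def validate(comments: list[Comment], diff: list[str]) -> list[Comment]:
    """Stand-in for the manager: drop duplicates and out-of-range line numbers."""
    unique = sorted(set(comments), key=lambda c: (c.line, c.rule))
    return [c for c in unique if 1 <= c.line <= len(diff)]


def run_review(rules: list[str], diff: list[str], size: int = 2) -> list[Comment]:
    """Review every ticket in parallel, then validate the merged comments."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        batches = list(pool.map(lambda t: review_ticket(t, diff), chunk(rules, size)))
    return validate([c for batch in batches for c in batch], diff)


diff = ["int *p = malloc(8);", "sleep(1);", "return 0;"]
for c in run_review(["malloc", "sleep", "goto"], diff):
    print(c.line, c.rule)
```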
It's for personal use, and I wouldn't call it great software, but I used Claude Code Teams in parallel to create a Fluxbox-compatible window compositor for Wayland [1].
Overall effort was a few days of agentic vibe-coding over a period of about 3 weeks. Would have been faster, but the parallel agents burn through tokens extremely quickly and hit Max plan limits in under an hour.
1. https://github.com/ecliptik/fluxland
Pretty cool!
People are building for themselves. However I’d also reference www.Every.to
They built the popular compound-engineering plugin and have shipped a set of production grade consumer apps. They offer a monthly subscription and keep adding to that subscription by shipping more tools.
I'm experimenting with building an agent swarm to take a very large existing app that's been built over the past two decades (internal to the company I work for) and reverse engineer documentation from the code so I can then use that documentation as the basis for my teams to refactor big chunks of old-no-longer-owned-by-anyone features and to build new features using AI better. The initial work to just build a large-scale understanding of exactly what we actually run in prod is a massively parallelizable task that should be a good fit for some documentation writing agents. Early days but so far my experiments seem to be working out.
Obviously no users will see a benefit directly but I reckon it'll speed up delivery of code a lot.
People are building software for themselves.
Correct. I've started recording what I've built (here https://jodavaho.io/posts/dev-what-have-i-wrought.html ), and it's 90% for myself.
The long tail of deployable software always strikes at some point, and monetization is not the first thing I think of when I look at my personal backlog.
I also am a tmux+claude enjoyer, highly recommended.
tmux too.
Trying workmux with claude. Really cool combo
I’ve known too many developers and seen their half-assed definition of Done-Done.
I actually had a manager once who would say Done-Done-Done. He’s clearly seen some shit too.
I work for Snowflake and the code I'm building is internal. I'm exploring open sourcing my main project which I built with this system. I'd love to share it one day!
The influencers generate noise, but the progress is still there. The real productivity gains will start showing up at market scale eventually.
In my view, these agent teams have really only become mainstream in the last ~3 weeks since Claude Code released them. Before that they were out there but were much more niche, like in Factory or Ralphie Wiggum.
There is a component to this that keeps a lot of the software being built with these tools underground: there are a lot of very vocal people who are quick with downvotes and criticisms of things that have been built with AI tooling, criticisms that wouldn't have been applied to the same result (or even a poorer result) if it had been produced by a human.
This is largely why I haven't released one of the tools I've built for internal use: an easy status dashboard for operations people.
Things I've done with agent teams: added a first-class ZFS backend to ganeti; rebuilt our "icebreaker" app that we use internally (largely to add special effects and make it more fun); built a "filesystem swiss army knife" for Ansible; converted a Lambda function that does image manipulation and watermarking from Pillow to pyvips, and also had it build versions in Go, Rust, and Zig for comparison's sake; built tooling for regenerating our cache of watermarked images using new branding; had it connect to a pair of MS SQL test servers and identify why log shipping was broken between them; built an Ansible playbook to deploy a new AWS account; made a simple video poker web app (a demo to show the local users group, where someone was asking how to get started with AI); and had it brainstorm and build 3 versions of a crossword-themed daily puzzle (just to see what it'd come up with; my wife and I are enjoying TiledWords).
Those are the most memorable things I've used the agent teams to build in the last 3 weeks. Many of those things are internal tools or just toys, as another reply said. Some of those are publicly released or in progress for release. Most of these are in addition to my normal work, rather than as a part of it.
Further, my POV is that coding agents crossed a chasm only last December with the Opus 4.5 release. Only since then have these kinds of agent team setups actually worked. It’s early days for agent orchestration.
can you tell us about this "ansible filesystem swiss army knife"?
There are dozens and dozens of these submitted to Show HN, though increasingly without the title prefix now. This one doesn't seem any more interesting than the others.
I picked up a number of things from others sharing their setup. While I agree some aspects of these are repetitive (like using md files for planning), I do find useful things here and there.
I built an Erlang-based chat server implementing a JMAP extension; Claude wrote the RFC and then wrote the server for it.
Erlang FTW. I remember the days at the ol' lab!
i have no use for it at my work, i wish i did, so i did this project for fun instead.
I wrote a Cash flow tracking finance app in Qt6 using claude and have been using it since Jan 1 to replace my old spreadsheets!
https://git.ceux.org/cashflow.git/
look at Show HN. Half of it is vibe-coded now.
I did a sort of bell curve with this type of workflow over summer.
- Base Claude Code (released)
- Extensive, self-orchestrated, local specs & documentation; ie waterfall for many features/longer term project goals (summer)
- Base Claude Code (today)
Claude Code is getting better at orchestrating its own subagents for divide/conquer type work.
My problem with these extensive self-orchestrated multi-agent / spec modes is the drift and rot across all the changes and integrated parts of an application, which often end up in merge conflicts. Aside from taking up my own decision-making headspace, it's also a lot to orchestrate and review. I spent a ton of time forcing Claude to use the system I put in place, including documentation updates and continuous logging of work.
I feel extremely productive with a single Claude Code for a project. Maybe for minor features, I'll launch Claude Code in the web so that it can operate in an isolated space to knock them out and create a PR.
I will plan and annotate extensively for large features, but not many features or broad project specs all at the same time. Annotation and better planning UX, I think, are going to be increasingly important for now. The only augment of Claude Code I have is a hook for plan mode review: https://github.com/backnotprop/plannotator
The merge conflicts and cognitive load are indeed two big struggles with my setup. Going back to a single Claude instance, however, would mean I’m waiting for things to happen most of the time. What do you do while Claude is busy?
It is one of those things I look and thing, yeah you are hyper productive... but it looks cognitively like being a pilot landing a plane all day long, and not what I signed up for. Where is my walk in the local park where I think through stuff and come up with a great idea :(
it can be cognitively demanding but you adapt and often get in a flow state… but it’s nothing like programming used to be though and I get that
Quite a bit.
- Research
- Scan the web
- Text friends
- Side projects
- Take walks outside
etc
This is a really cool design, pretty similar to what I've built for implementation planning. I like how iterative it is and that the whole system lives just in markdown. The verify step is a great idea I hadn't made a command yet, thank you!
This seems like it'd be great for solo projects but starts to fall apart for a team with a lot more PRs and distributed state. Heck, I run almost everything in a worktree, so even there the state is distributed. Maybe moving some of the state/plans/etc to Linear et al solves that though.
Thanks! I mainly work solo so I haven’t tested this setup in a shared project.
We ran something similar for a browser automation project - multiple agents working on different modules in parallel with shared markdown specs. The bottleneck wasn't the agents, it was keeping their context from drifting. Each tmux pane has its own session state, so you end up with agents that "know" different versions of reality by the second hour.
The spec file helps, but we found we also needed a short shared "ground truth" file the agents could read before taking any action - basically a live snapshot of what's actually done vs what the spec says. Without it, two agents would sometimes solve the same problem in incompatible ways.
Has anyone found a clean way to sync context across parallel sessions without just dumping everything into one massive file?
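A minimal sketch of the "ground truth" file described above, assuming the spec reduces to a list of task names and each agent reports claims and completions into a shared state. The `Task` fields and `render_ground_truth` are hypothetical names for illustration, not something from the thread:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    owner: Optional[str] = None  # hypothetical: which agent claimed the task
    done: bool = False


def render_ground_truth(spec_tasks: list[str], state: dict[str, Task]) -> str:
    """Render a short snapshot of what the spec asks for vs. what is actually done."""
    lines = ["# GROUND TRUTH (read before acting)"]
    for name in spec_tasks:
        task = state.get(name)
        if task is None:
            status = "unclaimed"
        elif task.done:
            status = f"done by {task.owner}"
        else:
            status = f"in progress by {task.owner}"
        lines.append(f"- {name}: {status}")
    return "\n".join(lines)
```

Regenerating this short file after every claim or completion keeps it small, so agents can re-read it cheaply instead of relying on their own session state.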
I’ve been using Steve Yegge’s Beads[1] lightweight issue tracker for this type of multi-agent context tracking.
I only run a couple of agents at a time, but with Beads you can create issues, then agents can assign them to themselves, etc. Agents or the human driver can also add context in epics, and I think you can have perpetual issues which contain context too. Or could make them as a type of issue yourself, it’s a very flexible system.
[1] https://github.com/steveyegge/beads
Beads has been on my list to try. I can see it being a natural evolution of my setup
I avoid this with one spec = one agent, with worktrees if there is a chance of code clashing. Not ideal for parallelism though.
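One way the "one spec = one agent, with worktrees" setup could be scripted, as a rough sketch. The `spec/<id>` branch naming and the sibling-directory layout are assumptions for illustration; only `git worktree add -b <branch> <path>` is standard git:

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, spec_id: str) -> list[str]:
    """Build the git command that creates an isolated worktree and branch for one spec."""
    branch = f"spec/{spec_id}"                      # hypothetical naming scheme
    path = repo.parent / f"{repo.name}-{spec_id}"   # sibling directory next to the repo
    # `git worktree add -b <branch> <path>` creates a new branch checked out at <path>.
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]


def create_worktree(repo: Path, spec_id: str) -> None:
    """Run the command; raises CalledProcessError if git fails."""
    subprocess.run(worktree_command(repo, spec_id), check=True)
```

Each agent then runs inside its own directory, and clashes only surface at merge time rather than as two agents overwriting the same files.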
The worktree approach is interesting - keeps the filesystem separation clean. The parallelism tradeoff makes sense if the tasks are truly independent, which in practice is most of the time anyway.
What does your spec file look like when you kick off a new agent? Curious if you start from scratch each time or carry over context from previous sessions on the same project.
I describe this in the article - I mostly kick off a new agent per spec both for Planners and Workers. I do tend to run /fd-explore before I start work on a given spec to give the agent context of the codebase and recent previous work
I've been building agent-doc [1] to solve exactly this. Each parallel Claude Code session gets its own markdown document as the interface (e.g., tasks/plan.md, tasks/auth.md). The agent reads/writes to the document, and a snapshot-based diff system means each submit only processes what changed — comments are stripped, so you can annotate without triggering responses.
The routing layer uses tmux: `agent-doc claim`, `route`, `focus`, `layout` commands manage which pane owns which document, scoped to tmux windows. A JetBrains plugin lets you submit from the IDE with a hotkey — it finds the right pane and sends the skill command.
For context sync across agents, the key insight was: don't sync. Each agent owns one document with its own conversation history. The orchestration doc (plan.md) references feature docs but doesn't duplicate their content. When an agent finishes a feature, its key decisions get extracted into SPEC.md. The documents ARE the shared context — any agent can read any document.
It's been working well for running 4-6 parallel sessions across corky (email client), agent-doc itself, and a JetBrains plugin — all from one tmux window with window-scoped routing.
[1] https://github.com/btakita/agent-doc
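To illustrate the general idea of a snapshot-based diff that ignores annotations, here is a rough sketch. It is not agent-doc's actual implementation: it assumes annotations are HTML-style comments, and `changed_lines` is a hypothetical helper:

```python
import difflib
import re

# Assumption: annotations are written as HTML comments in the markdown document.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def normalized_lines(text: str) -> list[str]:
    """Strip comments, trailing whitespace, and blank lines before comparing."""
    stripped = COMMENT.sub("", text)
    return [line.rstrip() for line in stripped.splitlines() if line.strip()]


def changed_lines(snapshot: str, current: str) -> list[str]:
    """Return lines added since the snapshot, ignoring comment-only edits."""
    diff = difflib.ndiff(normalized_lines(snapshot), normalized_lines(current))
    return [line[2:] for line in diff if line.startswith("+ ")]
```

With this shape, an edit that only adds a comment yields an empty list, so nothing would be submitted to the agent.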
The "don't sync, own" model makes a lot of sense. We were thinking about it wrong - trying to push state out to a shared file, when the cleaner move is to pull it in on demand.
The SPEC.md as the extraction target after a feature is done is a nice touch. In our case the tricky part is that browser automation state is partly external - you have sessions, cookies, proxy assignments that live outside the codebase. So the "ground truth" we needed wasn't just about code decisions but about runtime state too. Ended up logging that separately.
Checking out agent-doc, the snapshot-based diff to avoid re-triggering on comments is clever. Does it handle cases where two agents edit the same doc around the same time, or is the ownership model strict enough that this doesn't come up?
I just can’t get over the fact that your Anglicized name sounds like manual shipper.
it is ironic
I’ve been experimenting with a similar pattern but wrapping it in a “factory mode” abstraction (we’re building this at CAS[1]), where you define the spec once after careful planning using a supervisor agent, then let it go and spin up parallel workers against it automatically. It handles task decomposition + orchestration so you’re not manually juggling tmux panes.
[1] https://cas.dev
Do parallel workers execute on the same spec? How do you ensure they don't clash with each other?
supervisor handles this. if it sees that workers can collide it spawns them in worktrees while it handles the merging and cherry-picking
do you find the merging agent to be reliable? I had a few bad merges in the past that make me nervous of just letting agents take care of it
Opus 4.6 is great at this compared to other models
Yeah the 8 agents limit aligns well with my conversations with folks in the leading labs
https://open.substack.com/pub/sluongng/p/stages-of-coding-ag...
I think we need much different toolings to go beyond 1 human - 10 agents ratio. And much much different tooling to achieve a higher ratio than that
I don't think number of parallel agents is the right productivity metric, or at least you need to account for agent efficiency.
Imagine a superhuman agent who does not need to run in endless loops. It could generate 100k line code-base in a few minutes or solve smaller features in seconds.
In a way, the inefficiency is what leads people to parallelism. There is only room for it because the agents are slow, perhaps the more inefficient and slower the individual agents are, the more parallel we can be.
Few experiments like gas town, the compiler from Anthropic or the browser from Cursor managed to reach the Rocket stage, though in their reports the jagged intelligence of the LLMs was eerily apparent. Do you think we also need better models?
I do. The reason the current generation of agents is good at coding is that the labs have had sufficient time and compute to generate synthetic chain-of-thought data and feed that data through RL before using it to train the LLMs. This distillation takes time, and the clock starts at the release of the previous generation of models.
So we are just now getting agents that can reliably loop themselves for medium-size tasks. This generation opens a new door towards agent-managing-agents chain-of-thought data. I think we will only get highly reliable multi-agent setups sometime by mid to late 2026, assuming no major geopolitical disruption.
I like how you bootstrap the agent from a single markdown file.
I built so much muscle memory from the original system, so it made sense to apply it to other projects. This was the simplest way to achieve that
These setups pretty much require the top tier subscription, right?
Even on Claude Max x1, if you run 2 agents with Opus in parallel you're going to hit limits. You can balance the model to the use case though, but I wouldn't expect it to work on any $20 plan even if you use Kimi Code.
More like 2x$200 plans.
That's a yes from my side.
Is one $200 plan sufficient to run 8x Claude Code with Opus 4.6? Or what else do you need in terms of subscriptions?
No. I run a similar setup and with a $200 subscription, I usually hit the weekly quota by around day 3-4. My approach is 4-5 hours of extreme human-in-the-loop spec sessions with Opus and Codex:
1. We discuss every question with Opus, and where even I'm not sure what the right approach is, we ask Codex for a second opinion (just a skill that teaches Claude how to call Codex).
2. When the context window reaches ~120k tokens, I ask Opus to update the relevant spec files.
3. Repeat until all three of us (me, Opus, and Codex) are happy or start discussing nitpicks and YAGNIs, whichever comes first.
Then it's fully autonomous until all agents are happy.
Which is why I'm exploring optimization strategies. Based on the analysis of where most of the tokens are spent for my workflow, roughly 40% of it is thinking tokens with "hmm not sure, maybe..", 30% is code files.
So two approaches:
1. Have a cheap supervisor agent that detects when Claude is unsure about something (which means a spec gap) and alerts me so that I can step in.
2. An "Oracle" agent that keeps relevant parts of the codebase in context and can answer questions from builder agents.
And also delegating some work to cheaper models like GLM where top performance isn't necessary.
You'll notice that as soon as you reach a setup you like that actually works, $200 subscription quotas will become a limiting factor.
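As a rough illustration of the first approach (detecting that the builder is unsure), here is a sketch that swaps the supervisor agent for a plain keyword heuristic. The phrase list, the threshold, and the function names are all assumptions, not part of the workflow described above:

```python
import re

# Hypothetical hedging phrases; tune against your own transcripts.
HEDGES = re.compile(r"\b(not sure|maybe|unclear|i think|might be|hmm)\b", re.IGNORECASE)


def uncertainty_score(thinking: str) -> float:
    """Fraction of sentences in the agent's output that contain a hedging phrase."""
    sentences = [s for s in re.split(r"[.!?\n]+", thinking) if s.strip()]
    if not sentences:
        return 0.0
    hedged = sum(1 for s in sentences if HEDGES.search(s))
    return hedged / len(sentences)


def should_alert(thinking: str, threshold: float = 0.5) -> bool:
    """Flag output for human review when enough of it is hedged (likely a spec gap)."""
    return uncertainty_score(thinking) >= threshold
```

A cheap model could replace the regex later; the point of the sketch is only that the check runs outside the expensive builder session.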
That does seem to argue for the checkpointing strategy of having the agent explain their plan and then work on it incrementally. When you run out of tokens you either switch projects until your quota recovers or you proceed by hand until the quota recovers.
I also kinda expect that one of the saner parts of agentic development is the skills system, that skills can be completely deterministic, and that after the Trough of Disillusionment people will be using skills a lot more and AI a lot less.
Yes on both counts. Implementation plan is a second layer after the spec is written, at which point, spec can't be changed by agents. I then launch a planner agent that writes a phased plan file and each builder can only work on a single phase from that file.
So it's spec (human in the loop) > plan > build. Then it cycles autonomously in plan > build until spec goals are achieved. This orchestration is all managed by a simple shell script.
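The plan > build cycle could look roughly like this. It is a sketch in Python rather than the shell script mentioned above, and `plan`, `build`, and `max_cycles` are hypothetical stand-ins for launching the planner and builder agents:

```python
from typing import Callable


def run_until_done(
    spec: str,
    plan: Callable[[str, list[str]], list[str]],  # (spec, completed phases) -> remaining phases
    build: Callable[[str], bool],                 # phase -> True if the phase succeeded
    max_cycles: int = 10,
) -> list[str]:
    """Cycle plan -> build until the planner reports no remaining phases."""
    completed: list[str] = []
    for _ in range(max_cycles):
        phases = plan(spec, completed)  # the spec is only read here, never modified
        if not phases:
            return completed            # spec goals achieved
        for phase in phases:
            # Each builder works on exactly one phase.
            if build(phase):
                completed.append(phase)
            else:
                break                   # replan from what is actually done
    raise RuntimeError("spec goals not reached within max_cycles")
```

The `max_cycles` cap is an added safeguard so a plan that never converges stops instead of burning quota indefinitely.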
But even with the implementation plan file, a new agent has to orient itself, load files it may later decide were irrelevant, the plan may have not been completely correct, there could have been gaps, initial assumptions may not hold, etc. It then starts eating tokens.
And it feels like this can be optimized further.
And yes on deterministic tooling as well.
I think you should have a reviewer as well.
I have /fd-verify which I execute with the Worker after it’s done implementing. I didn’t feel the need to have a separate window / agent for reviewing. The same Worker can review its own code. What would be the benefits of having a separate Reviewer?
ok -- I am currently quite impressed with a dedicated verifier that has large degree of freedom (very simple prompt). At least when it comes to backend work.
sorry, reviewer. Github issues used by implementer and reviewer for back-and-forth
Is there a place where people like you go to share ideas around these new ways of working, other than HN? I'm very curious how these new ways of working will develop. In my system, I use voice memos to capture thoughts and they become more or less what you have as feature designs. I notice I have a lot of ideas throughout the day (Claude chews through them some time later, and when they are worked out I review its plans in Notion; I use Notion because I can upload memos into it from my phone so it's more or less what you call the index). But ideas.. I can only capture them as they come, otherwise they are lost & I don't want to spend time typing them out.
I have only seen similar posts in HN or X. I’d be curious if there are more.