It's an interesting idea, but I feel like it's missing almost the most important thing: the context of the change itself. When I review a change, it's almost never just about the actual code changes; it's about reviewing them in the context of what was initially asked, and how they relate to that.
Your solution here seems to exclusively surface "what" changed, but it's impossible for me to know if it's right or not unless I also see the "why", first and/or together with the change itself. So the same problem remains, except instead of reviewing in git/GitHub/gerrit and then figuring out the documents/resources that lay out the task itself, I still have to switch and confirm things between the two.
I agree, that's also really important and something we're brainstorming
Currently on Stage we also generate a PR summary next to the chapters and that's where we want to do more "why" that pulls in context from Linear, etc.
And I know there are a lot of cool teams like Mesa and Entire working on embedding agent context into git history itself, so that could be an interesting area to explore as well
I assume this problem could be solved if we write up what we actually want (like a GH issue), and maybe in the future the guys at Stage could use GitHub issues as part of their PR review?
> more and more engineers are merging changes that they don't really understand
You cannot solve this problem by adding more AI on top. If lack of understanding is the problem, moving people even further away will only worsen the situation.
If I'm reviewing AI code, I don't want AI summaries. I want to be able to read the code and understand what it does. If I can't do that, the code the AI produced isn't very good. In theory, the AI's changes should come in smaller chunks, just like a real developer's would.
Quite the opposite: AI should work longer and interrupt the human less often (tokens are cheap, interruptions are expensive). So we want to push the agent's horizon toward infinity; when it does interrupt (e.g. to create a diff), the chunks will be larger and more complex, so these summaries are actually quite useful.
You can and should have both smaller chunks and a larger time horizon. The AI should output code in a format that's easy to review.
The Linux kernel submission guidelines are one thing you can feed the AI to guide that. Work is submitted as a patch set, and each patch in the set must be small and self contained. Many patch sets are over 50 patches.
I agree with that - with Stage we're not trying to replace reading code with AI summaries, but rather guiding the reviewer through reading code in the way that makes most sense and coming away with the best understanding
How do you handle the problem of AI misleading by design? Claude, for example, already lies regularly (and quite convincingly) in exactly this situation, trying to convince you that what is actually broken isn't such a big deal after all, or similar.
How can this product possibly improve the status quo of AI constantly, without end, trying to 'squeak things by' during any and all human and automated review processes? That is, you are giving the AI which already cheats like hell a massive finger on the scale to cheat harder. How does this not immediately make all related problems worse?
The bulk of difficulty in reviewing AI outputs is escaping the framing they never stop trying to apply. It's never just some code. It's always some code that is 'supposed to look like something', alongside a ton of convincing prose promising that it _really_ does do that thing and a bunch of reasons why checking the specific things that would tell you it doesn't isn't something you should do (hiding evidence, etc).
99% of the problem is that the AI already has too much control over presentation when it is motivated about the result of eval. How does giving AI more tools to frame things in a narrative form of its choice and telling you what to look at help? I'm at a loss.
The quantity of code has never been a problem. Or prose. It's that all of it is engineered to mislead / hide things in ways that require a ton of effort to detect. You can't trust it and there's no equivalent of a social cost of 'being caught bullshitting' like you have with real human coworkers. This product seems like it takes that problem and turns the dial to 11.
Thanks for sharing this, I do agree with a lot of what you said, especially around trusting what it's actually telling you
For me, I only run into problems of an agent misleading/lying to me when working on a large feature, where the agent has a strong incentive to lie and pretend the work is done. However, there doesn't seem to be the same incentive for a completely separate agent that is just generating a narrative of a pull request. Would love to hear what you think
There is no separation. Incentive propagates through LLMs with approximately 0 resistance. If the input tells a story, the output tends to that story reinforced.
The code/PR generator is heavily incentivized to spin by RL on humans - as soon as that spin comes into contact with your narrative gen context, it's cooked. Any output that has actually seen the spin is tainted and starts spinning itself. And then there's also spin originating in the narrative gen... Hence, the examples read like straight advertisements, totally contaminated, shot through with messaging like:
- this is solid, very trustworthy
- you can trust that this is reliable logic with a sensible, comprehensible design
- the patterns are great and very professional and responsible
- etc
If the narrative reads like a glow up photoshoot for the PR, something has gone extremely wrong. This is not conducive to fairly reviewing it. It is presented as way better than it actually is. Even if there are no outright lies, the whole thing is a mischaracterization.
RL is a hell of a drug.
Anyway, this is the problem with AI output. It cannot be trusted that the impression it presents is the reality, or even a best attempt at reality. You have to carefully assemble your own view of the real reality in parallel to whatever it gives you, which is a massive pain in the ass. And if you skip that, you just continually let defects/slop through.
Worst problem mucking things up is basically that RL insights that work on people also work on AI, because the AI is modelling human language patterns. Reviewing slop sucks because it's filled with (working) exploits against humans. And AI cannot help because it is immediately subverted. So I guess it requires finding a way to strip out the exploits without changing mechanical details. But hard, because it saturates 100% of output at many levels of abstraction including the mechanical details.
Looks kind of neat, like a devon.ai review / reviewstack crossover. But as I tell every one of the dozens of projects trying to make a commercial review tool: I would rather spend a week vibe-copying this than onboarding a tool I have to pay for and am at the mercy of whoever made it. It's just over for selling SaaS tools like this. For agents, I also need this local, not on someone's cloud. It's just a matter of time until someone does it.
Thanks for the feedback! re: local vs cloud, I do think there is cool work to be done around unifying the writing/reviewing experience locally, but we started with cloud because we designed this as a collaborative product with teams in mind
(I find) the right way to read a PR can differ a lot from project to project. It's not just about context, or syntax, or workflow...
sometimes the best entry point is the PR description or an external ticket. sometimes you need to read the code first to understand the reasoning behind the changes. sometimes the diff itself is fine, but you have to go back several PRs to see how the codebase got into its current state.
I guess, like everyone said here, there's no right way to do it.
So, when I code review, I have a super simple Cursor command that "orients" me in the PR:
* where does the change sit from a user perspective?
* what are the bookends of the scope?
* how big is the PR?
* etc.
Once I'm "in" and understand what it does, I pepper the AI with questions:
* Why did the author do this?
* I don't understand this?
* This looks funky, can you have a look?
* etc.
The more questions I ask, the more the AI will (essentially) go "oh, I didn't think of that, in fact, looks like the issue was way more serious than I first thought, let me investigate". The more I ask, the more issues AI finds, the more issues AI finds, the more issues I find. There's no shortcuts to quality control -- the human drives the process, AI is merely (and I hate to use this term but I will) a...force multiplier.
Maybe I'm missing something obvious, but if I was going to have my team use this, I'd want someone to answer the following question
If AI is good enough to explain what the change is and call out what to focus on in the review, then why isn't AI good enough to just do the review itself?
I understand that the goal of this is to ensure there's still a human in the review cycle, but the problem I see is that suggestions will quickly turn into todo lists. Devs will read the summary, look at the what to review section, and stop reviewing code outside of things called out in the what to focus on section. If that's true, it means customers need to be able to trust that the AI has enough context to generate accurate summaries and suggestions. If the AI is able to generate accurate summaries and suggestions, then why can't we trust it to just do the review itself?
I'm not saying that to shit on the product, because I do get the logic behind it, but I think that's a question you should have a prepared answer for since I feel like I can't be the only one thinking that.
No worries at all, that's a very fair point and a question we've gotten a lot!
I think our perspective is this: software design has always had a subjective element to it. There's never been a "right" way to design a system; there are always trade-offs to be made that depend on things like business context, etc.
To that extent, most engineers probably still want to be part of that decision making process and not just let agents make all the high level decisions, especially if they're responsible for the code that ultimately gets merged
One thing that comes to mind is that an AI might see the code and say "Yeah, this should compile / no obvious runtime errors", but the AI doesn't have the context to know your team's coding standards (every team has different standards). That said, there are ways to feed that context to the AI, but that still risks hallucinations, etc.
Most of the human review I see of AI code is rubber-stamping at this point; the volume is too big for humans to keep up. What used to take developers a few days now takes a few hours, so PR volume is higher and human reviewing can't keep up. At this point, human review seems like CYA more than anything else: "Why yes, SOC2 auditor, we review all PRs."
I'm also seeing a lot more outages, but management is bouncing around all happy about the feature velocity they are shipping, so :shrug:
My pain points with PRs where people vibe coded something is a bit different though:
- I'd like to get an idea how they prompted and developed the PR.
- I want to see if for example they just took everything the AI gave them or if they interacted with it critically
- I want to see some convincing proof that they tested it, e.g. manually. I.e. along the lines of what Simon describes here: https://simonwillison.net/2025/Dec/18/code-proven-to-work/
- I want to see an AI doing a review as well
Totally different part of the reviewing experience, but I would love to see PR comments (or any revisions really) be automatically synced back to the context coding agents have about a codebase or engineer. There’s no reason nowadays for an engineer or a team of engineers to make the same code quality mistake twice. We manually maintain our agents.md with codebase conventions, etc, but it’d be great not to have to do that.
100%. A big part of code review in my mind is to automate away specific mistakes and anti-patterns across a team. I think there are a lot of interesting things to be done to merge the code writing and code reviewing cycles.
It keeps a repository with markdown files as the agent context, makes those available (via a simple search-and-summarise MCP), and when closing a merge request it checks whether the context needs updating based on the review comments. If it does, a PR is opened on the context repository with suggested changes/additions.
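A toy sketch of the "does the context need updating?" check at merge time (the function name and the keyword-overlap heuristic are invented for illustration; the real tool presumably uses an LLM to make this call):

```python
def needs_context_update(review_comments, context_docs):
    """Flag review comments whose topics no context doc mentions.

    review_comments: list of comment strings from the merge request.
    context_docs: {path: markdown text} from the context repository.
    Returns the comments that look like new, undocumented conventions.
    """
    corpus = " ".join(context_docs.values()).lower()
    uncovered = []
    for comment in review_comments:
        # Crude stand-in heuristic: if none of a comment's longer words
        # appear anywhere in the existing context, it probably encodes a
        # convention the context repo doesn't cover yet.
        terms = [w for w in comment.lower().split() if len(w) > 5]
        if terms and not any(t in corpus for t in terms):
            uncovered.append(comment)
    return uncovered

docs = {"conventions.md": "use snake_case everywhere in Python code"}
comments = ["please use snake_case here", "avoid global singletons"]
print(needs_context_update(comments, docs))
```

Anything flagged here would become a suggested addition in the PR against the context repository.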
> Stage automatically analyzes the diff, clusters related changes, and generates chapters.
Isn't that what commits are for? I see no reason for adding this as an after-thought. If the committers (whether human or LLM) are well-behaved, this info is already available in the PR.
In our experience, it's difficult to create well-mannered commits as you code and new ideas pop into your head or you iterate on different designs (even for LLMs). One concept we toyed around with was telling an LLM to re-do a branch using "perfect commits" right before putting up a PR. But even then you might discover new edge cases and have to tack them on as additional commits.
We thought git wasn't the right level of abstraction and decided to tackle things at the PR level instead. Curious to hear your experiences!
> In our experience, it's difficult to create well-mannered commits
Sure, it is. But it's worth it, not just for code review, but for a myriad other things: bisect, blame, log, etc.
Your tool makes one thing (the code review) easier, while decreasing people's motivation to make well-mannered commits, thus making everything else (bisect etc) worse.
I'm sure it's net positive in some cases, and I think it's net negative in other cases.
> We thought git wasn't the right level of abstraction and decided to tackle things at the PR level instead. Curious to hear your experiences!
The frick is a PR abstraction? Is this a GitHub PR abstraction where the commits are squashed and the PR description is whatever was hallucinated at 5 am? Yes, that’s certainly an abstraction, aka loss of information.
You either have the information stored in the version control database or you don’t. You can curate and digest information but once it’s lost it’s lost.
People layering stuff on top of Git or Subversion makes no sense. Your AI is not so dainty and weak that it cannot write a commit message. And if it can’t then you can recuperate the information that you trashed.
I concur. I cannot accept that we are so disconnected from what we're building that we can't go back and revise our commits or something else to make it make sense.
I was actually recently thinking about a similar idea. I'm someone who started coding post-LLMs and has a basic technical understanding. I know what loops, variables, APIs, backends, bla bla are. I've learned a bunch more since then, but I'm not capable of making decisions based on a git diff alone. And I want to be. I want to because I think increasing my skills is still super important, even in the AI era. The models are getting better, but are still limited by their core design -- for now it does not seem like they will replace humans.
So getting assistance in the review, in making the decisions and giving me more clarity feels interesting.
Maybe it's people like me, who got involved in coding after LLMs, who might be your niche.
One thing I don't understand: the UI/UX. Is this visible only on git itself? Or can I get it working in Codex?
We've wondered about what the review experience should look like for newly technical or non-technical people now that they are increasingly putting up PRs themselves. These people will be less opinionated about certain technical decisions in general so maybe the future looks like review processes very personalized to your experience level and your background. Definitely a lot to think about
Right now the chapters UI is only available on our website but we're exploring possible integrations and/or a desktop app
Why is this a service and not an open source project? It doesn't seem to do much other than organize your commits within a PR (could be run once on a dev machine and shipped in the code, then displayed separately) and builds a dashboard for PRs that's not too far off from what github already offers, but could also be represented with fairly small structured data and displayed separately.
This is for AI agent work though. That's cool, but not every team that wants better UX for complex work uses agents. Even if it "just works" for real scenarios, the marketing could be better.
Fair. There are users who simply just use the diff and integrated GitHub view/comment/approval-sync experience for local reviews of PRs. But it's _marketed_ as an integrated agent experience.
"Building" is always easier when you have a community that is ready and able to rout out bugs and suggest new features. Closed source makes that much less practical and appealing for most.
Totally get that, still something we're actively talking about!
Sort of related to that, we've been thinking a lot about the future of code review for OSS. It's clear with Cal.com going closed source that something needs to change. Would love to hear any thoughts you have
Interesting app. I have a weird bug on the homepage: when I tab between the chapters, it lags a bit and then doesn't actually proceed to the next chapter until I press again
Thanks! I think we're really focused on making the overall review experience as guided and obvious as possible for the human. Chapters is a great start but we're coming up with more ideas on how we can make the process even easier
Yeah, but we're a small company and sometimes cut corners to move faster, so if a tool can solve this instead of potentially adding more friction to other engineers I'm all for it.
The idea of a workplace where people can't be bothered to read what the AI is coding, but someone else is expected to read it and judge whether it's good or slop, just doesn't really add up.
I personally see the value of code review, but I promise you the most vocal vibe coders I work with don't at all, and honestly even to me it feels like something that could just be automated.
The age of someone gatekeeping the codebase and pushing their personal coding-style foibles on the rest of the team via reviews doesn't feel like something that will exist anymore if your CEO is big on vibe coding.
Agree that agents are definitely handling more and more of the coding side, and there's almost no doubt they will produce less slop over time.
In our view, even vibe coders should understand how the codebase works, and we think review is a natural place to pause and make sure you know what you and your coworkers are shipping. And we should have tools to reduce the mental load as much as possible.
Do you think there's a problem of cognitive debt among your coworkers who aren't reading the code or reviewing PRs?
Looks amazing. I've been trying different stacking PR tools and Graphite and this looks to be the most human-centric so far. I'll have a shot at using this within our team soon. Congrats on the launch!
This is really cool and we definitely have this problem as well. I really like the flowchart deciding on where to put each learning. Will have to try it out!
Do you find that this list of learnings that ends up in BUGBOT.md or LESSONS.md ever gets too long? Or does it do a good job of deduplicating redundant learnings?
The much more interesting part is how exactly you map Context/Why/Verify to a product spec / acceptance criteria.
And I already posted how to do this. SCIP indexes from product spec -> ACs -> E2E tests -> Evidence Artifacts -> Review (approve/reject, reason) -> if all green, then we make a commit that has #context + #why + #verify (I believe this just points to the E2E specs that belong to this AC)
What I'm trying to visualize is exactly where the cognitive bottleneck happens. So far I've identified three edges:
1. Spec <-> AC (User can shorten URL -> which ACs make this happen?)
2. AC <-> Plan ("POST /urls/new must create a new DB record and respond with 200") -> how exactly must this code look?
3. Plan/Execute/Verify -> given this E2E test, how can I verify the test is doing what the AC assumes?
The cognitive bottleneck is when we're transforming artifacts:
- Real-world requirements (the user wants to use a browser) -> Spec (what exactly matters?)
- Spec -> AC (which exact scenarios are we supporting?)
And you can see that at every step we are "compressing" something ambiguous into something deterministic. That's exactly what goes on in an engineer's head. And so the tooling I'm going to release soon is aimed exactly at eliminating the parts we spend the most time on: "figuring out how this file connects to the spec in my head, the one I built from poorly described commit messages, outdated documents, Slack threads from 2016, and that guy who seemingly knew everything before he left the company".
I've been stuck with this problem for 10+ years now. I'm tired. I'm exhausted.
Every single time I come to a new company, nobody writes a spec, and everyone expects you to magically understand how the app works just by reading legacy artifacts.
But in my experience, engineers who write quality artifacts are rare. Usually it's Staff+ engineers who understand the importance of proper context.
So I want to show the world it's not that hard to do, and while doing so I believe I can automate at least 50% of the pain everyone has while "reverse-engineering specs/ACs from code".
The initial thought I had: "what if, when I edit a file, I can immediately tell which spec is going to blow up?" That led me to SCIP indexes. OK cool, but how do you actually connect a spec to the codebase?
I was stuck on this one for a long time.
Then I figured: OK, this can just be routes! Rails routes. Django routes. Whatever. The entry point to your application is the edge between the business logic and your client-side app.
1. /auth -> Spec that explains how auth works
2. POST /urls/new + GET /urls -> Spec that explains how to create new URLs
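A minimal sketch of such a route-to-spec mapping (the routes, spec paths, and function name here are all hypothetical):

```python
# Toy route -> spec index: map each application entry point to the spec
# document that describes it. Routes and spec paths are made up.
ROUTE_SPECS = {
    "POST /auth":     "specs/auth.md",
    "POST /urls/new": "specs/url-shortening.md",
    "GET /urls":      "specs/url-shortening.md",
}

def specs_for_routes(routes):
    """Return the distinct spec documents covering the given routes."""
    return sorted({ROUTE_SPECS[r] for r in routes if r in ROUTE_SPECS})

print(specs_for_routes(["POST /urls/new", "GET /urls"]))
```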
OK cool... what next? A spec captures the real world. It is lossy compression. And it is still not enough to go from spec -> code directly.
A spec is implemented with Acceptance Criteria.
What is an AC?
It's an E2E test with the concrete steps a user takes to show "this AC works".
OK cool. How do I verify that an AC works?
Evidence Artifacts.
When you hit `POST /urls/new`, what do you expect to happen?
1. `insert into urls values (...)`
2. HTTP status 200
3. UI message "URL created"
4. ... whatever else you care about
Cool! Now what? How does that help me understand the code I'm looking at?
Simple. One AC maps to the functions/classes/symbols used during its execution (ever heard of traces?). So when I open a file, I see "okay, this file is used in AC1, AC21, AC8912, and so on". If I touch this file, I have to check those ACs. How do I check them? I run the E2E tests they point to. And I read the ACs (English), and I verify what is going on in the system (code + artifacts).
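A toy sketch of that AC-to-file mapping and its inverse (the AC IDs and file paths are made up; a real index would be generated from execution traces or SCIP data):

```python
# Toy trace index: which source files each AC exercises during its
# E2E run. AC IDs and paths are hypothetical.
AC_FILES = {
    "AC1":    ["app/urls/create.py", "app/db/models.py"],
    "AC21":   ["app/urls/create.py"],
    "AC8912": ["app/auth/session.py"],
}

# Invert it: for each file, the ACs that must be re-checked when the
# file changes.
FILE_ACS = {}
for ac, files in AC_FILES.items():
    for path in files:
        FILE_ACS.setdefault(path, set()).add(ac)

print(sorted(FILE_ACS["app/urls/create.py"]))
```

Opening `app/urls/create.py` would then surface AC1 and AC21 as the criteria to re-verify.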
So far so good?
Okay, next. Let's say we reviewed that this new feature added one AC and modified another. All good, we mark it green. How do I commit it? Well, you go and describe exactly what the context behind this AC was (the spec we're touching, real-world context, anything that helps an engineer understand the situation we're in), why we made this change (because... business! Or maybe because we made an incorrect assumption and a bug appeared, or we're just doing refactoring/tech-debt removal), and how to test it (the verify section, which could just be pointers to the ACs/tests we were running).
If I haven't lost you yet, here's final part.
You can build this schema without EVER integrating it into the codebase. It can live on your disk. You just work normally on your project, but once you realize "okay, this needs to work according to the spec...", you no longer have to store it in your brain; you just run a Rust CLI script to create a new spec object and link the files that are important. And when you are doing code review, you just run yet another CLI tool, `git diff | blast-radius`, which tells you which specs/ACs are affected and what you should be testing. Keeping the spec/AC mapping to the codebase is simple: when something changes, you update the index; new files must go under a spec, deleted files are removed from the SCIP index, and edited files are re-tested to see whether they still belong to the same spec or the maps need updating.
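The review-time lookup could be sketched like this (a toy Python stand-in for the described Rust CLI; the index contents and names are hypothetical):

```python
def blast_radius(changed_files, file_to_acs, ac_to_spec):
    """Given changed files, return the affected ACs and their specs."""
    acs = set()
    for path in changed_files:
        acs |= file_to_acs.get(path, set())
    specs = {ac_to_spec[ac] for ac in acs}
    return sorted(acs), sorted(specs)

# Hypothetical index data, as would be loaded from the on-disk schema:
FILE_TO_ACS = {"app/urls/create.py": {"AC1", "AC21"}}
AC_TO_SPEC = {"AC1": "specs/url-shortening.md",
              "AC21": "specs/url-shortening.md"}

# The changed-file list would come from e.g. `git diff --name-only`:
acs, specs = blast_radius(["app/urls/create.py"], FILE_TO_ACS, AC_TO_SPEC)
print(acs, specs)
```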
I believe this already provides value even if nobody else is using it. I'm giving away a structure for your spec/AC/E2E/evidence/review artifacts. For free.
I just want proper feedback on this. What do you think?
edit: I will make a proper post on HN after I build the MVP, with instructions on how to use it and example workflows (greenfield project / legacy project with 10-500k LoC); the code will be Apache/MIT, whatever. And later maybe I'll try to build something like Stage, with a nice UI and so on. But I want to solve this problem first, which is "teaching people to leave proper work artifacts and connecting their work from requirements to code, all the way through".
If you've worked from a plan, it's trivial. I've got a setup where the agent reads the implementation plan, then creates a commit history based on intent rather than location.
Exactly. "Why was this change made"? "What were the options"? "Why this is a good way of doing it"? "What are the subtle things I came across while making this change"?
There isn't one. Most of the time you would pair-review a PR with the human who wrote it, and they could explain that. They can't anymore, since 9 times out of 10 they didn't think through those things.
Chapters are regenerated every time a new commit is pushed to a PR. Our thinking is that the chapters should serve as "auto stacked diffs" since they should follow a logical order.
Do you or your team use stacking in your workflows?
It's possible, but at the same time it's been years and they haven't copied things like Graphite's dashboard or stacked PR interface yet. We have the advantage of speed :)
Yeah it's getting easier to have agents just add "one more thing" to a PR, and I think there is still an aspect of human engineering judgement to know when to break up PRs versus trusting AI/tools to keep velocity high.
In an ideal world, each PR is as small and self-contained as possible, but we've noticed people struggling to justify the extra overhead every time.
Now that we are all eating Soylent it can get a little bland sometimes. That’s why we are releasing our international, curated spice package for your Soylent...
I’ve built this into a CLI TUI. It passes the whole diff to Claude Code with a schema and gets a structured narrative back out. Works really well for understanding.
Reconstituting messy things is exactly where LLMs can help.
Thanks! Yeah we believe strongly that humans need to be in the code review loop to some extent
I think one thing we've seen from early users that surprised us is how chapters quickly became the unit of review for them, as opposed to files - and they've asked us to add functionality to mark chapters as viewed and comment on them as a whole
Another big surprise: now that agents are writing most (if not all) of the code, we've found that a lot of early users are using Stage not only to review others' PRs but also their own, before having others review them
> I assume this problem could be solved if we write up what we actually want (like a GH issue) and maybe in the future the guys at Stage could use GitHub issues as part of their PR review?
Yep! Or Linear, etc. It could also be something like git-ai, which captures agent context in git commits
> You cannot solve this problem by adding more AI on top.
I agree, and that's why we're not building a code review bot which aims to take humans out of the loop
We don't think of Stage as moving people further away from code review, but rather using AI to guide human attention through the review process itself
Nobody thought of the other stages as that either. It still happened.
AI guiding human attention means that humans aren't guiding human attention, which means less human understanding of their reviews.
That leaves no solution when the quantity becomes more than any human can review.
This is like complaining that someone doesn't have a solution for the foot injuries caused by repeatedly shooting yourself in the foot.
If your team is shooting each other's feet and you can't stop them, I guess this would be a foot to air interceptor for some of the bullets.
The number of solutions remains constant, because the OP isn't providing a working solution.
If I'm reviewing AI code, I don't want AI summaries. I want to be able to read the code and understand what it does. If I can't do that, the code the AI output isn't very good. In theory, your AI changes should be smaller chunks just like a real developer would do.
Quite the opposite, AI should work longer and interrupt the human less often (Tokens are cheap, interruptions are expensive). So we want to push the agent's horizon to infinite, now when they do make interruptions (e.g. Creating diff) it will be larger chunks and more complex, so these summaries are actually quite useful.
You can and should have both smaller chunks and a larger time horizon. The AI should output code in a format that's easy to review.
The Linux kernel submission guidelines are one thing you can feed the AI to guide that. Work is submitted as a patch set, and each patch in the set must be small and self contained. Many patch sets are over 50 patches.
I agree with that - with Stage we're not trying to replace reading code with AI summaries, but rather guiding the reviewer through reading code in the way that makes most sense and coming away with the best understanding
How do you handle the problem of AI misleading by design? For example, Claude already lies on a regular basis specifically (and quite convincingly) in this case, in attempts to convince that what is actually broken isn't such a big deal after all or similar.
How can this product possibly improve the status quo of AI constantly, without end, trying to 'squeak things by' during any and all human and automated review processes? That is, you are giving the AI which already cheats like hell a massive finger on the scale to cheat harder. How does this not immediately make all related problems worse?
The bulk of difficulty in reviewing AI outputs is escaping the framing they never stop trying to apply. It's never just some code. It's always some code that is 'supposed to look like something', alongside a ton of convincing prose promising that it _really_ does do that thing and a bunch of reasons why checking the specific things that would tell you it doesn't isn't something you should do (hiding evidence, etc).
99% of the problem is that the AI already has too much control over presentation when it has a stake in the outcome of the evaluation. How does giving AI more tools to frame things in a narrative form of its choice, and to tell you what to look at, help? I'm at a loss.
The quantity of code has never been a problem. Or prose. It's that all of it is engineered to mislead / hide things in ways that require a ton of effort to detect. You can't trust it and there's no equivalent of a social cost of 'being caught bullshitting' like you have with real human coworkers. This product seems like it takes that problem and turns the dial to 11.
Thanks for sharing this. I do agree with a lot of what you said, especially around trusting what it's actually telling you.
For me, I only run into problems of an agent misleading/lying to me when working on a large feature, where the agent has strong incentive to lie and pretend like the work is done. However, there doesn't seem to be this same incentive for a completely separate agent that is just generating a narrative of a pull request. Would love to hear what you think
There is no separation. Incentive propagates through LLMs with approximately 0 resistance. If the input tells a story, the output tends to that story reinforced.
The code/PR generator is heavily incentivized to spin by RL on humans - as soon as that spin comes into contact with your narrative gen context, it's cooked. Any output that has actually seen the spin is tainted and starts spinning itself. And then there's also spin originating in the narrative gen... Hence, the examples read like straight advertisements, totally contaminated, shot through with messaging like:
- this is solid, very trustworthy
- you can trust that this is reliable logic with a sensible, comprehensible design
- the patterns are great and very professional and responsible
- etc
If the narrative reads like a glow up photoshoot for the PR, something has gone extremely wrong. This is not conducive to fairly reviewing it. It is presented as way better than it actually is. Even if there are no outright lies, the whole thing is a mischaracterization.
RL is a hell of a drug.
Anyway, this is the problem of AI output. It cannot be trusted that the impression it presents is the reality, or even a best attempt at reality. You have to carefully assemble your own view of the real reality in parallel to whatever it gives you, which is a massive pain in the ass. And if you skip that, you just continually let defects/slop through.
The worst problem mucking things up is basically that RL insights that work on people also work on AI, because the AI is modeling human language patterns. Reviewing slop sucks because it's filled with (working) exploits against humans. And AI cannot help because it is immediately subverted. So I guess it requires finding a way to strip out the exploits without changing mechanical details. But that's hard, because it saturates 100% of the output at many levels of abstraction, including the mechanical details.
But how do you know they’re not lying to you? What are your benchmarks for this? Experience? Anecdote? Data?
And I’m asking you in good faith - not trying to argue.
I’m thinking about these types of questions on a daily basis, and I love to see others thinking about them too.
Looks kind of neat, like a devon.ai review / reviewstack crossover. But as I tell every one of the dozens of projects trying to make a commercial review tool: I would rather spend a week vibe-copying this than onboarding a tool I have to pay for and am at the mercy of whoever made it. It's just over for selling SaaS tools like this. For agents I also need this local, not on someone's cloud. It's just a matter of time until someone does it.
Thanks for the feedback! Re: local vs cloud, I do think there is cool work to be done around unifying the writing/reviewing experience locally, but we started with cloud because we designed this as a collaborative product with teams in mind.
(i find) the right way to read a PR can differ a lot from project to project. it's not just about context, or syntax, or workflow...
sometimes the best entry point is the PR description or an external ticket. sometimes you need to read the code first to understand the reasoning behind the changes. sometimes the diff itself is fine, but you have to go back several PRs to see how the codebase got into its current state.
i guess like everyone said here, there's no right way to do it.
but i enjoy the video and the project, kudos ;)
So, when I code review, I have a super simple Cursor command that "orients" me in the PR:
* where does the change sit from a user perspective?
* what are the bookends of the scope?
* how big is the PR?
* etc.
Once I'm "in" and understand what it does, I pepper the AI with questions:
* Why did the author do this?
* I don't understand this.
* This looks funky, can you have a look?
* etc.
The more questions I ask, the more the AI will (essentially) go "oh, I didn't think of that, in fact, looks like the issue was way more serious than I first thought, let me investigate". The more I ask, the more issues AI finds, the more issues AI finds, the more issues I find. There's no shortcuts to quality control -- the human drives the process, AI is merely (and I hate to use this term but I will) a...force multiplier.
Maybe I'm missing something obvious, but if I was going to have my team use this, I'd want someone to answer the following question
If AI is good enough to explain what the change is and call out what to focus on in the review, then why isn't AI good enough to just do the review itself?
I understand that the goal of this is to ensure there's still a human in the review cycle, but the problem I see is that suggestions will quickly turn into todo lists. Devs will read the summary, look at the what to review section, and stop reviewing code outside of things called out in the what to focus on section. If that's true, it means customers need to be able to trust that the AI has enough context to generate accurate summaries and suggestions. If the AI is able to generate accurate summaries and suggestions, then why can't we trust it to just do the review itself?
I'm not saying that to shit on the product, because I do get the logic behind it, but I think that's a question you should have a prepared answer for since I feel like I can't be the only one thinking that.
Imo human review is important for context/knowledge sharing even if a machine or tool can mechanically determine the change is reasonable
Yep, for me personally, code review was the most effective way for me to get up to speed when joining a new engineering team
No worries at all, that's a very fair point and a question we've gotten a lot!
I think our perspective is that: software design has always had a subjective element to it. There's never been a "right" way to design a system, there are always trade offs that have to be made that depend on things like business context etc.
To that extent, most engineers probably still want to be part of that decision making process and not just let agents make all the high level decisions, especially if they're responsible for the code that ultimately gets merged
One thing that comes to mind is that an AI might see the code and say "Yeah, this should compile / no obvious runtime errors", but the AI doesn't have the context to know your team's coding standards (every team has different standards). That said, there are ways to feed that context to the AI, but you still risk hallucinations, etc.
I mean, that's likely where it's going.
Most of the human review I see of AI code is rubber stamping at this point; the volume is too big for humans to keep up. What used to take developers a few days now takes a few hours, so PR volume is higher and human reviewing can't keep up. At this point, human review seems like more CYA than anything else: "Why yes, SOC2 auditor, we review all PRs."
I'm also seeing a lot more outages as well but management is bouncing around all happy about feature velocity they are shipping so :shrug:
Haven't tried it yet, but it looks neat!
My pain points with PRs where people vibe-coded something are a bit different though:
- I'd like to get an idea of how they prompted and developed the PR.
- I want to see if, for example, they just took everything the AI gave them or if they interacted with it critically.
- I want to see some convincing proof that they tested it, e.g. manually, along the lines of what Simon describes here: https://simonwillison.net/2025/Dec/18/code-proven-to-work/
- I want to see an AI doing a review as well.
Totally different part of the reviewing experience, but I would love to see PR comments (or any revisions really) be automatically synced back to the context coding agents have about a codebase or engineer. There’s no reason nowadays for an engineer or a team of engineers to make the same code quality mistake twice. We manually maintain our agents.md with codebase conventions, etc, but it’d be great not to have to do that.
100%. A big part of code review in my mind is to automate away specific mistakes and anti-patterns across a team. I think there are a lot of interesting things to be done to merge the code writing and code reviewing cycles.
I've been working on that as a small open source tool: https://github.com/smithy-ai/smithy-ai
It keeps a repository with markdown files as the agent context, makes those available (via a simple search and summarise MCP) and when closing a merge request it checks whether the context needs updating based on the review comments. If it needs updating a PR is opened on the context repository with suggested changes/additions.
> Stage automatically analyzes the diff, clusters related changes, and generates chapters.
Isn't that what commits are for? I see no reason for adding this as an after-thought. If the committers (whether human or LLM) are well-behaved, this info is already available in the PR.
In our experience, it's difficult to create well-mannered commits as you code and new ideas pop into your head or you iterate on different designs (even for LLMs). One concept we toyed around with was telling an LLM to re-do a branch using "perfect commits" right before putting up a PR. But even then you might discover new edge cases and have to tack them on as additional commits.
We thought git wasn't the right level of abstraction and decided to tackle things at the PR level instead. Curious to hear your experiences!
> In our experience, it's difficult to create well-mannered commits
Sure, it is. But it's worth it, not just for code review, but for a myriad other things: bisect, blame, log, etc.
Your tool makes one thing (the code review) easier, while decreasing people's motivation to make well-mannered commits, thus making everything else (bisect etc) worse.
I'm sure it's net positive in some cases, and I think it's net negative in other cases.
> But even then you might discover new edge cases and have to tack them on as additional commits.
Have you heard about `rebase -i` ?
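For instance, the todo list that `git rebase -i main` opens lets you reorder, squash, and reword commits into topic-grouped patches after the fact (the hashes and messages below are made up for illustration):

```
pick a1b2c3d Add URL shortening endpoint
squash d4e5f6a Fix typo in endpoint handler
pick 9f8e7d6 Add migration for urls table
reword 5c4b3a2 Tests
```

Reordering the `pick` lines regroups the history; `squash` folds a fixup into the previous commit, and `reword` stops to let you write a proper message.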
> We thought git wasn't the right level of abstraction and decided to tackle things at the PR level instead. Curious to hear your experiences!
The frick is a PR abstraction? Is this a GitHub PR abstraction where the commits are squashed and the PR description is whatever was hallucinated at 5 am? Yes, that’s certainly an abstraction, aka loss of information.
You either have the information stored in the version control database or you don’t. You can curate and digest information but once it’s lost it’s lost.
People layering stuff on top of Git or Subversion makes no sense. Your AI is not so dainty and weak that it cannot write a commit message. And if it can’t then you can recuperate the information that you trashed.
I feel that grouping related changes into commits can be challenging, as git really presents commits as groupings in time, not topic.
It is certainly possible to do topic-grouping in commits, but it requires significant effort to get that consistent at a team level.
I concur. I cannot accept that we are so disconnected from what we're building that we can't go back and revise our commits or something else to make it make sense.
No pricing page, you've lost my interest. Doesn't matter that there is an obscured quote on the front page. Be up front about the costs.
Totally fair, we're working on it!
I was actually recently thinking about a similar idea. I am someone who started coding post-LLMs and have a basic technical understanding. I know what loops, variables, APIs, backends bla bla are. I've learned a bunch more since then, but I am not capable of making decisions based on a git diff alone. And I want to be. I want to because I think increasing my skills is still super important, even in the AI era. The models are getting better, but are still limited by their core design -- for now it does not seem like they will replace humans.
So getting assistance in the review, in making the decisions and giving me more clarity feels interesting.
Maybe it's people like me, who became involved in coding after LLMs, who might be your niche.
One thing I don't understand: the UI/UX. Is this visible only on git itself? Or can I get it working in Codex?
Yeah this is a really interesting perspective!
We've wondered about what the review experience should look like for newly technical or non-technical people now that they are increasingly putting up PRs themselves. These people will be less opinionated about certain technical decisions in general so maybe the future looks like review processes very personalized to your experience level and your background. Definitely a lot to think about
Right now the chapters UI is only available on our website but we're exploring possible integrations and/or a desktop app
Why is this a service and not an open source project? It doesn't seem to do much other than organize your commits within a PR (could be run once on a dev machine and shipped in the code, then displayed separately) and builds a dashboard for PRs that's not too far off from what github already offers, but could also be represented with fairly small structured data and displayed separately.
Plannotator, open source runs locally, has code review: https://github.com/backnotprop/plannotator
and a code tour feature about to ship: https://x.com/backnotprop/status/2043759492744270027/video/1
- integrated comment feedback for agents
- inline chat
- integrated AI review (uses codex and claude code defaults)
Stage's (OP's product) navigation tour is nice UX, about a day's worth of work in addition to the incoming code tour.
this is for AI agent work though. That's cool, but not every team that wants better UX for complex work uses agents. Even if it "just works" for real scenarios, the marketing could be better.
Fair. There are users who simply just use the diff and integrated GitHub view/comment/approval-sync experience for local reviews of PRs. But it's _marketed_ as an integrated agent experience.
Open source is something we're thinking about! We've just been focused on building for now but its definitely not off the table
"Building" is always easier when you have a community that is ready and able to rout out bugs and suggest new features. Closed source makes that much less practical and appealing for most.
Totally get that, still something we're actively talking about!
Sort of related to that, we've been thinking a lot about the future of code review for OSS. It's clear with Cal.com going closed source that something needs to change. Would love to hear any thoughts you have.
Cal.com going closed source was, without a doubt, shortsighted and unwise. I would recommend the blog post from the maintainers of Discourse on this.
Translation: we're hoping for an acqui-hire from some rich company, and will opensource this thing if it flops.
Pretty neat, sick of trying to digest 100 Devin comments at once!
Interesting app, I have a weird bug I'm seeing with the homepage, when I tab between the chapters, it lags a bit then doesn't actually proceed to the next chapter until I press again
Sorry to hear that! Looking into it
This is a really cool idea but where's the moat? What's stopping someone from replicating the functionality?
Thanks! I think we're really focused on making the overall review experience as guided and obvious as possible for the human. Chapters is a great start but we're coming up with more ideas on how we can make the process even easier
I like the chapters thing; a lot of the PRs I review should really be like 5 PRs, so it's nice to have it auto-split like that.
Do you see a world where it splits them up on the git level?
> a lot of PRs I review should really be like 5 prs
Can't you push back on that? I feel like this tool is trying to fix misbehaved colleagues...
Yeah, but we're a small company and sometimes cut corners to move faster, so if a tool can solve this instead of potentially adding more friction to other engineers I'm all for it.
Yeah that could be useful, especially with the increased popularity of stacked PRs
But I see it working together with chapters, not instead of, because it's still good to see the granularity within a PR.
The idea of a workplace where people can't be bothered to read what the AI is coding, but someone else is expected to read and understand whether it's good or slop, just doesn't really add up.
I personally see the value of code review but I promise you the most vocal vibe coders I work with don’t at all and really it feels like something that could be just automated to even me.
The age of someone gatekeeping the codebase and pushing their personal coding style foibles on the rest of the team via reviews doesn't feel like something that will exist anymore if your CEO is big on vibe coding.
Agree that agents are definitely handling more and more of the coding side, and there's almost no doubt they will get better, slop-wise.
In our view, even vibe coders should understand how the codebase works, and we think review is a natural place to pause and make sure you know what you and your coworkers are shipping. And we should have tools to reduce the mental load as much as possible.
Do you think there's a problem of cognitive debt among your coworkers who aren't reading the code or reviewing PRs?
Looks amazing. I've been trying different stacking PR tools and Graphite and this looks to be the most human-centric so far. I'll have a shot at using this within our team soon. Congrats on the launch!
Thank you! Let us know any ways we can make it better
We have the same problem, and I came up with this:
https://sscarduzio.github.io/pr-war-stories/
Basically it’s distilling knowledge from pr reviews back into Bugbot fine tuning and CLAUDE.md
So the automatic review catches more, and code assistant produces more aligned code.
This is really cool and we definitely have this problem as well. I really like the flowchart deciding on where to put each learning. Will have to try it out!
Do you find that this list of learnings that end up BUGBOT.md or LESSONS.md ever gets too long? Or does it do a good job of deduplicating redundant learnings?
Thanks! We have ~1000 PRs/year. There are far fewer seniors than juniors, and a lot of knowledge is transferred via PR messages.
The deduplication and generalisation steps really help, and the extra Bugbot context ends up at just about 2,000 tokens.
Global LESSONS.md has less than 20 “pearls” with brief examples
Nice! Will try it out
This is mostly solved just by writing proper commit messages: https://blog.br11k.dev/2026-03-23-code-review-bottleneck-par...
The much more interesting part is how exactly you map Context/Why/Verify to a product spec / acceptance criteria.
And I already posted how to do this. SCIP indexes from product spec -> ACs -> E2E tests -> Evidence Artifacts -> Review (approve/reject, reason) -> if all green then we make a commit that has #context + #why + #verify (I believe this is just pointers to the E2E specs that belong to this AC).
Here's full schema: https://tinyurl.com/4p43v2t2 (-> https://mermaid.ai/live/edit)
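As a sketch, a commit message carrying those three sections might look like this (the spec/AC names and file paths are invented for illustration):

```
Add expiry handling for shortened URLs

#context: SPEC urls.md / AC-12 (expired links return 410)
#why: stale short links kept serving old targets after expiry
#verify: e2e/urls/expiry.spec.ts (AC-12, steps 1-4)
```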
What I'm trying to visualize is exactly where the cognitive bottleneck happens. So far I've identified three edges:
1. Spec <-> AC (User can shorten URL -> which ACs make this happen?)
2. AC <-> Plan (POST /urls/new must create a new DB record and respond with 200) -> what exactly must this code look like?
3. Plan/Execute/Verify -> given this E2E test, how can I verify that the test is doing what the AC assumes?
The cognitive bottleneck appears when we transform artifacts:
- Real world requirements (user want to use a browser) -> Spec (what exactly matters?)
- Spec -> AC (what exactly scenarios we are supporting?)
And you can see that at every step we are "compressing" something ambiguous into something deterministic. That's exactly what is going on in an engineer's head. And so the tooling I'm going to release soon is targeted exactly at eliminating the parts we spend the most time on: "figuring out how this file connects to the spec I have in my head, which I built from poorly described commit messages, outdated documents, Slack threads from 2016, and that guy who seemingly knew everything before he left the company".
> This is mostly solved just by writing proper commit messages
This argument reminds me of the HN Dropbox announcement top comment:
https://news.ycombinator.com/item?id=9224
Yeah it feels like it but I did my research!
I've been stuck with this problem for 10+ years now. I'm tired, I'm exhausted. Every single time I come to a new company, nobody writes a spec, and they expect you to magically understand how the app works just by reading legacy artifacts.
But in my experience, engineers writing quality artifacts are rare. Usually it's Staff+ who understand the importance of proper context.
So I want to show the world it's not that hard to do, and while doing so I believe I can automate like at least 50% of the pain everyone is having while "reverse-engineering specs/ACs from code".
The initial thought I had: "what if I edit a file and I immediately can tell which spec is gonna blow up?". That led me to SCIP indexes. OK cool, how do you actually connect Spec to codebase?
I've stuck for a long time for this one.
Then I figured ok this can be just Routes! Rails routes. Django routes. Whatever. The entry level to your application is the edge between business logic and your client-side app.
1. /auth -> Spec that explains how auth works
2. POST /urls/new + GET /urls -> Spec that explains how to create new URLs
OK cool... what next? Spec captures real world. It is lossy compression. And it is still not enough to go from spec -> code directly.
A Spec is implemented with Acceptance Criteria.
What is an AC? It's an E2E test with the concrete steps a user makes to show "this AC works".
OK cool. How do I verify "AC works"?
Evidence Artifacts.
When you create `POST /urls/new`, what do you expect to happen?
1. `insert into urls values (...)`
2. HTTP status 200
3. UI message "URL created"
4. ... whatever else you care about
Cool! Now what? How does that help me understand code I'm looking at?
Simple. One AC -> map to functions/classes/symbols that are used during execution (ever heard of traces?). So when I open a file, I see "okay so this file is used in AC1, AC21, AC8912, and so on". If I touch this file, I have to check those ACs. How do I check them? I run E2E tests that they point to. And I read ACs (English), and I verify what is going on in the system (code + artifacts).
So far so good?
Okay, next. Let's say we reviewed that this new feature added one AC and modified another AC, and all is good. We mark it as green. How do I commit it? Well, you go and describe exactly what the context behind this AC was (the SPEC we're touching, real-world context, anything that helps an engineer understand the situation we're in), why we made this change (because... business! Or maybe because we made an incorrect assumption and a bug appeared, or we're just doing a refactor / tech-debt removal), and how to test it (the verify section, which could be just pointers to the ACs/tests we were running).
If I haven't lost you yet, here's final part.
You can build this schema without EVER integrating it into the codebase. It can live on your disk. You just work normally on your project. But once you realize "okay, so this needs to work according to the Spec...", you no longer have to store it in your brain; you just run a Rust CLI script to create a new spec object and link the files that are important. And when you are doing code review, you just run yet another CLI tool, `git diff | blast-radius` -> which specs/ACs are affected, and what should I be testing? Keeping the spec/AC mapping to the codebase in sync is simple: when something changes, you update the index; new files must be placed under a spec, deleted files should be removed from SCIP, and edited files should be re-tested to see whether they still belong to the same spec or you need to update the maps.
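A minimal sketch of that `blast-radius` idea, assuming a plain file-to-AC index (all names and paths here are invented; a real version would derive the index from SCIP data and execution traces):

```python
# Toy "blast radius": given the files touched by a diff, report which
# ACs need re-checking. The index maps source files to the acceptance
# criteria whose E2E traces pass through them.
INDEX = {
    "app/urls.py": {"AC1: create short URL", "AC2: list URLs"},
    "app/auth.py": {"AC7: login"},
}

def blast_radius(changed_files, index=INDEX):
    """Return the union of ACs affected by the changed files."""
    affected = set()
    for path in changed_files:
        affected |= index.get(path, set())
    return affected

# Feed it the output of `git diff --name-only`:
print(sorted(blast_radius(["app/urls.py", "README.md"])))
```

In practice the index would be regenerated whenever the diff touches files not yet mapped to a spec, as described above.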
I believe this already provides value even if nobody else is using it. I'm giving away structure for your spec/ac/e2e/evidence/review artifacts. For free.
I just want proper feedback on this. What do you think?
edit: I will make a proper post on HN after I build the MVP with instructions on how to use and example workflows (greenfield project / legacy project with 10-500k LoC), the code would be Apache/MIT whatever. And later maybe I'll try to build something like Stage with nice UI and so on. But I want to solve this problem first, which is "teaching people to leave proper work artifacts and connecting their work from requirements to code, all the way through".
Hope that makes sense!
And thanks for reading.
Hmm. All of the examples simply describe what the code is doing. I need a tool that explains the intent and context behind a change.
If you've worked from a plan, it's trivial. I've got a setup where the agent reads the implementation plan, then creates a commit history based on intent rather than location.
Exactly. "Why was this change made"? "What were the options"? "Why this is a good way of doing it"? "What are the subtle things I came across while making this change"?
Yep that's something we're actively working on! would love to hear any perspectives on best ways to approach this
There isn't one. Most of the time you would pair-review a PR with the human who wrote it, and they could explain that. They can't anymore, since 9 times out of 10 they didn't think through those things.
Does Stage work for PRs that have multiple commits? These could be considered "stacked diffs", but in the same PR.
Chapters are regenerated every time a new commit is pushed to a PR. Our thinking is that the chapters should serve as "auto stacked diffs" since they should follow a logical order.
Do you or your team use stacking in your workflows?
I assume Gitlab/Github will add these sort of features to their products within the next few months
It's possible, but at the same time it's been years and they haven't copied things like Graphite's dashboard or stacked PR interface yet. We have the advantage of speed :)
Y’all are a bit nuts if you want 50% more per month than Claude Pro for this.
Really like this idea. But at what point do you think its valuable to have this chapters breakdown versus splitting things up into multiple PRs?
Yeah it's getting easier to have agents just add "one more thing" to a PR, and I think there is still an aspect of human engineering judgement to know when to break up PRs versus trusting AI/tools to keep velocity high.
In the ideal world, each PR is as small and self-contained as possible but we've noticed people struggling to justify the extra overhead every time.
Can reviewers adjust the chapter splits manually if they disagree with how it grouped the PR, or are the chapters fixed once generated?
We don't support that currently, but would love to see examples where you disagree with the chapters so we can figure out the best interface
You can regenerate the chapters anytime, but it might lead to similar results as the first time
We're also planning on adding functionality to support some sort of CHAPTERS.md file that lets you specify how you want things broken down!
CHAPTERS.md sounds like a good idea for when the auto-grouping doesn't match the actual structure of the work. Looking forward to seeing it.
“Putting the cuisine back in food”
Looks inside.
Now that we are all eating Soylent it can get a little bland sometimes. That's why we are releasing our international, curated spice package for your Soylent...
I've built this into a CLI TUI. It passes the whole diff to Claude Code with a schema and gets a structured narrative back out. Works really well for understanding.
Reconstituting messy things is exactly where LLMs can help.
easier: dont do vibe coding or allow AI bots
better: break up codebase into areas over which certain engs will "own" code reviews over. divy up burden
best: hire best folks, mentor them
[flagged]
(see https://news.ycombinator.com/newsguidelines.html#generated and https://news.ycombinator.com/item?id=47340079)
Thanks! Yeah we believe strongly that humans need to be in the code review loop to some extent
I think one thing we've seen from early users that surprised us is how chapters quickly became the unit of review for them, as opposed to files - and they've asked us to add functionality to mark chapters as viewed and comment on them as a whole
Another big surprise: now that agents are writing most (if not all) of the code, we've found that a lot of early users are using Stage not only to review others' PRs but also their own, before they have others review them