Coding Agent VMs on NixOS with Microvm.nix

(michael.stapelberg.ch)

58 points | by secure 3 days ago

8 comments

the_harpia_io 3 hours ago

The sandbox-or-not debate is important but it's only half the picture. Even a perfectly sandboxed agent can still generate code with vulnerabilities that get deployed to production - SQL injection, path traversal, hardcoded secrets, overly permissive package imports.
The execution sandbox stops the agent from breaking out during development, but the real risk is what gets shipped downstream. Seeing more tools now that scan the generated code itself, not just contain the execution environment.

[-]
- nh2 35 minutes ago
  
  I find that a bit of a weird point.
  The goal of such sandboxing is that you can allow the agent to freely write/execute/test code during development, so that it can propose a solution/commit without the human having to approve every dangerous step ("write a Python file, then execute it" is already a dangerous step). As the post says: "To safely run a coding agent without review".
  You would then review the code, and use it if it's good. Turning many small reviews where you need to be around and babysit every step into a single review at the end.
  What you seem to be asking for (shipping the generated code to production without review) is a completely different goal and probably a bad idea.
  If there really were a tool that can "scan the generated code" so reliably that it is safe to ship without human review, then that could just be part of the tool that generates the code in the first place so that no code scanning would be necessary. Sandboxing wouldn't be necessary either then. So then sandboxing wouldn't be "half the picture"; it would be unnecessary entirely, and your statement simplifies to "if we could auto-generate perfect code, we wouldn't need any of this".
- ryanrasti 1 hour ago
  
  Precisely! There's a fundamental tension: 1. Agents need to interact with the outside world to be useful 2. Interacting with the outside world is dangerous
  Sandboxes provide a "default-deny policy" which is the right starting point. But, current tools lack the right primitives to make fine grained data-access and data policy a reality.
  Object-capabilities provide the primitive for fine-grained access. IFC (information flow control) for dataflow.
rootnod3 4 hours ago

That is quite an involved setup to get a costly autocomplete going.
Is that really where we are at? Just outsource convenience to a few big players that can afford the hardware? Just to save on typing and god forbid…thinking?
“Sorry boss, I can’t write code because cloudflare is down.”

[-]
- Cyph0n 2 hours ago
  
  Keep in mind that this setup is a one-time cost. Also, a lot of the code is related to configuring it the way the author wants it (via Home Manager).
  Generally speaking, once you have a working NixOS config, incremental changes become extremely trivial, safe, and easy to rollback.
  
  [-]
  - aquariusDue 53 minutes ago
    
    To provide another data point: I too use NixOS and oh boy that one-time is really costly. And while we're sharing Nix stuff for LLMs there's this piece of kit too: https://github.com/YPares/rigup.nix
0xcb0 2 hours ago

I was looking for a way to isolate my agents in a more convenient way, and I really love your idea. I'm going to give this a try over the weekend and will report back.
But the one-time setup seems like a really fair investment for having a more secure development. Of course, what concerns the problem of getting malicious code to production, this will not help. But this will, with a little overhead, I think, really make development locally much more secure.
And you can automate it a lot. And it will be finally my chance to get more into NixOS :D
mxs_ 1 hour ago

I there a way to make this work with macOS hosts, preferably without having to install a Linux toolchain inside the VM for the language the agent will be writing code in?

[-]
- mtlynch 53 minutes ago
  
  This is a similar macOS solution:
  https://github.com/lynaghk/vibe/
NJL3000 1 hour ago

A pair of containers felt a bit cheaper than a VM:
https://github.com/5L-Labs/amp_in_a_box
I was going to add Gemini / OpenCode Kilo next.
There is some upfront cost to define what endpoints to map inside, but it definitely adds a veneer of preventing the crazy…

[-]
- phrotoma 54 minutes ago
  
  One problem with using containers as an isolation environment for a coding assistant is that it becomes challenging to have the agent work on a containerized project. You often need some janky "docker-in-docker" nonsense that hampers efforts.
  
  [-]
  - NJL3000 8 minutes ago
    
    I was planning to have worktrees bind mounted systematically, but agree it’s not super clean atm at scale (yet)
messh 1 hour ago

I use shellbox.dev to create sandboxes through ssh, without ever leaving the terminal
heliumtera 3 hours ago

Couldn't you replicate all of your setup with qemu microvm?
Without nix I mean

[-]
- rictic 3 hours ago
  
  Yep. What nix adds is a declarative and reproducible way to build customized OS images to boot into.
  
  [-]
  - CuriouslyC 2 hours ago
    
    Nix is the best answer to "works on my machine," which is a problem I've seen at pretty much every place I've ever worked.
    
    [-]
    - 0x457 48 minutes ago
      
      It's also an answer to caching with /nix/store. I wish more cloud services supported "give me your nixosConfiguration or something similar" instead of providing api to build containers/vms imperatively. Dockerfile and everything that mimics it is my least favorite way to do this.
clawsyndicate 3 days ago

we run ~10k agent pods on k3s and went with gvisor over microvms purely for density. the memory overhead of a dedicated kernel per tenant just doesn't scale when you're trying to pack thousands of instances onto a few nodes. strict network policies and pid limits cover most of the isolation gaps anyway.

[-]
- alexzenla 52 minutes ago
  
  This is a big reason for our strategy at Edera (https://edera.dev) of building hypervisor technology that eliminates the standard x86/ARM kernel overhead in favor of deep para-virtualization.
  The performance of gVisor is often a big limiting factor in deployment.
  
  [-]
  - souvik1997 7 minutes ago
    
    Edera looks very cool! Awesome team too.
    I read the thesis on arxiv. Do you see any limitations from using Xen instead of KVM? I think that was the biggest surprise for me as I have very rarely seen teams build on Xen.
- souvik1997 1 hour ago
  
  Hey @clawsyndicate I'd love to learn more about your use case. We are working on a product that would potentially get you the best of both worlds (microVM security and containers/gVisor scalability). My email is in my profile.
  
  [-]
  - alexzenla 41 minutes ago
    
    This is the thesis of our research paper here, a good middle ground is necessary: https://arxiv.org/abs/2501.04580
- secure 3 days ago
  
  Yeah, when you run ≈10k agents instead of ≈10, you need a different solution :)
  I’m curious what gVisor is getting you in your setup — of course gVisor is good for running untrusted code, but would you say that gVisor prevents issues that would otherwise make the agent break out of the kubernetes pod? Like, do you have examples you’ve observed where gVisor has saved the day?
  
  [-]
  - zeroxfe 3 hours ago
    
    I've used both gVisor and microvms for this (at very large scales), and there are various tradeoffs between the two.
    The huge gVisor drawback is that it __drastically_ slows down applications (despite startup time being faster.)
    For agents, the startup time latency is less of an issue than the runtime cost, so microvms perform a lot better. If you're doing this in kube, then there's a bunch of other challenges to deal with if you want standard k8s features, but if you're just looking for isolated sandboxes for agents, microvms work really well.
  - clawsyndicate 2 days ago
    
    since we allow agents to execute arbitrary python, we treat every container as hostile. we've definitely seen logs of agents trying to crawl /proc or hit the k8s metadata api. gvisor intercepts those syscalls so they never actually reach the host kernel.
    
    [-]
    - alexzenla 43 minutes ago
      
      The reason why virtualization approaches with true Linux kernels is still important is what you do allow via syscalls ultimately does result in a syscall on the host system, even if through layers of indirection. Ultimately, if you fork() in gVisor, that calls fork() on the host (btw fork() execve() is expensive on gVisor still).
      The middle ground we've built is that a real Linux kernel interfaces with your application in the VM (we call it a zone), but that kernel then can make specialized and specific interface calls to the host system.
      For example with NVIDIA on gVisor, the ioctl()'s are passed through directly, with NVIDIA driver vulnerabilities that can cause memory corruption, it leads directly into corruption in the host kernel. With our platform at Edera (https://edera.dev), the NVIDIA driver runs in the VM itself, so a memory corruption bug doesn't percolate to other systems.
    - rootnod3 3 hours ago
      
      And you see no problem in that at all? Just “throw a box around it and let the potentially malicious code run”?
      Wait until they find a hole. Then good luck.
      
      [-]
      - alexzenla 38 minutes ago
        
        This is why you can't build these microVM systems to just do isolation, it has to provide more value than that. Observability, policy, etc.
- dist-epoch 3 hours ago
  
  LXC containers inside a VM scales. bonus point that LXC containers feel like a VM.