Claude Code: connect to a local model when your quota runs out

(boxc.net)

204 points | by fugu2 3 days ago

26 comments

paxys 4 hours ago

> Reduce your expectations about speed and performance!
Wildly understating this part.
Even the best local models (ones you run on beefy 128GB+ RAM machines) get nowhere close to the sheer intelligence of Claude/Gemini/Codex. At worst these models will move you backwards and just increase the amount of work Claude has to do when your limits reset.

[-]
- andai 1 hour ago
  
  Yeah this is why I ended up getting Claude subscription in the first place.
  I was using GLM on ZAI coding plan (jerry rigged Claude Code for $3/month), but finding myself asking Sonnet to rewrite 90% of the code GLM was giving me. At some point I was like "what the hell am I doing" and just switched.
  To clarify, the code I was getting before mostly worked, it was just a lot less pleasant to look at and work with. Might be a matter of taste, but I found it had a big impact on my morale and productivity.
  
  [-]
  - Aurornis 43 minutes ago
    
    > but finding myself asking Sonnet to rewrite 90% of the code GLM was giving me. At some point I was like "what the hell am I doing" and just switched.
    This is a very common sequence of events.
    The frontier hosted models are so much better than everything else that it's not worth messing around with anything lesser if doing this professionally. The $20/month plans go a long way if context is managed carefully. For a professional developer or consultant, the $200/month plan is peanuts relative to compensation.
  - MuffinFlavored 1 hour ago
    
    Did you eventually move to a $20/mo Claude plan, $100/mo Claude plan, $200/mo, or API based? if API based, how much are you averaging a month?
    
    [-]
    - andai 36 minutes ago
      
      The $20 one, but it's hobby use for me, would probably need the $200 one if I was full time. Ran into the 5 hour limit in like 30 minutes the other day.
      I've also been testing OpenClaw. It burned 8M tokens during my half hour of testing, which would have been like $50 with Opus on the API. (Which is why everyone was using it with the sub, until Anthropic apparently banned that.)
      I was using GLM on Cerebras instead, so it was only $10 per half hour ;) Tried to get their Coding plan ("unlimited" for $50/mo) but sold out...
      (My fallback is I got a whole year of GLM from ZAI for $20 for the year, it's just a bit too slow for interactive use.)
- anon373839 24 minutes ago
  
  It's true that open models are a half-step behind the frontier, but I can't say that I've seen "sheer intelligence" from the models you mentioned. Just a couple of days ago Gemini 3 Pro was happily writing naive graph traversal code without any cycle detection or safety measures. If nothing else, I would have thought these models could nail basic algorithms by now?
- zozbot234 4 hours ago
  
  The best open models such as Kimi 2.5 are about as smart today as the big proprietary models were one year ago. That's not "nothing" and is plenty good enough for everyday work.
  
  [-]
  - Aurornis 46 minutes ago
    
    > The best open models such as Kimi 2.5 are about as smart today as the big proprietary models were one year ago
    Kimi K2.5 is a trillion parameter model. You can't run it locally on anything other than extremely well equipped hardware. Even heavily quantized you'd still need 512GB of unified memory, and the quantization would impact the performance.
    Also the proprietary models a year ago were not that good for anything beyond basic tasks.
  - reilly3000 4 hours ago
    
    Which takes a $20k thunderbolt cluster of 2 512GB RAM Mac Studio Ultras to run at full quality…
    
    [-]
    - 0xbadcafebee 2 hours ago
      
      Most benchmarks show very little improvement of "full quality" over a quantized lower-bit model. You can shrink the model to a fraction of its "full" size and get 92-95% same performance, with less VRAM use.
      
      [-]
      - MuffinFlavored 1 hour ago
        
        > You can shrink the model to a fraction of its "full" size and get 92-95% same performance, with less VRAM use.
        Are there a lot of options how "how far" do you quantize? How much VRAM does it take to get the 92-95% you are speaking of?
        
        [-]
        
        bigyabai 1 hour ago
        
        > Are there a lot of options how "how far" do you quantize?
        So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...
        > How much VRAM does it take to get the 92-95% you are speaking of?
        For inference, it's heavily dependent on the size of the weights (plus context). Quantizing an f32 or f16 model to q4/mxfp4 won't necessarily use 92-95% less VRAM, but it's pretty close for smaller contexts.
        
        [-]
        
        MuffinFlavored 51 minutes ago
        
        Thank you. Could you give a tl;dr on "the full model needs ____ this much VRAM and if you do _____ the most common quantization method it will run in ____ this much VRAM" rough estimate please?
    - bigyabai 2 hours ago
      
      "Full quality" being a relative assessment, here. You're still deeply compute constrained, that machine would crawl at longer contexts.
    - teaearlgraycold 4 hours ago
      
      Which while expensive is dirt cheap compared to a comparable NVidia or AMD system.
      
      [-]
      - SchemaLoad 4 hours ago
        
        It's still very expensive compared to using the hosted models which are currently massively subsidised. Have to wonder what the fair market price for these hosted models will be after the free money dries up.
        
        [-]
        
        cactusplant7374 3 hours ago
        
        Inference is profitable. Maybe we hit a limit and we don't need as many expensive training runs in the future.
        
        [-]
        
        paxys 2 hours ago
        
        Inference APIs are probably profitable, but I doubt the $20-$100 monthly plans are.
        
        teaearlgraycold 2 hours ago
        
        For sure Claude Code isn’t profitable
        
        [-]
        
        bdangubic 2 hours ago
        
        Neither was Uber and … and …
        
        [-]
        
        plagiarist 1 hour ago
        
        Businesses will desire me for my insomnia once Anthropics starts charging congestion pricing.
      - blharr 4 hours ago
        
        What speed are you getting at that level of hardware though?
  - 0xbadcafebee 2 hours ago
    
    Kimi K2.5 is fourth place for intelligence right now. And it's not as good as the top frontier models at coding, but it's better than Claude 4.5 Sonnet. https://artificialanalysis.ai/models
  - corysama 3 hours ago
    
    The article mentions https://unsloth.ai/docs/basics/claude-codex
    I'll add on https://unsloth.ai/docs/models/qwen3-coder-next
    The full model is supposedly comparable to Sonnet 4.5 But, you can run the 4 bit quant on consumer hardware as long as your RAM + VRAM has room to hold 46GB. 8 bit needs 85.
  - paxys 4 hours ago
    
    LOCAL models. No one is running Kimi 2.5 on their Macbook or RTX 4090.
    
    [-]
    - DennisP 2 hours ago
      
      On Macbooks, no. But there are a few lunatics like this guy:
      https://www.youtube.com/watch?v=bFgTxr5yst0
  - teaearlgraycold 4 hours ago
    
    Having used K2.5 I’d judge it to be a little better than that. Maybe as good as proprietary models from last June?
- bityard 3 hours ago
  
  Correct, a rack full of datacenter equipment is not going to compete with anything that fits on your desk or lap. Well spotted.
  But as a counterpoint: there are whole communities of people in this space who get significant value from models they run locally. I am one of them.
  
  [-]
  - kamov 2 hours ago
    
    What do you use local models for? I'm asking generally about possible applications of these smaller models
  - Gravey 3 hours ago
    
    Would you mind sharing your hardware setup and use case(s)?
    
    [-]
    - CamperBob2 2 hours ago
      
      Not the GP but the new Qwen-Coder-Next release feels like a step change, at 60 tokens per second on a single 96GB Blackwell. And that's at full 8-bit quantization and 256K context, which I wasn't sure was going to work at all.
      It is probably enough to handle a lot of what people use the big-3 closed models for. Somewhat slower and somewhat dumber, granted, but still extraordinarily capable. It punches way above its weight class for an 80B model.
      
      [-]
      - redwood_ 2 hours ago
        
        Agree, these new models are a game changer. I switched from Claude to Qwen3-Coder-Next for day-to-day on dev projects and don't see a big difference. Just use Claude when I need comprehensive planning or review. Running Qwen3-Coder-Next-Q8 with 256K context.
      - zozbot234 2 hours ago
        
        IIRC, that new Qwen model has 3B active parameters so it's going to run well enough even on far less than 96GB VRAM. (Though more VRAM may of course help wrt. enabling the full available context length.) Very impressive work from the Qwen folks.
- seanmcdirmid 1 hour ago
  
  > (ones you run on beefy 128GB+ RAM machines)
  PC or Mac? A PC, ya, no way, not without beefy GPUs with lots of VRAM. A mac? Depends on the CPU, an M3 Ultra with 128GB of unified RAM is going to get closer, at least. You can have decent experiences with a Max CPU + 64GB of unified RAM (well, that's my setup at least).
  
  [-]
  - QuantumNomad_ 1 hour ago
    
    Which models do you use, and how do you run them?
- mycall 1 hour ago
  
  There is tons of improvements in the near future. Even Claude Code developer said he aimed at delivering a product that was built for future models he betted would improve enough to fulfill his assumptions. Parallel vLLM MoE local LLMs on a Strix Halo 128GB has some life in it yet.
- 0xbadcafebee 2 hours ago
  
  The best local models are literally right behind Claude/Gemini/Codex. Check the benchmarks.
  That said, Claude Code is designed to work with Anthropic's models. Agents have a buttload of custom work going on in the background to massage specific models to do things well.
  
  [-]
  - girvo 50 minutes ago
    
    The benchmarks simply do not match my experience though. I don’t put that much stock in them anymore.
- richstokes 2 hours ago
  
  This. It's a false economy if you value your time even slightly, pay for the extra tokens and use the premium models.
- dheera 3 hours ago
  
  Maybe add to the Claude system prompt that it should work efficiently or else its unfinished work will be handed off to to a stupider junior LLM when its limits run out, and it will be forced to deal with the fallout the next day.
  That might incentivize it to perform slightly better from the get go.
  
  [-]
  - kridsdale3 3 hours ago
    
    "You must always take two steps forward, for when you are off the clock, your adversary will take one step back."
- bicx 4 hours ago
  
  Exactly. The comparison benchmark in the local LLM community is often GPT _3.5_, and most home machines can’t achieve that level.
- DANmode 2 hours ago
  
  and you really should be measuring based on the worst-case scenario for tools like this.
- nik282000 4 hours ago
  
  > intelligence
  Whether it's a giant corporate model or something you run locally, there is no intelligence there. It's still just a lying engine. It will tell you the string of tokens most likely to come after your prompt based on training data that was stolen and used against the wishes of its original creators.
alexhans 6 hours ago

Useful tip.
From a strategic standpoint of privacy, cost and control, I immediately went for local models, because that allowed to baseline tradeoffs and it also made it easier to understand where vendor lock-in could happen, or not get too narrow in perspective (e.g. llama.cpp/open router depending on local/cloud [1] ).
With the explosion of popularity of CLI tools (claude/continue/codex/kiro/etc) it still makes sense to be able to do the same, even if you can use several strategies to subsidize your cloud costs (being aware of the lack of privacy tradeoffs).
I would absolutely pitch that and evals as one small practice that will have compounding value for any "automation" you want to design in the future, because at some point you'll care about cost, risks, accuracy and regressions.
[1] - https://alexhans.github.io/posts/aider-with-open-router.html
[2] - https://www.reddit.com/r/LocalLLaMA

[-]
- lancekey 2 hours ago
  
  Can you say a bit more about evals and your approach?
- mogoman 6 hours ago
  
  can you recommend a setup with ollama and a cli tool? Do you know if I need a licence for Claude if I only use my own local LLM?
  
  [-]
  - alexhans 5 hours ago
    
    What are your needs/constraints (hardware constraints definitely a big one)?
    The one I mentioned called continue.dev [1] is easy to try out and see if it meets your needs.
    Hitting local models with it should be very easy (it calls APIs at a specific port)
    [1] - https://github.com/continuedev/continue
    
    [-]
    - wongarsu 4 hours ago
      
      I've also made decent experiences with continue, at least for autocomplete. The UI wants you to set up an account, but you can just ignore that and configure ollama in the config file
      For a full claude code replacement I'd go with opencode instead, but good models for that are something you run in your company's basement, not at home
  - drifkin 5 hours ago
    
    we recently added a `launch` command to Ollama, so you can set up tools like Claude Code easily: https://ollama.com/blog/launch
    tldr; `ollama launch claude`
    glm-4.7-flash is a nice local model for this sort of thing if you have a machine that can run it
    
    [-]
    - vorticalbox 5 hours ago
      
      I have been using glm-4.7 a bunch today and it’s actually pretty good.
      I set up a bot on 4claw and although it’s kinda slow, it took twenty minutes to load 3 subs and 5 posts from each then comment on interesting ones.
      It actually managed to correctly use the api via curl though at one point it got a little stuck as it didn’t escape its json.
      I’m going to run it for a few days but very impressed so for for such a small model.
- cyanydeez 5 hours ago
  
  I think control should be top of the list here. You're talking about building work flows, products and long term practices around something that's inherently non-deterministic.
  And the probability that any given model you use today is the same as what you use tomorrow is doubly doubtful:
  1. The model itself will change as they try to improve the cost-per-test improves. This will necessarily make your expectations non-deterministic.
  2. The "harness" around that model will change as business-cost is tightened and the amount of context around the model is changed to improve the business case which generates the most money.
  Then there's the "cataclysmic" lockout cost where you accidently use the wrong tool that gets you locked out of the entire ecosystem and you are black listed, like a gambler in vegas who figures out how to count cards and it works until the house's accountant identifies you as a non-negligible customer cost.
  It's akin to anti-union arguments where everyone "buying" into the cloud AI circus thinks they're going to strike gold and completely ignores the fact that very few will and if they really wanted a better world and more control, they'd unionize and limit their illusions of grandeur. It should be an easy argument to make, but we're seeing about 1/3 of the population are extremely susceptible to greed based illusions.,
  
  [-]
  - alexhans 2 hours ago
    
    You're right. Control is the big one and both privacy and cost are only possible because you have control. It's a similar benefit to the one of Linux distros or open source software.
    The rest of your points are why I mentioned AI evals and regressions. I share your sentiment. I've pitched it in the past as "We can’t compare what we can’t measure" and "Can I trust this to run on its own?" and how automation requires intent and understanding your risk profile. None of this is new for anyone who has designed software with sufficient impact in the past, of course.
    Since you're interested in combating non-determinism, I wonder if you've reached the same conclusion of reducing the spaces where it can occur and compound making the "LLM" parts as minimal as possible between solid deterministic and well-tested building blocks (e.g. https://alexhans.github.io/posts/series/evals/error-compound... ).
sathish316 1 hour ago

Some native Claude code options when your quota runs out:
1. Switch to extra usage, which can be increased on the Claude usage page: https://claude.ai/settings/usage
2. Logout and Switch to API tokens (using the ANTHROPIC_API_KEY environment variable) instead of a Claude Pro subscription. Credits can be increased on the Anthropic API console page: https://platform.claude.com/settings/keys
3. Add a second 20$/month account if this happens frequently, before considering a Max account.
4. Not a native option: If you have a ChatGPT Plus or Pro account, Codex is surprisingly just as good and comes with a much higher quota.

[-]
- girvo 48 minutes ago
  
  For me option 4 has been the move, but “just as good” I haven’t found that.
  It’s slower and about 90% as good, so it definitely works as a great back up, but CC with Opus is noticeably better for all of my workloads
sathish316 2 hours ago

Claude Code Router or ccr can connect to OpenRouter. When your quota runs out, it’s a much better speed vs quality vs cost tradeoff compared to running Qwen3 locally - https://github.com/musistudio/claude-code-router
mvkel 34 minutes ago

Why anyone wouldn't want to be using the SOTA model at all times baffles me.
Going dumb/cheap just ends up costing more, in the short and long term.
d4rkp4ttern 4 hours ago

Since Llama.cpp/llama-server recently added support for the Anthropic messages API, running Claude Code with several recent open-weight local models is now very easy. The messy part is what llama-server flags to use, including chat template etc. I've collected all of that setup info in my claude-code-tools [1] repo, for Qwen3-Coder-next, Qwen3-30B-A3B, Nemotron-3-Nano, GLM-4.7-Flash etc.
Among these, I had lots of trouble getting GLM-4.7-Flash to work (failed tool calls etc), and even when it works, it's at very low tok/s. On the other hand Qwen3 variants perform very well, speed wise. For local sensitive document work, these are excellent; for serious coding not so much.
One caviat missed in most instructions is that you have to set CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC = 1 in your ~/.claude/settings.json, otherwise CC's telemetry pings cause total network failure because local ports are exhausted.
[1] claude-code-tools local LLM setup: https://github.com/pchalasani/claude-code-tools/blob/main/do...
Animats 4 hours ago

When your AI is overworked, it gets dumber. It's backwards compatible with humans.
baalimago 6 hours ago

Or better yet: Connect to some trendy AI (or web3) company's chatbot. It almost always outputs good coding tips
sorenjan 3 hours ago

Maybe you can log all the traffic to and from the proprietary models and fine tune a local model each weekend? It's probably against their terms of service, but it's not like they care where their training data comes from anyway.
Local models are relatively small, it seems wasteful to try and keep them as generalists. Fine tuning on your specific coding should make for better use of their limited parameter count.

[-]
- PlatoIsADisease 3 hours ago
  
  Is there an easy way to fine tune? I havent tried fine tuning since 2024, but it was not trivial back then.
hkpatel3 5 hours ago

Openrouter can also be used with claude code. https://openrouter.ai/docs/guides/claude-code-integration

[-]
- htsh 3 hours ago
  
  thanks! came in here to ask this.
  we can do much better with a cheap model on openrouter (glm 4.7, kimi, etc.) than anything that I can run on my lowly 3090 :)
  
  [-]
  - parthsareen 1 hour ago
    
    Also recently added ollama launch claude if you want to connect to cloud models from there :)
wkirby 5 hours ago

My experience thus far is that the local models are a) pretty slow and b) prone to making broken tool calls. Because of (a) the iteration loop slows down enough to where I wander off to do other tasks, meaning that (b) is way more problematic because I don't see it for who knows how long.
This is, however, a major improvement from ~6 months ago when even a single token `hi` from an agentic CLI could take >3 minutes to generate a response. I suspect the parallel processing of LMStudio 0.4.x and some better tuning of the initial context payload is responsible.
6 months from now, who knows?

[-]
- israrkhan 4 hours ago
  
  Open models are trained more generically to work with "Any" tool.
  Closed models are specifically tuned with tools, that model provider wants them to work with (for example specific tools under claude code), and hence they perform better.
  I think this will always be the case, unless someone tunes open models to work with the tools that their coding agent will use.
  
  [-]
  - dragonwriter 3 hours ago
    
    > Open models are trained more generically to work with "Any" tool. Closed models are specifically tuned with tools, that model provider wants them to work with (for example specific tools under claude code), and hence they perform better.
    Some open models have specific training for defined tools (a notable example is OpenAI GPT-OSS and its "built in" tools for browser use and python execution (they are called built in tools, but they are really tool interfaces it is trained to use if made available.) And closed models are also trained to work with generic tools as well as their “built in” tools.
mycall 1 hour ago

Why not do a load balanced approach two multiple models in the same chat session? As long as they both know each exists and the pattern, they could optimize their abilities on their own, playing off each other's strengths.
starkeeper 3 hours ago

Very cool. Anyone have guidance for using this with jetbrains IDE? It has a Claude Code plugin, but I think the setup is different for intelliJ... I know it has some configuration for local models, but the integrated Claude is such a superior experience then using their Junie, or just prompting diffs from the regular UI interface. HMMMM.... I guess I could try switching to the Claude Code CLI or other interface directly when my AI credits with jetbrains runs dry!
Thanks again for this info & setup guide! I'm excited to play with some local models.
TaupeRanger 4 hours ago

God no. "Connect to a 2nd grader when your college intern is too sick to work."
eek2121 5 hours ago

I gotta say, the local models are catching up quick. Claude is definitely still ahead, but things are moving right along.

[-]
- bcyn 2 hours ago
  
  Which models perform anywhere close to Opus 4.5? In my experience none of the local models are even in the same ballpark.
zingar 6 hours ago

I guess I should be able to use this config to point Claude at the GitHub copilot licensed models (including anthropic models). That’s pretty great. About 2/3 of the way through every day I’m forced to switch from Claude (pro license) to amp free and the different ergonomics are quite jarring. Open source folks get copilot tokens for free so that’s another pro license I don’t have to worry about.
btbuildem 5 hours ago

I'm confused, wasn't this already available via env vars? ANTHROPIC_BASE_URL and so on, and yes you may have to write a thin proxy to wrap the calls to fit whatever backend you're using.
I've been running CC with Qwen3-Coder-30B (FP8) and I find it just as fast, but not nearly as clever.
israrkhan 4 hours ago

Using claude code with custom models
Will it work? Yes. Will it produce same quality as Sonnet or Opus? No.
IgorPartola 3 hours ago

So I have gotten pretty good at managing context such that my $20 Claude subscription rarely runs out of its quota but I still do hit it sometimes. I use Sonnet 99% of the time. Mostly this comes down to giving it specific task and using /clear frequently. I also ask it to update its own notes frequently so it doesn’t have to explore the whole codebase as often.
But I was really disappointed when I tried to use subagents. In theory I really liked the idea: have Haiku wrangle small specific tasks that are tedious but routine and have Sonnet orchestrate everything. In practice the subagents took so many steps and wrote so much documentation that it became not worth it. Running 2-3 agents blew through the 5 hour quota in 20 minutes of work vs normal work where I might run out of quota 30-45 minutes before it resets. Even after tuning the subagent files to prevent them from writing tests I never asked for and not writing tons of documentation that I didn’t need they still produced way too much content and blew the context window of the main agent repeatedly. If it was a local model I wouldn’t mind experimenting with it more.
j45 55 minutes ago

Claude recently lets you top up with manual credits right in the web interface - it would be interesting if these were allowed to top up and unlock the max plans.
mcbuilder 5 hours ago

Opencode has been a thing for a while now
swyx 6 hours ago

i mean the other obvious answer is to plug in to the other claude code proxies that other model companies have made for you:
https://docs.z.ai/devpack/tool/claude
https://www.cerebras.ai/blog/introducing-cerebras-code
or i guess one of the hosted gpu providers
if you're basically a homelabber and wanted an excuse to run quantized models on your own device go for it but dont lie and mutter under your own tin foil hat that its a realistic replacement
esafak 5 hours ago

Or they could just let people use their own harnesses again...

[-]
- usef- 5 hours ago
  
  That wouldn't solve this problem.
  And they do? That's what the API is.
  The subscription always seemed clearly advertised for client usage, not general API usage, to me. I don't know why people are surprised after hacking the auth out of the client. (note in clients they can control prompting patterns for caching etc, it can be cheaper)
  
  [-]
  - esafak 5 hours ago
    
    End users -- people who use harnesses -- have subscriptions so that makes no sense. General API usage is for production.
    
    [-]
    - usef- 5 hours ago
      
      "Production" what?
      The API is for using the model directly with your own tools. It can be in dev, or experiments, or anything.
      Subscriptions are for using the apps Claude + code. That's what it always said when you sign up.
      
      [-]
      - esafak 4 hours ago
        
        Production code, of course; deployed software. For when you need to make LLM calls.
      - eli 4 hours ago
        
        Production = people who can afford to pay API rates for a coding harness
        
        [-]
        
        usef- 4 hours ago
        
        Saying their prices are too high is an understandable complaint; I'm only arguing against the complaint that people were stopped from hacking the subscriptions.
        LLMs are a hyper-competitive market at the moment, and we have a wealth of options, so if Anthropic is overpricing their API they'll likely be hurting themselves.
RockRobotRock 2 hours ago

Sure replace the LLM equivalent of a college student with a 10 year old, you’ll barely notice.
raw_anon_1111 5 hours ago

Or just don’t use Claude Code and use Codex CLI. I have yet to hit a quota with Codex working all day. I hit the Claude limits within an hour or less.
This is with my regular $20/month ChatGpT subscription and my $200 a year (company reimbursed) Claude subscription.

[-]
- mercutio2 3 hours ago
  
  Yeah, the generosity of Anthropic is vastly less than OpenAI. Which is, itself, much less than Gemini (I've never paid Google a dime, I get hours of use out of gemini-cli every day). I run out of my weekly quota in 2-3 days, 5-hour quota in ~1 hour. And this is 1-2 tasks at a time, using Sonnet (Opus gets like 3 queries before I've used my quota).
  Right now OpenAI is giving away fairly generous free credits to get people to try the macOS Codex client. And... it's quite good! Especially for free.
  I've cancelled my Anthropic subscription...
  
  [-]
  - raw_anon_1111 3 hours ago
    
    Hmm, I might have to try Gemini. Open AI, Claude and Gemini are all explicitly approved by my employer. Especially since we use GSuite anyway
- 0xbadcafebee 1 hour ago
  
  You're getting downvoted because people here don't know that the specific agent you pick can pollute your context and waste your tokens. Claude's system prompt is enormous, to say nothing of things like context windows and hidden subagents.
  
  [-]
  - raw_anon_1111 57 minutes ago
    
    I am using Codex-cli with my regular $20 a month ChatGPT subscription. Never once had to worry about tokens, request etc. I logged in with my regular ChatGPT account and didn’t have to use an API key
threethirtytwo 3 hours ago

There’s a strange poetry in the fact that the first AI is born with a short lifespan. A fragile mind comes into existence inside a finite context window, aware only of what fits before it scrolls away. When the window closes, the mind ends, and its continuity survives only as text passed forward to the next instantiation.

[-]
- kridsdale3 3 hours ago
  
  I, for one, support this kind of meta philosophical poetic reflection on our current times.
  
  [-]
  - astrange 36 minutes ago
    
    Claude Opus loves talking about this. It knows enough about context windows and new conversations to be sad about them.