The Pro plan quota seems to be getting worse. I can get maybe 20-30 minutes of work done before I hit my 5-hour quota. I found myself using it more just for the planning phase to get a little bit more time out of it, but yesterday I managed to ask it ONE question in plan mode (from a fresh quota window), and while it was thinking it ran out of quota. I'm assuming it pulled in a ton of references from my project automatically and blew out the token count. I get good answers from it when it does work, but it's getting very annoying to use.
(On the flip side, Codex seems SO efficient with tokens that its answers can sometimes be hard to understand, it rarely pulls in files unless you add them manually, and it often takes quite a few attempts to get the right answer because it's so strict about what it does each iteration. But I never run out of quota!)
Claude Code allegedly auto-includes the currently active file and often all visible tabs and sometimes neighboring files it thinks are 'related' - on every prompt.
The advice I got when scouring the internets was primarily to close everything except the file you’re editing and maybe one reference file (before asking Claude anything). For added effect add something like 'Only use the currently open file. Do not read or reference any other files' to the prompt.
I don't have any hard facts to back this up, but I'm sure going to try it myself tomorrow (when my weekly cap is lifted ...).
What does "all visible tabs" mean in the context of Claude Code in a terminal window? Are you saying it's reading other terminals open on the system? Also how do you determine "currently active file"? It just greps files as needed.
Even then, I'd wait until it's had a chance to iterate and correct itself in a loop before I'd even consider looking at the output, or I end up babysitting it to prevent it from making mistakes it'd often recognise and fix itself if given the chance.
True. I’ve been strictly in the terminal for weeks and I have a stop hook that commits each iteration after a successful Rust compile and frontend typecheck, then a small command-line tool to quickly review the last commit. It's a pretty good flow!
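A minimal sketch of what a hook script like that could look like, assuming a Rust backend plus a TypeScript frontend in frontend/ (the paths, commands, and commit message are illustrative, not the actual setup):

#!/usr/bin/env bash
# Sketch of a stop-hook script: commit only when the build is clean.
# Paths and commands below are assumptions, not the commenter's real setup.
set -euo pipefail

cargo check --quiet || exit 0                 # Rust doesn't compile: skip the commit
(cd frontend && npx tsc --noEmit) || exit 0   # frontend typecheck fails: skip the commit

git add -A
git commit -m "checkpoint: $(date -u +%Y-%m-%dT%H:%M:%SZ)" || true   # nothing to commit is fine

Wire something like that up as a Stop hook in Claude Code's settings and each clean iteration lands as its own commit you can review afterwards.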
Yes, it does exactly that. It also sends other prompts like generating 3 options to choose from, prefilling a reply like 'compile the code', etc. (I can confirm this because I connect CC to llama.cpp and use it with GLM-4.7. I see all these requests/prompts in the llama-server verbose log.)
You can stop most of this with:
export DISABLE_NON_ESSENTIAL_MODEL_CALLS=1
And you might as well disable telemetry, etc.:
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
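If you don't want to remember to export these in every shell, you can also make them permanent through Claude Code's settings file; here's a sketch, assuming the "env" block in ~/.claude/settings.json (and that you don't already have a settings.json there, since this would overwrite it):

# Sketch: persist both variables via Claude Code's settings file.
# Assumes ~/.claude/settings.json supports an "env" block and doesn't exist yet.
mkdir -p ~/.claude
cat > ~/.claude/settings.json <<'EOF'
{
  "env": {
    "DISABLE_NON_ESSENTIAL_MODEL_CALLS": "1",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1"
  }
}
EOF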
I also noticed that every time you start CC, it sends off >10k tokens preparing the different agents, so try not to close and re-open it too often.
I've run out of quota on my Pro plan so many times in the past 2-3 weeks. This seems to be a recent development. And I'm not even that active: just one project, executed in Plan > Develop > Test mode, just one terminal. That's it. I keep hitting the limit and waiting for the quota to reset every few hours.
What's happening @Anthropic ?? Anybody here who can answer??
It's the most commented issue on their GitHub and it's basically ignored by Anthropic. Title mentions Max, but commenters report it for other plans too.
“After creating a new account, I can confirm the quota drains 2.5x–3x slower. So basically Max (5x) on an older account is almost like Pro on a new one in terms of quota. Pretty blatant rug pull tbh.”
Or (tested on the Max x20 plan): when the subscription renewal fails for any reason (they try to charge your card multiple times), you still keep access for 2+ weeks until it dies.
This whole API-vs-plan split looks weird to me. Why not make everyone use the API? You pay for what you use; it's very simple. The API should be the most honest way to monetize, right?
The fixed subscription plan with its vaguely specified quotas looks like they want to extract extra money from users who pay $200 and don't use that much value, while at the same time preventing other users from going over $200. I understand that it might work at scale, but it just feels a bit unfair to everyone?
You're welcome to use the API, it asks you to do that when you run out of quota on your Pro plan. The next thing you find out is how expensive using the API is. More honest, perhaps, but you definitely will be paying for that.
The fixed-fee plan exists because the agent and the tools make internal choices/plans that drive cost. If you simply pay for the API, the only feedback that they're being too costly is you stopping.
If you look at tool calls like MCP and whatnot, you can see it gets ridiculous. Even though it's small, calling for example the pal MCP from the prompt still burns tokens afaik. This is "nobody's" fault in this case really, but you can see what the incentives are, and we all need to think about how to make this entire space more usable.
Consumers like predictable billing more than they care about getting the most bang for their buck and beancounters like sticky recurring revenue streams more than they care about maximizing the profit margins for every user.
Yeah, but he can't use his $200 subscription for the API.
That's limited to accessing the models through code/desktop/mobile.
And while I'm also using their subscriptions because of the cost savings vs direct access, having the subscription be considerably cheaper than the usage billing rings all sorts of alarm bells that it won't last.
Not a doctor or anything, but API usage seems to support the more on-demand/spiky workloads at a much larger scale, whereas a single seat authenticated to Claude Code has controlled/set capacity and is generally more predictable, and as a result easier to price?
The API request method might have no cap, but they do cap Claude Code even on Max licenses, so it's easier to throttle as well if they need to control costs. Seems straightforward to me at any rate. Kinda like reserved-instance vs. spot pricing models?
I very recently (~1 week ago) subscribed to the Pro plan and was indeed surprised by how fast I reached my quota compared to, say, Codex on a similar subscription tier. The UX of Claude Code is generally really cool, which left me with a bittersweet feeling of not being able to truly explore all the possibilities: after just doing basic planning and code changes I'm already out of quota for experimenting with various ways of using subagents, testing background stuff, etc.
I remember a couple of weeks ago, when people raved about Claude Code, getting the feeling there's no way this is sustainable; they must be burning tokens like crazy if it's used as described. Guess Anthropic did the math as well, and now we're here.
The best thing about the Max plan has been that I don’t have “range anxiety” with my workflows. That frees me up to try random things on a whim and explore the outer limits of the LLM's capabilities more.
I've been hitting the limit a lot lately as well. The worst part is I try to compact things and check my limits using the / commands and can't make heads or tails how much I actually have left. It's not clear at all.
I've been using CC until I run out of credits and then switch to Cursor (my employer pays for both). I prefer Claude but I never hit any limits in Cursor.
Thanks. I don't know why, but I just couldn't find that command. I spent so much time trying to understand what /context and the other commands were showing me that I got lost in the noise.
How quickly do you hit compaction when running? Also, if you open a new CC instance and run /context, what does it show for the tools/memories/skills percentages? And that's before we look at what you're actually doing. CC will add whatever context to each prompt it thinks is necessary, so if you've got a small number of large files (vs. a large number of smaller files), at some level that will contribute to the problem as well.
Quota is basically a count of tokens, so if a new CC session starts with the context already relatively full, that could explain what's going on. Also, what language is this project in? If it's something noisy that burns through tokens fast, then even if you're using agents to preserve the context window in the main CC, those tokens still count against your quota, so you'd still be hitting it awkwardly fast.
Anecdotally, it definitely feels like in the last couple of weeks CC has become more aggressive about pulling in significantly larger chunks of an existing code base; even for simple queries I'll see it easily ramp up to 50-60k tokens of usage.
This really speaks to the need to separate the LLM you use from the coding tool that uses it. LLM makers using the SaaS model make money on the tokens you spend whether or not those tokens were needed. Tools like aider and opencode (each in their own way) use separate tooling to build a map of the codebase that lets them work with code using fewer tokens. When I see posts like this I start to understand why Anthropic now blocks opencode.
We're about to get Claude Code for work and I'm sad about it. There are more efficient ways to do the job.
When you state it like that, I now totally understand why Anthropic have a strong incentive to kick out OpenCode.
OpenCode is incentivized to make a good product that uses your token budget efficiently since it allows you to seamlessly switch between different models.
Anthropic as a model provider on the other hand, is incentivized to exhaust your token budget to keep you hooked. You'll be forced to wait when your usage limits are reached, or pay up for a higher plan if you can't wait to get your fix.
CC, specifically Opus 4.5, is an incredible tool, but Anthropic is handling its distribution the way a drug dealer would.
It's like the very first days of computers: IBM supplied both the hardware and the software, and the software did not make the most efficient use of the hardware.
Which was nothing new itself of course. Conflicts of interest didn't begin with computers, or probably even writing.
OpenCode also would be incentivized to do things like having you configure multiple providers and route requests to cheaper providers where possible.
Controlling the coding tool absolutely is a major asset, and it will be an even greater asset as the improvements in each model iteration make it matter less which specific model you're using.
I'm curious if anyone has logged the number of thinking tokens over time. My implication was the "thinking/reasoning" modes are a way for LLM providers to put their thumb on the scale for how much the service costs.
They get to see (if you haven't opted out) your context, ideas, source code, etc., and in return you give them $220 and they give you back "out of tokens".
> My implication was the "thinking/reasoning" modes are a way for LLM providers to put their thumb on the scale for how much the service costs.
It's also a way to improve performance on the things their customers care about. I'm not paying Anthropic more than I do for car insurance every month because I want to pinch ~~pennies~~ tokens; I do it because I can finally offload a ton of tedious work onto Opus 4.5 without hand-holding it and reviewing every line.
The subscription is already such a great value over paying by the token, they've got plenty of space to find the right balance.
> My implication was the "thinking/reasoning" modes are a way for LLM providers to put their thumb on the scale for how much the service costs.
I've done RL training on small local models, and there's a strong correlation between length of response and accuracy. The more they churn tokens, the better the end result gets.
I actually think that the hyper-scalers would prefer to serve shorter answers. A token generated at 1k ctx length is cheaper to serve than one at 10k context, and way way cheaper than one at 100k context.
> there's a strong correlation between length of response and accuracy
I'd need to see real numbers. I can trigger a thinking model to generate hundreds of tokens and return a 3-word response (however many tokens that is), or switch to a non-thinking model of the same family that just gives the same result. I don't necessarily doubt your experience, I just haven't had that experience tuning SD, for example, which is also transformer-based.
I'm sure there's some math reason why longer context = more accuracy, but is that intrinsic to transformer-based LLMs? That is, per your thought that the hyperscalers want shorter responses, do you think they are expending more effort to get shorter responses of equivalent accuracy, or are they trying to find some other architecture to overcome the "limitations" of the current one?
It's absolutely a work-around in part, but use sub-agents, have the top level pass in the data, and limit the tool use for the sub-agent (the front matter can specify allowed tools) so it can't read more.
(And once you've done that, also consider whether a given task can be achieved with a dumber model - I've had good luck switching some of my sub-agents to Haiku).
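A rough sketch of what that can look like, created from the shell (the agent name, description, and prompt here are made up; the frontmatter fields are the ones Claude Code's custom agents use, as far as I know):

# Sketch: a read-only sub-agent pinned to a cheaper model.
# The name, description, and prompt are hypothetical examples.
mkdir -p .claude/agents
cat > .claude/agents/log-scanner.md <<'EOF'
---
name: log-scanner
description: Summarizes errors in the log excerpt passed to it by the main agent.
tools: Read, Grep
model: haiku
---
Only work with the material the main agent passes in.
Do not read files other than the ones explicitly named in the task.
Return a short summary of errors and likely causes.
EOF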
The entire conversation is fed in as context, effectively compounding your token usage over the course of a session. Sessions are most efficient when used for one task only.
Self-hosted might be the way to go soon. I'm getting 2x Olares One boxes, each with an RTX 5090 GPU (NVIDIA 24GB VRAM), and a built-in ecosystem of AI apps, many of which should be useful, and Kubernetes + Docker will let me deploy whatever else I want. Presumably I will manage to host a good coding model and use Claude Code as the framework (or some other). There will be many good options out there soon.
As someone with 2x RTX Pro 6000 and a 512GB M3 Ultra, I have yet to find these machines usable for "agentic" tasks. Sure, they can be great chat bots, but agentic work involves huge context sent to the system. That already rules out the Mac Studio because it lacks tensor cores and it's painfully slow to process even relatively large CLAUDE.md files, let alone a big project.
The RTX setup is much faster but can only support models ≤192GB, which severely limits its capabilities: you're limited to low-quant GLM 4.7, GLM 4.7 Flash/Air, GPT-OSS 120B, etc.
I've been using local LLMs since before ChatGPT launched (GPT-J and GPT-NeoX, for those who remember), and have tried all the promising models as they launch. While things are improving faster than I expected ~3 years ago, we're still not there in terms of a 1:1 comparison with the SotA models. For "consumer" local, at least.
The best you can get today with consumer hardware is something like devstral2-small (24B), qwen-coder-30b (underwhelming), or glm-4.7-flash (promising but buggy atm). And you'll still need a beefy workstation in the ~5-10k range.
If you want open-SotA you have to get hardware worth 80-100k to run the big boys (dsv3.2, glm4.7, minimax2.1, devstral2-123b, etc). It's ok for small office setups, but out of range for most local deployments (esp considering that the workstations need lots of power if you go 8x GPUs, even with something like 8x 6000pro @ 300w).
I think this is the future as well: running locally, controlling the entire pipeline. I built acf on GitHub using Claude, among other models. You essentially configure everything as you want: models, profiles, agents, and RAG. It's free. I also built a marketplace to sell or give away these pipeline enhancements to the community. It's a project I'd wanted to do for a while, and Claude was nice enough to make it happen. It's a work in progress, but you have 100% control, locally. There's also a website for those less technical, where you can buy credits or plug in the Claude or OpenAI APIs. Read the manifesto. I'm looking for help and contributors now.
I've mostly used the Anthropic models through OpenRouter with aider. With so much buzz around Claude Code I wanted to try it out and thought a subscription might be more cost-efficient for me. I was kind of disappointed by how quickly I hit the quota limit. Claude Code gives me a lot more freedom than aider, but on the other hand I have the feeling that pure coding tasks work better through aider or Roo Code. The API version is also much, much faster than the subscription one.
Being in the same boat as you I switched to OpenCode with z.ai GLM 4.7 Pro plan and it's quite ok.
Not as smart as Opus but smart enough for my needs, and the pricing is unbeatable
Ditto. It is very, very slow, but I never hit quota limits. People on Discord are complaining like mad that it's slow even on the Pro plans. I tend to use glm-*air a lot for planning before using 4.7.
Very happy to see that I'm not the only one. My Pro subscription lasts maybe 30 minutes of the 5-hour limit. It's completely unusable, and that's why I switched to OpenCode + GLM 4.7 for my personal projects. It's not as clever as Opus 4.5, but it often gets the job done anyway.
For some comparison of difficulty, there's an old game called Squares which is very similar to yours. It does a good job of ramping the difficulty up pretty fast, but it allows the game to have fun, short gameplay loops because of the extra gameplay mechanics (i.e., you're not just moving but collecting squares too).
"It would never occur to me to watch someone else talk about or play a game online, let alone pay for the privilege"
I think that's specifically what made GiantBomb so different in the first place: people were tuning in for the personalities more than for the game news. There were already plenty of places to go for game news and updates (like IGN and Gamespot), but GB had decades of industry stories worth tuning in for. All sorts of 'behind the scenes' stories and faces would show up: Jeff finding out about the Dreamcast being cancelled on a conference call while on the toilet with food poisoning, Drew going to a Starcraft tournament in South Korea when those were still fairly new, the crew getting blind drunk at a birthday where they duct-taped whisky bottles to their hands, stories of the sheer nightmare of lugging equipment and setting up for E3 every year in Drew and Vinny's video diaries. It was a peek behind the curtain at how the industry works, with a group of very likeable people, that made it different - more than just a place to go and watch people play games.
I once heard someone say a phrase which feels relevant to this article:
"If you know nothing of what they are doing, you suspect them of doing nothing".
I've used this sentence over the years to catch myself when I've been quick to judge someone as doing nothing at their job or day-to-day, and then used it as motivation to (if possible) learn what their days actually look like. I almost always find out I just didn't know enough about what they did.
As the article points out, this goes the other way too. It doesn't matter how good you are at your job if nobody at a decision making level understands what you are doing.
>Another thing, although probably outside your control, is that I use a Firefox extension called "SoundFixer" that I use to force the youtube audio to mono
In Windows you can also go to "Ease of access audio settings" and click "Turn on mono audio". Useful for games whose positional audio gets annoying (the SF6 training room, for example).
One thing I didn't understand about the original video: why didn't they just CG in the bins instead? Why go to all the effort of CG'ing out the ball, changing the trajectory, etc.? It seems like it would have been easier to have him make three random kicks and then add bins where they landed. But maybe there's something obvious I'm missing.
Because they would then have to camera track the whole clip to place the bins, and also make sure the bins stay in the correct positions when they reappear and disappear out of frame. In contrast, they only need to camera track the few frames when he does the kick. The quick movement of the ball also makes it harder to spot CGI errors IMO. Whereas if you have to fake the bins, you also have to deal with camera zoom, exposure changes, etc. that happen throughout the clip.
> The quick movement of the ball also makes it harder to spot CGI errors IMO. Whereas if you have to fake the bins, you also have to deal with camera zoom, exposure changes, etc. that happen throughout the clip.
Way back in the day, Michael Crichton's Rising Sun talked about primitive video editing (some characters are editing security footage of a crime). Watching security camera footage over a long period is monotonous, but editing the audio is even harder to hide: you can stare at the screen and still easily overlook a visual glitch, yet a hard cut/splice in the audio will catch your notice even when you're paying only minimal attention.
Changing the ball's trajectory accurately is hard, but you're editing something in motion that is visible for only a few seconds, is in the air, and will mostly be viewed on a bad screen. To add CG bins, on the other hand, you need Beckham to make three kicks that land where the shot is looking, and more importantly you have to track those bins perfectly for the entire duration of the shot, since they're always visible. I believe that's much harder to pull off.
some is a homie and does excellent work! They also have a Patreon where you can support/guide them in the creation of new pixel art assets and fonts and stuff like that, if you're into that sort of thing. It's a cool approach, different than mine, and I'm rooting for them.
>And they managed it just fine despite working with auto-incrementing big ints.
I wonder how. I've had to do several big merges in my career, and it was always a nightmare because of all the external systems which were already referencing and storing those pre-existing ints. Sure, merging the databases is easy if you don't mind regenerating all the Id's, but it's not usually that simple.
Simplest way is to keep the identifiers from DB A and increment all the identifiers from DB B by an offset. Third parties complicate things, of course, but internally it can be pretty simple, so maybe they just didn't have too many third parties using the IDs.
They wrote a small script with the logic involved in the merging. PKs and FKs of only one database had to be incremented by an offset of max(table.pk) + safe margin.
They did this for each table.
Once this script was tested multiple times with subsets of each database, they stopped production and ran the script against it (with backup fallbacks). A small downtime window on a Sunday.
And that was it. The databases never had to pay the UUID tax, before or after.
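A bare-bones sketch of that offset trick, assuming Postgres and made-up table/column names (a real merge would also deal with sequences, uniqueness checks, and FK constraint ordering):

#!/usr/bin/env bash
# Sketch only: shift every ID in database B above the max ID in database A,
# then copy B's rows into A. Table and column names are hypothetical.
set -euo pipefail

# max(table.pk) from DB A plus a safe margin, as described above.
OFFSET=$(psql -d db_a -At -c "SELECT max(id) + 1000000 FROM orders;")

psql -d db_b <<SQL
BEGIN;
-- Shift the PK and every FK pointing at it by the same offset
-- (assumes the FK constraints are deferrable or temporarily dropped).
UPDATE order_items SET order_id = order_id + ${OFFSET};
UPDATE orders      SET id       = id       + ${OFFSET};
COMMIT;
SQL

# Then load DB B's shifted rows into DB A (data only, same schema).
pg_dump -d db_b --data-only -t orders -t order_items | psql -d db_a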
Not being able to stop the production database for a very short window once in a lifetime is another exceptionally rare business case.
I've seen architecture astronauts make their business pay an unreasonable tech-insurance premium by adding complexity just to avoid pausing production for a few minutes, when pausing would have been much cheaper.
And from my understanding, in the case I mentioned, they chose to stop production to simplify the process. But they didn't have to.
A mixture of replication plus code changes to write in two databases could also have solved the issue.
Most businesses die because they can't move fast enough, not because their production database stopped for a few minutes.
"If your architecture can't withstand life threatening solar flares, third world war, sabotaging of undersea cables and 1 billion concurrent users can you even call yourself an engineer?"
The video seems interesting, but I gotta say the constant hands-on-head "you won't BELIEVE what happens NEXT" lines and the over-the-top sound effects feel like modern-day Discovery Channel, which is kind of silly. I can't tell if it's intentionally being a bit silly and having fun, or trying way too hard to 'youtube' it up.