At Qlty, we are going so far as to rewrite hundreds of thousands of lines of code to ensure full test coverage and end-to-end type checking (including database-generated types).
I’ll add a few more:
1. Zero thrown errors. These effectively disable the type checker and act as goto statements. We use neverthrow for Rust-like Result types in TypeScript (see the sketch after this list).
2. Fast auto-formatting and linting. An AI code review is no substitute for a deterministic result in under 100ms that guarantees consistency. The auto-formatter is set up as a post-tool-use Claude hook.
3. Side-effect-free imports and construction. You should be able to load every code file and construct an instance of every class in your app without spawning a network connection. This is harder than it sounds, and without it you run into all sorts of trouble with the rest.
4. Zero mocks and shared global state. By mocks, I mean mocking frameworks that override functions on existing types or globals. These effectively inject lies into the type checker.
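For anyone who hasn't used neverthrow, here's a minimal sketch of what item 1 looks like in practice (parsePort is just an invented example, not our code):

```typescript
import { ok, err, Result } from "neverthrow";

type ParseError = { kind: "invalid_port"; raw: string };

// Instead of throwing, return a Result the type checker can see and enforce.
function parsePort(raw: string): Result<number, ParseError> {
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    return err({ kind: "invalid_port", raw });
  }
  return ok(port);
}

// Callers are forced to handle both branches -- no invisible goto.
const message = parsePort(process.env.PORT ?? "3000").match(
  (port) => `listening on ${port}`,
  (e) => `invalid port: ${e.raw}`,
);
console.log(message);
```

The error type is part of the signature, so the compiler flags any call site that ignores the failure path.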
Shout out to tsgo, which has dramatically lowered our type-checking latency. As the tok/sec of models keeps going up, all the time is going to get bottlenecked on tool calls (read: type checking and tests).
With this approach we now have near 100% coverage with a test suite that runs in under 1,000ms.
We're at 100k LOC between the tests and code so far, running in about 500-600ms. We have a few CPU-intensive tests (e.g. cryptography) which I recently moved over to the integration test suite.
With no contention for shared resources and no async/IO, it's just function calls running on Bun (JavaScriptCore), which measures function-call latency in nanoseconds. I haven't measured this myself, but the internet seems to suggest JavaScriptCore function calls can run in 2 to 5 nanoseconds.
On a computer with 10 cores, fully concurrent, that would imply 10 billion nanoseconds of CPU time in one wall clock second. At 5 nanoseconds per function call, that would imply a theoretical maximum of 2 billion function calls per second.
Real world is not going to be anywhere close to that performance, but where is the time going otherwise?
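If you want to sanity-check the per-call number, a rough micro-benchmark on Bun looks something like the sketch below. Keep in mind JSC will likely inline the call, so treat the result as a best-case, order-of-magnitude figure:

```typescript
// call-latency.ts -- run with `bun call-latency.ts`. A rough sketch only.
function add(a: number, b: number): number {
  return a + b;
}

const iterations = 100_000_000;
let sink = 0;

// Warm up so the JIT compiles the hot path before timing starts.
for (let i = 0; i < 1_000_000; i++) sink += add(i, 1);

const start = Bun.nanoseconds();
for (let i = 0; i < iterations; i++) {
  sink += add(i, 1); // accumulate so the call isn't dead-code eliminated
}
const elapsed = Bun.nanoseconds() - start;

console.log(`${(elapsed / iterations).toFixed(2)} ns per call (sink=${sink})`);
```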
If you're like me, you're doing it to establish a greater level of trust in generated code. It feels easier to draw out the hard guard-rails and have something fill out the middle -- giving both you and the models a reference point or contract as to what's "correct".
Answering myself: maybe I feel much more urgency and motivation for this in the age of AI because the effects can be felt so much more acutely and immediately.
The most interesting parts of this to me are somewhat buried:
- Claude Code has been added to iOS
- Claude Code on the Web allows for seamless switching to Claude Code CLI
- They have open sourced an OS-native sandboxing system which limits file system and network access _without_ needing containers
However, I find the emphasis on limiting outbound network access somewhat puzzling, because the allowlists invariably include domains like gist.github.com and dozens of others that effectively act as public CMSes and would still permit exfiltration with just a bit of extra effort.
I used `sandbox-exec` previously before moving to a better solution (done right, sandboxing on macOS can be more powerful than on Linux, imo). The way `sandbox-exec` works is that all child processes inherit the same restrictions. For example, if you run `sandbox-exec $rules claude --dangerously-skip-permissions`, any commands executed by Claude through a shell will also be bound by those same rules. Since the sandbox settings are applied globally, you currently can’t grant or deny granular read/write permissions to specific tools.
Using a proxy through the `HTTP_PROXY` or `HTTPS_PROXY` environment variables has its own issues. It relies on the application respecting those variables—if it doesn’t, the connection will simply fail. Sure, in this case you are somewhat protected since all other network connection requests are dropped, but an application that doesn't respect those variables will simply not work.
You can also have some fun with `DYLD_INSERT_LIBRARIES`, but that often requires creating shims to make it work with codesigned binaries.
Exfiltration is always going to be possible; the question is whether it's difficult enough for an attacker to succeed against the defenses I've put in place. The problem is, I really want to share and help protect others, but if I write it up somewhere anybody can read, it's gonna end up in the training data.
What benefits do you see from having the agent call a CLI like this via MCP as opposed to just executing the CLI as a shell command and taking action on the stdout?
- Security/Speed: I leave "approve CLI commands" on in Cursor. This functions as a whitelist of known safe commands. It only needs to ask when running a non-standard command; 99% of the time it can use the tools. It will also verify that paths passed by the model are inside the project folder (not letting it execute on external files).
- Discoverability: For agents to work well, you need to explain which commands are available, when to use each, parameters, etc. This is a more formal version of a simple AGENTS.md, with typed parameters, tool descriptions, etc. (see the sketch after this list).
- Correctness: I find models mess up command strings or run them in the wrong folders. This is more robust than pure strings, with short tool names, type checking, schemas, etc.
- Parallel execution: MCP tools can run in parallel, CLI tools typically can't
- Sharing across the team: knowledge of which dev commands to run tends to be spread across AGENTS.md, GitHub workflows, etc. This is one central place for the agent use case.
- Prompts: MCP also supports prompts (a lesser-known MCP feature). Not really relevant to the "why not CLI" question, but it's a benefit of the tool. It provides a short description of the available prompts, then lets the model load any by name. It requires much less room in context than loading an entire /agents folder.
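To make the discoverability and correctness points concrete, here's a rough sketch of wrapping a test-runner CLI as a typed MCP tool using the official TypeScript SDK. The tool name, parameter, and path check are invented for illustration; double-check the exact API against the SDK version you install:

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { z } from "zod";

const run = promisify(execFile);
const server = new McpServer({ name: "dev-tools", version: "0.1.0" });

// The zod schema doubles as documentation and validation: the model sees
// typed parameters instead of guessing at a command string.
server.tool(
  "run_tests",
  "Run the project's unit tests, optionally filtered to a single test file",
  { file: z.string().optional().describe("Relative path of a test file") },
  async ({ file }) => {
    // Guard against paths escaping the project (the path check mentioned above).
    if (file && (file.includes("..") || file.startsWith("/"))) {
      return {
        content: [{ type: "text" as const, text: "Refusing a path outside the project" }],
        isError: true,
      };
    }
    try {
      // "npm test" is a placeholder; swap in whatever test runner the project uses.
      const { stdout } = await run("npm", ["test", ...(file ? ["--", file] : [])], {
        cwd: process.cwd(),
      });
      return { content: [{ type: "text" as const, text: stdout }] };
    } catch (e: any) {
      // execFile rejects on a non-zero exit code; surface the output to the model.
      return { content: [{ type: "text" as const, text: e.stdout ?? String(e) }], isError: true };
    }
  },
);

await server.connect(new StdioServerTransport());
```

The schema and description are what the agent sees when it lists tools, which is where the discoverability win comes from.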
This looks great! Duplication and dead code are especially tricky to catch because they are not visible in diffs.
Since you mentioned the implementation details, a couple questions come to mind:
1. Are there any research papers you found helpful or influential when building this? For example, I need to read up on using tree edit distance for code duplication.
2. How hard do you think this would be to generalize to support other programming languages?
I see you are using tree-sitter which supports many languages, but I imagine a challenge might be CFGs and dependencies.
I’ll add a Qlty plugin for this (https://github.com/qltysh/qlty) so it can be run with other code quality tools and reported back to GitHub as pass/fail commit statuses and comments. That way, the AI coding agents can take action based on the issues that pyscn finds directly in a cloud dev env.
Thank you!
1. For tree edit distance, I referred to "APTED: A Fast Tree Edit Distance Algorithm" (Pawlik & Augsten, 2016), but the algorithm runs in O(n²), so I also implemented LSH (the classic one) for large codebases. The other analyses also use classical compiler theory and techniques.
2. Should be straightforward! tree-sitter gives us parsers for 40+ languages. CFG construction is just tracking control flow, and the core algorithm stays the same.
I focused on Python first because vibe coding with Python tends to accumulate more structural issues. But the same techniques should apply to other languages as well.
Excited about the Qlty integration - that would make pyscn much more accessible and would be amazing!
Historically, this kind of test optimization was done either with static analysis to understand dependency graphs and/or runtime data collected from executing the app.
However, those methods are tightly bound to programming languages, frameworks, and interpreters so they are difficult to support across technology stacks.
This approach substitutes the intelligence of the LLM, making educated guesses about which tests to execute, to achieve the same goal: running all of the tests that could fail and none of the rest (balancing a precision/recall tradeoff). What’s especially interesting about this to me is that the same technique could be applied to any language or stack with minimal modification.
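As a rough illustration (not from the article; the names and prompt are invented), the core of the technique can be sketched as a function that asks a model to pick tests from a diff and falls back to the full suite when unsure:

```typescript
type CompleteFn = (prompt: string) => Promise<string>;

// Ask a model which tests to run for a diff; fall back to the full suite when unsure.
// Recall matters more than precision here: a missed failing test is worse than a few
// unnecessary tests being executed.
async function selectTests(
  diff: string,
  allTestFiles: string[],
  complete: CompleteFn,
): Promise<string[]> {
  const prompt = [
    "Given this diff, list the test files that could plausibly fail.",
    "Answer with one file path per line, chosen only from the list below.",
    "",
    "DIFF:",
    diff,
    "",
    "TEST FILES:",
    ...allTestFiles,
  ].join("\n");

  const answer = await complete(prompt);
  const known = new Set(allTestFiles);
  const selected = answer
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => known.has(line)); // drop anything the model hallucinated

  // If the model returns nothing usable, run everything rather than silently skip tests.
  return selected.length > 0 ? selected : allTestFiles;
}
```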
Has anyone seen LLMs in other contexts being substituted for traditional analysis to achieve language agnostic results?
This rings similar to a recent post that was on the front page about red team vs. blue team.
Before running LLM-generated code through yet more LLMs, you can run it through traditional static analysis (linters, SAST, auto-formatters). They aren’t flashy but they produce the same results 100% of the time.
Consistency is critical if you want to pass/fail a build on the results. Nobody wants a flaky code reviewer robot, just like flaky tests are the worst.
I imagine code review will evolve into a three tier pyramid:
1. Static analysis (instant, consistent) — e.g. using Qlty CLI (https://github.com/qltysh/qlty) as a Claude Code or Git hook
2. LLMs — has the advantage of being able to catch semantic issues
3. Human
We make sure commits pass each level in succession before moving on to the next.
Reading that post sent me down the path to this one. This stack order makes total sense, although in practice it's possible that 1 and 2 merge into a single product with two distinct steps.
Tier 3 is interesting too - my suspicion is that ~70% of PRs will be too minor to need human review as the models get better, but the top 30% will, because there will be opinions about what is and isn't the right way to make a complex change.
As an early June customer, this is a big disappointment. We specifically selected June over Mixpanel and Amplitude and were happy with it.
I wish there was more honesty in the post about what happened. When you boil down the details, it basically just seems to say the founders decided they would rather become (the X-hundredth) engineers at Amplitude.
Unless they were running out of money, I don’t see how they’ll have a “bigger impact” doing that instead of building a fresh take on the B2B analytics space.
I was quite surprised to see the acquisition! I think they never found PMF because they pivoted upmarket and were trying to do too many things.
Even though I'm not a customer of June, I've always rooted for them from the sidelines. They were a different kind of YC company; these days, you only see AI slop companies coming out of YC.
Seems like the founders lost hope in it and probably sold it to Amplitude to have a soft landing instead of a complete crash and burn. Since both are YC companies, I think this was an internal nudge towards a sale. But I can't say for sure; I think the founders will share what truly happened in a few years once their contracts expire. Anyways, good luck to the founders!
This closes a big feature gap. One thing that may not be obvious is that because of the way Claude Code generates commits, regular Git hooks won’t work. (At least, in most configurations.)
We’ve been using CLAUDE.md instructions to tell Claude to auto-format code with the Qlty CLI (https://github.com/qltysh/qlty), but Claude is a bit hit and miss in following them. The determinism here is a win.
It looks like the events that can be hooked are somewhat limited to start, and I wonder if they will make it easy to hook Git commit and Git push.
I never have to reformat. It picks up my indentation preferences immediately and obeys my style guide flawlessly. When I ask it to perfect my JavaDoc it is awesome.
Must be a ton of fabulous enterprise Java in the training set.
This is pretty much my experience with PHP/Laravel (on modern versions, 11/12; on legacy projects it has a hard time "remembering" that it needs to use different syntax/methods to do things).
I think at first glance we try to establish a strong bond between what’s running in the IDE with our CLI and the tool configs you have running in the cloud in Codacy. We spend a lot of time on coding standards, gates, and making all the tools we integrate (which seems to be pretty comparable to qlty; we do have our own tools right now, for example for secret scanning) run well with good standards for large teams. We also have an MCP server, and we found that tying code analysis with code agents is not trivial, so I think that’s also something different. Beyond that, DAST + pen testing, etc. We’ve become a full-on security company and that’s been our focus.
We do and we’re looking into it. It really started for us when we launched an MCP server.