At Qlty, we are going so far as to rewrite hundreds of thousands of lines of code to ensure full test coverage and end-to-end type checking (including database-generated types).
I’ll add a few more:
1. Zero thrown errors. These effectively disable the type checker and act as goto statements. We use neverthrow for Rust-like Result types in TypeScript (see the sketch after this list).
2. Fast auto-formatting and linting. An AI code review is no substitute for a deterministic result in under 100ms that guarantees consistency. The auto-formatter is set up as a post-tool-use Claude hook.
3. Side-effect-free imports and construction. You should be able to load every code file and construct an instance of every class in your app without spawning a network connection. This is harder than it sounds, and without it you run into all sorts of trouble with the rest.
4. Zero mocks and shared global state. By mocks, I mean mocking frameworks that override functions on existing types or globals. These effectively inject lies into the type checker.
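For anyone who hasn't used neverthrow, here's a minimal sketch of what item 1 looks like in practice (parsePort is just an invented example, not our code):

```typescript
import { ok, err, Result } from "neverthrow";

type ParseError = { kind: "invalid_port"; raw: string };

// Instead of throwing, return a Result the type checker can see and enforce.
function parsePort(raw: string): Result<number, ParseError> {
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    return err({ kind: "invalid_port", raw });
  }
  return ok(port);
}

// Callers are forced to handle both branches -- no invisible goto.
const message = parsePort(process.env.PORT ?? "3000").match(
  (port) => `listening on ${port}`,
  (e) => `invalid port: ${e.raw}`,
);
console.log(message);
```

The error type is part of the signature, so the compiler flags any call site that ignores the failure path.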
Shout out to tsgo, which has dramatically lowered our type-checking latency. As the tok/sec of models keeps going up, all the time is going to get bottlenecked on tool calls (read: type checking and tests).
With this approach we now have near 100% coverage with a test suite that runs in under 1,000ms.
We're at 100k LOC between the tests and code so far, running in about 500-600ms. We have a few CPU-intensive tests (e.g. cryptography) which I recently moved over to the integration test suite.
With no contention for shared resources and no async/IO, it's just function calls running on Bun (JavaScriptCore), which measures function-call latency in nanoseconds. I haven't measured this myself, but the internet seems to suggest JavaScriptCore function calls can run in 2 to 5 nanoseconds.
On a computer with 10 cores, fully concurrent, that would imply 10 billion nanoseconds of CPU time in one wall clock second. At 5 nanoseconds per function call, that would imply a theoretical maximum of 2 billion function calls per second.
Real world is not going to be anywhere close to that performance, but where is the time going otherwise?
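If you want to sanity-check the per-call number, a rough micro-benchmark on Bun looks something like the sketch below. Keep in mind JSC will likely inline the call, so treat the result as a best-case, order-of-magnitude figure:

```typescript
// call-latency.ts -- run with `bun call-latency.ts`. A rough sketch only.
function add(a: number, b: number): number {
  return a + b;
}

const iterations = 100_000_000;
let sink = 0;

// Warm up so the JIT compiles the hot path before timing starts.
for (let i = 0; i < 1_000_000; i++) sink += add(i, 1);

const start = Bun.nanoseconds();
for (let i = 0; i < iterations; i++) {
  sink += add(i, 1); // accumulate so the call isn't dead-code eliminated
}
const elapsed = Bun.nanoseconds() - start;

console.log(`${(elapsed / iterations).toFixed(2)} ns per call (sink=${sink})`);
```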
If you're like me, you're doing it to establish a greater level of trust in generated code. It feels easier to draw out the hard guard-rails and have something fill out the middle -- giving both you and the models a reference point or contract as to what's "correct".
Answering myself: maybe I feel much more urgency and motivation for this in the age of AI because the effects can be felt so much more acutely and immediately.
The most interesting parts of this to me are somewhat buried:
- Claude Code has been added to iOS
- Claude Code on the Web allows for seamless switching to Claude Code CLI
- They have open sourced an OS-native sandboxing system which limits file system and network access _without_ needing containers
However, I find the emphasis on limiting outbound network access somewhat puzzling, because the allowlists invariably include domains like gist.github.com and dozens of others that effectively act as public CMSes and would still permit exfiltration with just a bit of extra effort.
I used `sandbox-exec` previously before moving to a better solution (done right, sandboxing on macOS can be more powerful than on Linux, imo). The way `sandbox-exec` works is that all child processes inherit the same restrictions. For example, if you run `sandbox-exec $rules claude --dangerously-skip-permissions`, any commands executed by Claude through a shell will also be bound by those same rules. Since the sandbox settings are applied globally, you currently can’t grant or deny granular read/write permissions to specific tools.
Using a proxy through the `HTTP_PROXY` or `HTTPS_PROXY` environment variables has its own issues. It relies on the application respecting those variables—if it doesn’t, the connection will simply fail. Sure, in this case you are somewhat protected since all other network connection requests are dropped, but an application that doesn't respect those variables will simply not work.
You can also have some fun with `DYLD_INSERT_LIBRARIES`, but that often requires creating shims to make it work with codesigned binaries.
Exfiltration is always going to be possible; the question is whether it's difficult enough for an attacker to succeed against the defenses I've put in place. The problem is, I really want to share and help protect others, but if I write it up somewhere anybody can read, it's gonna end up in the training data.
What benefits do you see from having the agent call a CLI like this via MCP as opposed to just executing the CLI as a shell command and taking action on the stdout?
- Security/Speed: I leave "approve CLI commands" on in Cursor. This functions as a whitelist of known safe commands. It only needs to ask when running a non-standard command; 99% of the time it can use the tools. It will also verify that paths passed by the model are inside the project folder (not letting it execute on external files).
- Discoverability: For agents to work well, you need to explain which commands are available, when to use each, parameters, etc. This is a more formal version of a simple AGENTS.md, with typed parameters, tool descriptions, etc. (see the sketch after this list).
- Correctness: I find models mess up command strings or run them in the wrong folders. This is more robust than pure strings, with short tool names, type checking, schemas, etc.
- Parallel execution: MCP tools can run in parallel, CLI tools typically can't
- Sharing across the team: knowledge of which dev commands to run tends to be spread across AGENTS.md, GitHub workflows, etc. This is one central place for the agent use case.
- Prompts: MCP also supports prompts (a lesser-known MCP feature). Not really relevant to the "why not CLI" question, but it's a benefit of the tool. It provides a short description of the available prompts, then lets the model load any by name. It requires much less room in context than loading an entire /agents folder.
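To make the discoverability and correctness points concrete, here's a rough sketch of wrapping a test-runner CLI as a typed MCP tool using the official TypeScript SDK. The tool name, parameter, and path check are invented for illustration; double-check the exact API against the SDK version you install:

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { z } from "zod";

const run = promisify(execFile);
const server = new McpServer({ name: "dev-tools", version: "0.1.0" });

// The zod schema doubles as documentation and validation: the model sees
// typed parameters instead of guessing at a command string.
server.tool(
  "run_tests",
  "Run the project's unit tests, optionally filtered to a single test file",
  { file: z.string().optional().describe("Relative path of a test file") },
  async ({ file }) => {
    // Guard against paths escaping the project (the path check mentioned above).
    if (file && (file.includes("..") || file.startsWith("/"))) {
      return {
        content: [{ type: "text" as const, text: "Refusing a path outside the project" }],
        isError: true,
      };
    }
    try {
      // "npm test" is a placeholder; swap in whatever test runner the project uses.
      const { stdout } = await run("npm", ["test", ...(file ? ["--", file] : [])], {
        cwd: process.cwd(),
      });
      return { content: [{ type: "text" as const, text: stdout }] };
    } catch (e: any) {
      // execFile rejects on a non-zero exit code; surface the output to the model.
      return { content: [{ type: "text" as const, text: e.stdout ?? String(e) }], isError: true };
    }
  },
);

await server.connect(new StdioServerTransport());
```

The schema and description are what the agent sees when it lists tools, which is where the discoverability win comes from.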
This looks great! Duplication and dead code are especially tricky to catch because they are not visible in diffs.
Since you mentioned the implementation details, a couple questions come to mind:
1. Are there any research papers you found helpful or influential when building this? For example, I need to read up on using tree edit distance for code duplication.
2. How hard do you think this would be to generalize to support other programming languages?
I see you are using tree-sitter which supports many languages, but I imagine a challenge might be CFGs and dependencies.
I’ll add a Qlty plugin for this (https://github.com/qltysh/qlty) so it can be run with other code quality tools and reported back to GitHub as pass/fail commit statuses and comments. That way, the AI coding agents can take action based on the issues that pyscn finds directly in a cloud dev env.
Thank you!
1. For tree edit distance, I referred to "APTED: A Fast Tree Edit Distance Algorithm" (Pawlik & Augsten, 2016), but the algorithm runs in O(n²), so I also implemented LSH (the classic one) for large codebases. The other analyses also use classical compiler theory and techniques.
2. Should be straightforward! tree-sitter gives us parsers for 40+ languages. CFG construction is just tracking control flow, and the core algorithm stays the same.
I focused on Python first because vibe coding with Python tends to accumulate more structural issues. But the same techniques should apply to other languages as well.
Excited about the Qlty integration - that would make pyscn much more accessible and would be amazing!
Historically, this kind of test optimization was done either with static analysis to understand dependency graphs and/or runtime data collected from executing the app.
However, those methods are tightly bound to programming languages, frameworks, and interpreters so they are difficult to support across technology stacks.
This approach substitutes the intelligence of the LLM, making educated guesses about which tests to execute, to achieve the same goal: running all of the tests that could fail and none of the rest (balancing a precision/recall tradeoff). What’s especially interesting about this to me is that the same technique could be applied to any language or stack with minimal modification.
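As a rough illustration (not from the article; the names and prompt are invented), the core of the technique can be sketched as a function that asks a model to pick tests from a diff and falls back to the full suite when unsure:

```typescript
type CompleteFn = (prompt: string) => Promise<string>;

// Ask a model which tests to run for a diff; fall back to the full suite when unsure.
// Recall matters more than precision here: a missed failing test is worse than a few
// unnecessary tests being executed.
async function selectTests(
  diff: string,
  allTestFiles: string[],
  complete: CompleteFn,
): Promise<string[]> {
  const prompt = [
    "Given this diff, list the test files that could plausibly fail.",
    "Answer with one file path per line, chosen only from the list below.",
    "",
    "DIFF:",
    diff,
    "",
    "TEST FILES:",
    ...allTestFiles,
  ].join("\n");

  const answer = await complete(prompt);
  const known = new Set(allTestFiles);
  const selected = answer
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => known.has(line)); // drop anything the model hallucinated

  // If the model returns nothing usable, run everything rather than silently skip tests.
  return selected.length > 0 ? selected : allTestFiles;
}
```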
Has anyone seen LLMs in other contexts being substituted for traditional analysis to achieve language agnostic results?
This rings similar to a recent post that was on the front page about red team vs. blue team.
Before running LLM-generated code through yet more LLMs, you can run it through traditional static analysis (linters, SAST, auto-formatters). They aren’t flashy but they produce the same results 100% of the time.
Consistency is critical if you want to pass/fail a build on the results. Nobody wants a flaky code reviewer robot, just like flaky tests are the worst.
I imagine code review will evolve into a three tier pyramid:
1. Static analysis (instant, consistent) — e.g. using Qlty CLI (https://github.com/qltysh/qlty) as a Claude Code or Git hook
2. LLMs — has the advantage of being able to catch semantic issues
3. Human
We make sure commits pass each level in succession before moving on to the next.
Reading that post sent me down the path to this one. This stack order makes total sense, although in practice it's possible that 1 and 2 merge into a single product with two distinct steps.
Tier 3 is interesting too - my suspicion is that ~70% of PRs will be too minor to need human review as the models get better, but the top 30% will, because there will be opinions about what is and isn't the right way to make a complex change.
As an early June customer, this is a big disappointment. We specifically selected June over Mixpanel and Amplitude and were happy with it.
I wish there was more honesty in the post about what happened. When you boil down the details, it basically just seems to say the founders decided they would rather become (the X-hundredth) engineers at Amplitude.
Unless they were running out of money, I don’t see how they’ll have a “bigger impact” doing that instead of building a fresh take on the B2B analytics space.
I was quite surprised to see the acquisition! I think they never found PMF because they pivoted upmarket and were trying to do too many things.
Even though I'm not a customer of June, I've always rooted for them from the sidelines. They were a different kind of YC company; these days, you only see AI slop companies coming out of YC.
Seems like the founders lost hope in it and probably sold it to Amplitude to have a soft landing instead of a complete crash and burn. Since both are YC companies, I think this was an internal nudge towards a sale. But I can't say for sure; I think the founders will share what truly happened in a few years once their contracts expire. Anyways, good luck to the founders!
This closes a big feature gap. One thing that may not be obvious is that because of the way Claude Code generates commits, regular Git hooks won’t work. (At least, in most configurations.)
We’ve been using CLAUDE.md instructions to tell Claude to auto-format code with the Qlty CLI (https://github.com/qltysh/qlty), but Claude is a bit hit and miss in following them. The determinism here is a win.
It looks like the events that can be hooked are somewhat limited to start, and I wonder if they will make it easy to hook Git commit and Git push.
I never have to reformat. It picks up my indentation preferences immediately and obeys my style guide flawlessly. When I ask it to perfect my JavaDoc it is awesome.
Must be a ton of fabulous enterprise Java in the training set.
This is pretty much my experience with PHP/Laravel (on modern versions, 11/12; on legacy projects it has a hard time "remembering" that it needs to use different syntax/methods to do things).
I think at first glance we try to establish a strong bond between what’s running in the IDE with our CLI and the tool configs you have running in the cloud in Codacy. We spend a lot of time on coding standards, gates, and making all the tools we integrate (which seems to be pretty comparable to qlty; we do have our own tools right now, for example for secret scanning) run well with good standards for large teams. We also have an MCP server, and we found that tying code analysis with code agents is not trivial, so I think that’s also something different. Beyond that, DAST + pen testing, etc. We’ve become a full-on security company and that’s been our focus.
We do and we’re looking into it. It really started for us when we launched an MCP server.