Hacker News | boyter's comments

Came here to post that and you already did. Thank you!

Grep prints out every matching line. For some of the searches an LLM might run, that means a lot of noise, and it might have to make that broad search because it cannot be more specific. Targeted search can reduce the number of tokens.

I suspect this comparison is against reading the whole codebase though, as opposed to just getting the bits you need.


Interesting. I too have been working in this space, though I took a different approach. Rather than building an index, I worked on making a "smarter grep" by offering search over codebases (and any text content really) with ranking and some structural awareness of the code. Most of my time was spent dealing with performance, and as a result it runs extremely quickly.

I will have to add this as a comparison to https://github.com/boyter/cs and see what my LLMs prefer for the sort of questions I ask. It too ships with MCP, but does NOT build an index for its search. I am very curious to see how it would rank seeing as it does not do basic BM25 but a code semantic variant of it.

This seems to work better for the "how does auth work" style of queries, while cs does "authenticate --only-declarations" and then weighs results based on the content of the files, i.e. where the matches are (in code or in comments) and the overall complexity of the file.
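
To make the weighting idea concrete, here is a rough sketch in Python. It is not how cs is implemented: the weights, the naive `#` comment detection and the function names are all assumptions made up for illustration, and the file complexity signal is left out entirely.

```python
import re

# Hypothetical weights: a match in code counts more than one in a comment.
CODE_WEIGHT = 2.0
COMMENT_WEIGHT = 0.5


def score_file(text: str, term: str) -> float:
    """Score one file by where the term matches, not just how often."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    score = 0.0
    for line in text.splitlines():
        # Naive split: everything after the first '#' is treated as a comment.
        code, _, comment = line.partition("#")
        score += CODE_WEIGHT * len(pattern.findall(code))
        score += COMMENT_WEIGHT * len(pattern.findall(comment))
    return score


def rank(files: dict, term: str) -> list:
    """Return the names of matching files, best score first."""
    scored = {name: score_file(text, term) for name, text in files.items()}
    hits = [name for name, s in scored.items() if s > 0]
    return sorted(hits, key=lambda name: scored[name], reverse=True)
```

With this, a file that defines `authenticate` would rank above one that only mentions it in a comment, which is the sort of ordering the comment above describes.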

Have starred and will be watching.


I was going to share a link to this. Thank you for making `cs`. I use it both with LLMs and directly in the terminal, and despite not performing indexing it's pretty fast for my needs. Also definitely planning to try out semble.


You are welcome. Glad to hear it's working for you. I have a few ideas I am working on to improve its relevance too that I hope pan out.


Nice! Let us know if you have any feedback or results to share, would be happy to do the same.


Been working on https://searchcode.com/ again, which I bought back, albeit as a code search tool for LLMs. It solves the "should I use this library" question by allowing the LLM to inspect, search and analyse a library before integration. You can use it to compare multiple repositories before downloading. It comes with a large amount of token savings and can be really useful when wanting to learn about a codebase.

Since it does the analysis anyway, I added dossier pages as well, such as https://searchcode.com/repo/github.com/rust-lang/rust which are useful for humans and show what the system is creating.

Best part is that I get to use the tools I have built, so https://github.com/boyter/scc and https://github.com/boyter/cs to improve it which benefits anyone using those tools.


Many Muslims drink anyway. A lot of those in Iran brew wine/beer in their house.

Tobacco in Australia has been taxed to the point we have a huge black market for it now. You would have thought people would have learnt from prohibition.

You cannot police morals.


I'd be OK with that if wine had the same taste. No alcohol-free wine tastes even close, and none of them are good in their own right.

Some of the non-alcoholic beers are pretty good though and I am happy to drink them.


i cannot wait for an actually drinkable NA wine of any vintage. the last one i had was horrible.

same with the NA whiskey i had.

phony negroni, however, is pretty solid.


I reimagined https://searchcode.com/ since I realised LLMs have issues when it comes to understanding code you want to integrate. It's useful for looking through any codebase, or several, without having to clone them.

I use it when I have candidate libraries to solve a problem, or I just want to find out how things work. Most recently I pointed it at fzf and was able to pull out the case-insensitive SIMD matching it uses and speed my own projects up.

I can't find it right now, but there was a post about how ripgrep worked from someone who walked through the code finding interesting patterns and doing a write-up on it. With this I get that over any codebase I find interesting, or can even compare them.


I read this when it came out, having written similar things for searchcode.com (back when it was a spanning code search engine), and while it is interesting I have questions about,

    We routinely see rg invocations that take more than 15 seconds
The only way that works is if you are running it over repos 100-200 gigabytes in size, or they are sitting on a spinning rust HDD, or it's matching so many lines that printing is the dominant part of the runtime, and it's still over a very large codebase.
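
As a back-of-envelope check on that range, here is the arithmetic as a small Python sketch. The throughput figures are assumptions picked for illustration, not measurements of ripgrep or of any particular machine: roughly 7 to 13 GB/s for a multi-threaded scan over files already in the page cache, and roughly 0.15 GB/s for sequential reads from a spinning disk.

```python
def scan_seconds(repo_gb: float, throughput_gb_per_s: float) -> float:
    """Time to read a repo once at a sustained scan rate."""
    return repo_gb / throughput_gb_per_s


def repo_gb_for(seconds: float, throughput_gb_per_s: float) -> float:
    """Repo size that would keep a scan busy for the given time."""
    return seconds * throughput_gb_per_s


# Assumed scan rates in GB/s (illustrative only).
WARM_CACHE_LOW = 7.0
WARM_CACHE_HIGH = 13.0
SPINNING_DISK = 0.15

if __name__ == "__main__":
    for rate in (WARM_CACHE_LOW, WARM_CACHE_HIGH, SPINNING_DISK):
        print(f"{rate} GB/s for 15 s covers {repo_gb_for(15, rate):.2f} GB")
```

Under those assumed rates, 15 seconds of warm-cache scanning lands at roughly 100-200 GB, while on a spinning disk a couple of gigabytes would be enough to get there.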

Now I totally believe codebases like this exist, but surely they aren't that common? I could understand this is for a single customer though!

Where this does fall down though is having to maintain that index. That's actually why, when I was working on my own local code search tool boyter/cs on GitHub, I also just brute forced it. No index, no problems, and with desktop CPUs coming out with 200 MB of cache these days it seems increasingly like a winning approach.


Such a good read. I actually went back through it the other day to steal the idea of searching for the least common byte to speed up my search tool https://github.com/boyter/cs which, when coupled with the SIMD upper/lower search technique from fzf, cut the wall clock runtime by a third.

There was this post from Cursor https://cursor.com/blog/fast-regex-search today about building an index for agents due to them hitting a limit on ripgrep, but I'm not sure what codebase they are hitting that warrants it. Especially since they would have to be at 100-200 GB to be getting to 15s of runtime. Unless it's all matches that is.
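
For anyone unfamiliar with the least common byte trick, a minimal Python sketch follows. It is not the code from cs, ripgrep or fzf. The frequency table is a made-up stand-in for a measured one, `bytes.find` stands in for a memchr-style fast scan, and the SIMD case-insensitive part is not shown.

```python
# Hypothetical frequency ranks: higher means more common in source text.
# A real tool would use a measured 256-entry table; these values are made up.
COMMON = {ord(c): 255 for c in " etaoinsrlEA\n\t"}


def rarest_index(needle: bytes) -> int:
    """Index of the needle byte assumed to be least common in the haystack."""
    return min(range(len(needle)), key=lambda i: COMMON.get(needle[i], 0))


def find_all(haystack: bytes, needle: bytes) -> list:
    """Find every offset of needle by scanning for its rarest byte first."""
    if not needle:
        return []
    k = rarest_index(needle)
    rare = needle[k:k + 1]
    hits = []
    # Start at k, because a match cannot begin before offset 0.
    pos = haystack.find(rare, k)
    while pos != -1:
        start = pos - k
        # Only now pay for the full comparison.
        if haystack[start:start + len(needle)] == needle:
            hits.append(start)
        pos = haystack.find(rare, pos + 1)
    return hits
```

The point of the design is that the fast single-byte scan fires far less often on a rare byte than on the first byte of the needle, so fewer full comparisons are attempted.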


Yeah, that Cursor blog post is a bit iffy since they just brush over the "ripgrep is slow on large monorepos" claim, move on to techniques they used, and then completely ignore the fact that you have to build and maintain the index.

On a mid-size codebase, I fzf- and rg-ed through the code almost instantly, while watching my coworker's computer slow down to a crawl when PyCharm started reindexing the project.


I'm not into the low level minutiae but on large code bases I sometimes see a lag on the first rg:s I run and then it's fast, which I attribute to some OS level caching stuff.

Perhaps they run their software on operating or file systems that can't do it, or on hardware with different constraints than the workstation flavoured laptops I use.


The disk cache has a huge impact. However, they claim it's for multiple searches, so the files should already be in the cache.


It's pretty easy to step over those limits.

Also localhost and presumably this are good for validating your logic before you throw in roles, network and everything else that can be an issue on AWS.

Confirm it runs in this, and 99% of the time the issue when you deploy is something in the AWS config, not your logic.


>> "It's pretty easy to step over those limits."

Exactly, especially when people are starting out and don't have a clear understanding of the inner workings of the system for whatever reason. Jobs are getting harder to find nowadays, and if you make one mistake while learning, you either pay or the learning stops.

