xfalcox's comments | Hacker News

That is a great fit for the GIF integration in Discourse.

I was able to quickly add support for it at https://github.com/discourse/discourse-gifs/pull/107

Love to see WEBP support. Do you plan on adding support for AVIF?

Also, this is used by many Discourse sites; we should talk.


For sure! Hit us up at hi@klipy.com with more details. Did you request production access via the Partner Panel at https://partner.klipy.com?

Re AVIF, do you need it for a specific purpose?


The first time I was in San Francisco and someone introduced themselves like that, going even further, it was indeed a super weird experience as a Brazilian.

We have vLLM for running text LLMs in production. What is the equivalent for this model?


I would say there isn't an equivalent. Some people will probably tell you ComfyUI - you can expose workflows via API endpoints and parameterize them. This is how e.g. Krita AI Diffusion uses a ComfyUI backend.

For various reasons, I doubt there are any large scale SaaS-style providers operating this in production today.


I'm intrigued: what are the various reasons why you think there aren't any large-scale SaaS providers operating this in production?


To be clear, I was referring to attributes of the ComfyUI software/project, not the idea of serving an image-generation API. There are several of those providers.


I don't believe there is a viable use case for large-scale AI-generated images the way there is for text... except for porn, but many orgs with SaaS capabilities wouldn't touch that.


I am partial to https://huggingface.co/Qwen/Qwen3-Embedding-0.6B nowadays.

Open weights, multilingual, 32k context.


Also Matryoshka support, and the ability to guide matches by using prefix instructions on the query.
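For anyone curious, the query side looks roughly like this — a minimal sketch assuming the sentence-transformers loader; the instruction text is just an example, and the prompt layout follows the model card's Instruct/Query convention:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Queries carry an instruction prefix that steers what "similar" means;
# documents are embedded without one. The task text here is illustrative.
task = "Given a sentence from a novel, retrieve sentences with a similar narrative function"
query = "She slipped out of the house before the household woke."
query_text = f"Instruct: {task}\nQuery: {query}"

query_emb = model.encode([query_text], normalize_embeddings=True)
doc_embs = model.encode(
    ["He left quietly at dawn.", "The harvest had failed again."],
    normalize_embeddings=True,
)

scores = query_emb @ doc_embs.T  # cosine similarity, since vectors are normalized

# Matryoshka property: the leading dimensions stand on their own, so you can
# truncate (e.g. 1024 -> 256) and re-normalize to trade quality for space.
```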

I have ~50 million sentences from English Project Gutenberg novels embedded with this.


Why would you do that? I'd love to know more.


The larger project is to allow analyzing stories for developmental editing.

Back in June and August I wrote some LLM-assisted blog posts about a few of the experiments.

They are here: sjsteiner.substack.com


What are you using those embeddings for, if you don't mind me asking? I'd love to know more about the workflow and what the prefix instructions are like.


It's junk compared to BGE-M3 on my retrieval tasks.


It's Amazon's own model. I'm baffled that someone would pick it, and even more that someone would test Llama 4 for a task when Sonnet 4.5 is already out, i.e. within the last 45 days.

Looks like they were limited by AWS Bedrock options.


> Nobody’s actually run this in production

We do at Discourse, in thousands of databases, and it's leveraged in most of the billions of page views we serve.

> Pre- vs. Post-Filtering (or: why you need to become a query planner expert)

This was fixed in version 0.8.0 via Iterative Scans (https://github.com/pgvector/pgvector?tab=readme-ov-file#iter...)
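Enabling it is basically a couple of session settings; a rough sketch (the DSN, table, and column names are hypothetical, not our schema):

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

query_embedding = np.random.rand(1024).astype(np.float32)  # stand-in query vector
category_id = 7                                            # hypothetical filter value

with psycopg.connect("dbname=forum") as conn:              # hypothetical DSN
    register_vector(conn)

    # pgvector 0.8.0+: keep scanning the HNSW graph until enough rows survive
    # the filter, instead of filtering one fixed-size candidate batch.
    conn.execute("SET hnsw.iterative_scan = relaxed_order")
    conn.execute("SET hnsw.max_scan_tuples = 20000")  # upper bound on how far it will go

    rows = conn.execute(
        """
        SELECT id, title
        FROM topics
        WHERE category_id = %s
        ORDER BY embedding <=> %s
        LIMIT 10
        """,
        (category_id, query_embedding),
    ).fetchall()
```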

> Just use a real vector database

If you are running a single service that may be an easier sell, but it's not a silver bullet.


Also worth mentioning that we use quantization extensively:

- halfvec (16-bit float) for storage
- bit (binary vectors) for indexes

Which makes the storage cost and ongoing performance good enough that we could enable this in all our hosting.
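In pgvector terms that looks roughly like this (a sketch; the table name and dimension are illustrative, and the exact expression may need adjusting for your setup):

```python
import psycopg

with psycopg.connect("dbname=forum") as conn:  # hypothetical DSN
    # Store at half precision: 2 bytes per dimension instead of 4.
    conn.execute("""
        CREATE TABLE topic_embeddings (
            topic_id  bigint PRIMARY KEY,
            embedding halfvec(1024)
        )
    """)
    # Index only the sign bits: 1 bit per dimension, compared with Hamming distance.
    conn.execute("""
        CREATE INDEX ON topic_embeddings USING hnsw
            ((binary_quantize(embedding)::bit(1024)) bit_hamming_ops)
    """)
    conn.commit()
```

Queries then over-fetch against the bit index and re-rank the shortlist using the halfvec column.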


It still amazes me that the binary trick works.

For anyone who hasn't seen it yet: it turns out many embedding vectors of e.g. 1024 floating-point numbers can be reduced to a single bit per value that records whether it's higher or lower than 0... and in this reduced form much of the embedding math still works!

This means you can e.g. filter to the top 100 using extremely memory efficient and fast bit vectors, then run a more expensive distance calculation against those top 100 with the full floating point vectors to pick the top 10.
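Here's a toy version of that two-stage trick (pure NumPy, random unit vectors standing in for real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.standard_normal((20_000, 1024)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[42] + 0.1 * rng.standard_normal(1024).astype(np.float32)
query /= np.linalg.norm(query)

# Stage 1: keep only the sign of each dimension (1 bit per value).
doc_bits = docs > 0
query_bits = query > 0

# Hamming distance to every document, then over-capture a shortlist of 100.
hamming = np.count_nonzero(doc_bits != query_bits, axis=1)
shortlist = np.argsort(hamming)[:100]

# Stage 2: exact cosine similarity only on the shortlist, keep the top 10.
scores = docs[shortlist] @ query
top10 = shortlist[np.argsort(-scores)[:10]]
```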


I was taken aback when I saw what was basically zero recall loss in the real-world task of finding related topics, doing the same thing you described: over-capture with binary embeddings, and only use the full (or half) precision on the subset.

Making the storage cost of the index 32 times smaller is the difference that lets us offer this at scale without worrying too much about the overhead.


> I was taken back when I saw what was basically zero recall loss in the real world task of finding related topics

By moving the values to a single bit, you’re lumping stuff together that was different before, so I don’t think recall loss would be expected.

Also: even if your vector is only 100-dimensional, there already are 2^100 different bit vectors. That’s over 10^30.

If your dataset isn’t gigantic and has documents that are even moderately dispersed in that space, the likelihood of having many with the same bit vector isn’t large.


And if dispersion isn't good, it would be worthwhile running the vectors through another model trained to disperse them.


Depending on your data you might also get better results by applying a random rotation to your vector before quantization.

https://ieeexplore.ieee.org/abstract/document/6296665/ (https://refbase.cvc.uab.cat/files/GLG2012b.pdf)
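A minimal sketch of the rotate-then-take-signs idea (just the random-rotation baseline; the linked paper goes further than this):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 1024
emb = rng.standard_normal((10_000, dim)).astype(np.float32)  # stand-in embeddings

# Random orthogonal rotation via QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)).astype(np.float32))

# Rotating before taking signs spreads variance more evenly across dimensions,
# so each bit is closer to a fair coin flip and carries more information.
bits = (emb @ Q) > 0
```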


Why is this amazing? It's just a 1-bit lossy compressed representation of the original information. If you have a vector in n-dimensional space, this is effectively just recording the sign of the original along each basis vector.


You can take 4096 bytes of information (1024 x 32-bit floats) and reduce that to 128 bytes (1024 bits, a 32x reduction in size!) and still get results that are about 95% as good.

I find that cool and surprising.


I'm with you; it's very satisfying to see a simple technique work well. It's impressive.


1024 bits for a hash is pretty roomy. The embedding "just" has to be well-distributed across enough of the dimensions.


Yeah, that's what I was thinking: Did we think 32 bits across each of the 1024 dimensions would be necessary? Maybe 32768 bits is adding unnecessary precision to what is ~1024 bits of information in the first place.


That's a much more interesting question. I wonder if there is a way to put a lower bound on the number of bits you could use.


Now that you mention that, I wonder if LSH would perform better with a slightly higher memory footprint.


That's where it's at. I'm using the 1600-D vectors from OpenAI models for findsight.ai, stored SuperBit-quantized. Even without fancy indexing, a full scan (1 search vector -> 5M stored vectors) takes less than 40ms. And with basic binning, it's nearly instant.
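Roughly what the brute-force part looks like (a sketch with random stand-in bits, not my actual code; np.bitwise_count needs NumPy 2.x):

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 1_000_000, 1600  # laptop-sized demo; same idea scales to 5M
# Packed sign bits: 1600 dims -> 200 bytes per vector (vs 6.4 KB as float32).
db = rng.integers(0, 256, size=(n, dim // 8), dtype=np.uint8)  # stand-in quantized DB
q = rng.integers(0, 256, size=(dim // 8,), dtype=np.uint8)     # stand-in query

# XOR + popcount = Hamming distance to every stored vector in one vectorized pass.
dist = np.bitwise_count(np.bitwise_xor(db, q)).sum(axis=1, dtype=np.uint32)
shortlist = np.argpartition(dist, 100)[:100]  # over-capture, then rerank at full precision
```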


This is at the expense of precision/recall though, isn't it?


With the quant size I'm using, recall is >95%.


Approximate nearest neighbor searches don't cost precision. Just recall.


I was going to say the same. We're using binary vectors in prod as well. Makes a huge difference in the indexes. This wasn't mentioned once in the article.


Interested to hear more about your experience here. At Halcyon, we have trillions of embeddings and found Postgres to be unsuitable at several orders of magnitude less than we currently have.

On the iterative scan side, how do you prevent this from becoming too computationally intensive with a restrictive pre-filter, or simply not working at all? We use Vespa, which means effectively doing a map-reduce across all of our nodes; the effective number of graph traversals to do is smaller, and the computational burden mostly involves scanning posting lists on a per-node basis. I imagine to do something similar in postgres, you'd need sharded tables, and complicated application logic to control what you're actually searching.

How do you deal with re-indexing and/or denormalizing metadata for filtering? Do you simply accept that it'll take hours or days?

I agree with you, however, that vector databases are not a panacea (although they do remove a huge amount of devops work, which is worth a lot!). Vespa supports filtering across parent-child relationships (like a relational database) which means we don't have to reindex a trillion things every time we want to add a new type of filter, which with a previous vector database vendor we used took us almost a week.


We host thousands of forums but each one has its own database, which means we get a sort of free sharding of the data where each instance has less than a million topics on average.

I can totally see that at trillion scale for a single shard you want a specialized, dedicated service, but that is also true for most things in tech when you get to extreme scale.


Thanks for the reply! This makes much more sense now. To preface, I think pgvector is incredibly awesome software, and I have to give huge kudos to the folks working on it. Super cool. That being said, I do think the author isn't being unreasonable in that the limitations of pgvector are very real when you're talking about indices that grow beyond millions of things, and the "just use pgvector" crowd in general doesn't have a lot of experience with scaling things beyond toy examples. Folks should take a hard look at what size they expect their indices to grow to in the near-to-medium-term future.


For sure, people are running pgvector in prod! I was more pointing at every tutorial.

Iterative scans are more of a band-aid for filtering than a solution. You will still run into issues with highly restrictive filters, and you still need to understand hnsw.ef_search and hnsw.max_scan_tuples, strict vs. relaxed ordering, etc. It's an improvement for sure, but the planner still doesn't deeply understand the cost model of filtered vector search.

There isn't a general solution to the pre- vs. post-filter problem; it comes down to having a smart planner that understands your data distribution. The question is whether you have the resources to build and tune that yourself or want to offload it to a service that's able to focus on it directly.


I feel like this is more of a general critique about technology writing; there are always a lot of “getting started” tutorials for things, but there is a dearth of “how to actually use this thing in anger” documentation.


There are also approaches to doing the filtering while traversing a vector index (not just pre/post), e.g. this paper by Microsoft explains an approach https://dl.acm.org/doi/10.1145/3543507.3583552 which pgvectorscale implements here: https://github.com/timescale/pgvectorscale?tab=readme-ov-fil...

In theory these can be more efficient than plain pre/post filtering.


pgvectorscale is not available on RDS, so this wasn't a great solution for us! But it does likely solve many of the problems with vanilla pgvector (which is what this post was about).


What are you using it for? Is it part of a hybrid search system (keyword + vector)?


In Discourse, embeddings power:

- Related Topics, a list of topics to read next, which uses embeddings of the current topic as the key to search for similar ones

- Suggesting tags and categories when composing a new topic

- Augmented search

- RAG for uploaded files


What does the RAG for uploaded files do in Discourse?

Also, when I run a Discourse search, does it really do both a regular keyword search and a vector search? How do you combine the results?

Do all Discourse instances have those features? For example, does internals.rust-lang.org use pgvector?


> What does the RAG for uploaded files do in Discourse?

You can upload files that will act as RAG files for an AI bot. The bot can also have access to forum content, plus the ability to run tools in our sandboxed JS environment, making it possible for Discourse to host AI bots.

> Also, when I run a Discourse search, does it really do both a regular keyword search and a vector search? How do you combine the results?

Yes, it does both. In the full-page search it does keyword first, then vector asynchronously, which can be toggled by the user in the UI; it's now auto-toggled when keyword search returns zero results. Results are combined using reciprocal rank fusion.
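RRF itself is only a few lines; roughly (a sketch, not our exact code, using the usual k=60 constant):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Score each id by summing 1 / (k + rank) over every ranked list it appears in."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. reciprocal_rank_fusion([keyword_topic_ids, vector_topic_ids])[:20]
```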

In the quick header search we simply append vector search to keyword search results when keyword returns less than 4 results.

> Do all Discourse instances have those features? For example, does internals.rust-lang.org use pgvector?

Yes, they all use pgvector. In our hosting, all instances default to having the vector features enabled; we run embeddings using https://github.com/huggingface/text-embeddings-inference


Thanks for the details. Also, I've always appreciated Discord's engineering blog posts. Lots of interesting stories, and it's nice to see a company discuss using Elixir at scale.


Discourse, not Discord.


Just migrated all embeddings to this same model a few weeks ago at my company, and it's a game changer. Having 32k context is a 64x increase compared with our previously used model. Plus, being natively multilingual and producing very standard 1024-dimensional vectors made it a seamless transition, even with millions of embeddings across thousands of databases.

I do recommend using https://github.com/huggingface/text-embeddings-inference for fast inference.
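Calling it looks roughly like this (a sketch assuming a local TEI server with this model loaded; host/port are illustrative):

```python
import requests

# Assumes a local text-embeddings-inference server started with
# --model-id Qwen/Qwen3-Embedding-0.6B; adjust host/port for your setup.
resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["First topic text to embed", "Second topic text to embed"]},
    timeout=30,
)
resp.raise_for_status()
embeddings = resp.json()  # one 1024-float vector per input
```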


What does it mean to generate a ~1000-element float16 array from 32k of context? Surely the embedding you get is no longer representative of the text.


Depends on your needs. You don't want 32k-token chunks for a standard RAG pipeline, that's for sure.

My use case is basically a recommendation engine, where I retrieve a list of similar forum topics based on the one currently being read. Since it's dynamic, user-generated content, topics can vary from 10 to 100k tokens. Ideally I would generate embeddings from an LLM-generated summary, but that would increase inference costs considerably at the scale I'm applying it.

Having a larger context window out of the box meant that a simple swap of embedding models greatly increased the quality of recommendations.


Having a public tokenizer is quite useful, especially for embeddings. It allows you to do the chunking locally without going to the internet.


And you don't have to re-embed everything if the provider sunsets a model.


Qwen 3 is not slow by any metric.

Which model, inference software and hardware are you running it on?

The 30B-A3B variant flies on any GPU.


You'd be surprised how often people in enterprise can be left waiting months to get an API key approved for an LLM provider.


Are you saying that it's faster for them to get the hardware to run the weights themselves? Otherwise I'm not sure what the relevancy is.


Unless they are already in the possession of such hardware (like an M3 mac, for example).


Yes, some have existing infra.


I'm having a somewhat hard time believing that a corporation where getting an API key for an LLM service is very difficult somehow has the (GPU) infrastructure already running to do the same thing themselves, unless they happen to be an ML company, but I don't think we're talking about those in this context.


No, this is very real. One reason why this can happen: a company has elaborate processes for protecting their internal data from leaking, but otherwise lets engineers do what they want with resources allocated to them.


Nah, this is definitely a real scenario. Getting access to public models requires a lot of security review, but going through Bedrock is much simpler. I may be spoiled by having worked for companies that have ML departments and developer XP departments, though.


Not sure Bedrock counts as self-hosting, though; isn't it a managed service Amazon provides?

> I may be spoiled in having worked for companies that have ML

Sounds likely, yeah. How many companies have ML departments today? DS departments seem common, but ML I'm not too sure about.


A lot of companies think they do.

