Right, this is a proposal that needs to be tested. I started testing it on a 30M-parameter model; then I will move to a 100M model and evaluate the generation on domain-specific assistant tasks.
> This is obviously not powerful enough to express non-linear relationships - like graph relationships.
The distance metric used is based on energy-informed graphs that encode energy relations in a distribution called taumode; see my previous paper on spectral indexing for vector databases for a complete roll-out.
Also: precomputing a sparse Laplacian for N vectors of dimension D (an NxD matrix) is far cheaper (if using `arrowspace`, from my previous paper) than computing distances on the same full dense vectors billions of times.
There are published tests that compute a Laplacian on a 300Kx384 space in 500 secs on a laptop CPU.
So it is a trade-off: potentially a few minutes of pretraining versus hours of dot products on dense matrices.
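To make the "precompute once" step concrete, here is a minimal sketch of building a sparse graph Laplacian L = D - W from N vectors. This is not the `arrowspace` API and not the taumode construction; the function name `knn_laplacian`, the k-nearest-neighbour graph, the Gaussian edge weights and the `sigma` parameter are all assumptions chosen for illustration. The neighbour search is brute force, so it only shows the shape of the result, not the timings quoted above.

```python
import math

def knn_laplacian(vectors, k=2, sigma=1.0):
    """Build a sparse graph Laplacian L = D - W from a k-nearest-neighbour graph.

    Hypothetical helper for illustration only. Returns a dict mapping
    (i, j) -> value that stores only the non-zero entries.
    """
    n = len(vectors)
    weights = {}
    for i in range(n):
        # Brute-force neighbour search; a real index would replace this step.
        dists = sorted(
            (math.dist(vectors[i], vectors[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            # Assumed Gaussian kernel: closer vectors get heavier edges.
            w = math.exp(-(d * d) / (2.0 * sigma * sigma))
            # Symmetrise: keep the edge in both directions.
            weights[(i, j)] = w
            weights[(j, i)] = w

    laplacian = {}
    degree = [0.0] * n
    for (i, j), w in weights.items():
        laplacian[(i, j)] = -w   # off-diagonal entries are -W
        degree[i] += w
    for i in range(n):
        laplacian[(i, i)] = degree[i]  # diagonal entries are the node degrees
    return laplacian

# Two small, well-separated clusters of toy 2-D "embeddings".
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
L = knn_laplacian(points, k=2)
print(f"{len(L)} stored entries out of {len(points) ** 2} dense entries")
```

The point of the sketch is that the number of stored entries grows with N·k rather than N², and that the structure is computed once and then reused.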
if you have a corpus of code snippets to train the manifold (Laplacian) on (and a good embedding model), it is definitely possible to try something like this.
It made sense to me as it is a very simple idea, I guess: causal self-attention computes QKV distances on the full vectors for Q, K and V; the topological transformer can provide the same computation using Q, a scalar K and V. Instead of [N², N², N²], [N², N, N²] is used. If generation is confirmed to be on par in terms of quality, the gains are evident.
It most likely will in terms of performance, as it uses 50% less memory (for sure it will at inference time, which is the most-used operation on web services), and because it can leverage longer T and D, provided the design is confirmed and the quality of generation is comparable to other models.
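A back-of-the-envelope check of the memory claim, under one possible reading: I assume the figure refers to the per-layer K and V storage for a sequence of T tokens at dimension D, and that "scalar K" means one scalar per token instead of a D-dimensional key. The helper `attention_cache_elements` and the example values of T and D are hypothetical.

```python
def attention_cache_elements(num_tokens, dim, scalar_keys=False):
    """Count the K and V elements stored per layer for one sequence.

    Assumption: with scalar keys, each token keeps a single key value
    instead of a key vector of length `dim`; values stay full vectors.
    """
    key_elements = num_tokens if scalar_keys else num_tokens * dim
    value_elements = num_tokens * dim
    return key_elements + value_elements

T, D = 4096, 384  # example sizes, not taken from the paper
dense = attention_cache_elements(T, D)
scalar = attention_cache_elements(T, D, scalar_keys=True)
saving = 1.0 - scalar / dense
print(f"dense K+V: {dense}, scalar K + V: {scalar}, saving: {saving:.1%}")
```

Under these assumptions the saving is 0.5 - 1/(2D), so it approaches 50% as D grows and does not depend on T.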
If this very basic assumption is correct, it means a lot of savings in electricity as the same GPUs can resolve more requests.
Thanks to everyone who has read this.
I would be glad to answer further scoped questions on the content of the post and the paper. I answered some comments that may clarify the ideas from the redesign.
The idea is to have a lot of "narrow" models that work with RAG instead of one model for all the knowledge domains, or also to distil the metadata that is currently in enterprise Knowledge Graphs.