
It’s probably more. Pretty conservatively, if the KV embedding dimension for each token is ~10K per layer across ~100 attention layers (roughly the scale of Llama 3.1 405B), that’s already 1M 16-bit floats per token, i.e. 2 MB. They have likely needed to implement some kind of KV cache compression (like DeepSeek’s) to make this feasible at all.
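For reference, a quick back-of-envelope sketch in Python under those same assumptions (~10K KV floats per layer, ~100 attention layers, 16-bit floats); the numbers are the comment’s conservative estimate, not the actual Llama 3.1 405B config:

    # Rough per-token KV cache footprint, using the assumptions above
    # (kv_dim and n_layers are illustrative, not the real model config).
    def kv_cache_bytes_per_token(kv_dim: int = 10_000,
                                 n_layers: int = 100,
                                 bytes_per_float: int = 2) -> int:
        """Bytes of KV cache stored for a single token across all layers."""
        return kv_dim * n_layers * bytes_per_float

    per_token = kv_cache_bytes_per_token()
    print(f"{per_token / 1e6:.1f} MB per token")   # ~2.0 MB

    # At a 128K-token context the uncompressed cache balloons quickly:
    context_len = 128_000
    print(f"{per_token * context_len / 1e9:.0f} GB per full context")  # ~256 GB

At that scale, a single long context would need hundreds of GB of KV cache, which is why some form of compression (or aggressive grouped-query attention) becomes necessary.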

