I wonder if they've trained the model to operate with a shallower stack; e.g., the full model may be composed of 24 transformer blocks, but they've also trained it to accept embeddings at layer 8, so it can be run with just 16 transformer blocks on lower-resourced devices.
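Something like this toy sketch is what I have in mind (purely illustrative; the layer count, the entry point, and every name here are my own guesses, not anything from Gemma):

    import torch
    import torch.nn as nn

    N_LAYERS, D_MODEL, VOCAB = 24, 512, 262144
    EARLY_ENTRY_LAYER = 8  # hypothetical: model also trained to accept embeddings here

    class ToyBlock(nn.Module):
        def __init__(self):
            super().__init__()
            self.norm = nn.LayerNorm(D_MODEL)
            self.mlp = nn.Sequential(
                nn.Linear(D_MODEL, 4 * D_MODEL), nn.GELU(),
                nn.Linear(4 * D_MODEL, D_MODEL))

        def forward(self, x):
            # attention omitted to keep the sketch short; just a residual MLP
            return x + self.mlp(self.norm(x))

    class ToyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, D_MODEL)
            self.blocks = nn.ModuleList([ToyBlock() for _ in range(N_LAYERS)])

        def forward(self, token_ids, start_layer=0):
            # start_layer=0 runs all 24 blocks; start_layer=8 skips the first
            # 8 and runs only 16 -- the hypothetical low-resource mode.
            x = self.embed(token_ids)
            for block in self.blocks[start_layer:]:
                x = block(x)
            return x

    model = ToyModel()
    ids = torch.randint(0, VOCAB, (1, 16))
    full = model(ids)                                    # full 24-block run
    shallow = model(ids, start_layer=EARLY_ENTRY_LAYER)  # 16 blocks only

The catch, of course, is that the embeddings fed in at layer 8 would have to be trained to look like "layer 8 activations", which is exactly the kind of thing you'd need a dedicated per-layer embedding table for.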
Experimenters in the open source tinkering community have done the opposite (copy/pasting layers in existing models to make them deeper) and it seems to work... fine, with minimal post-training on the new, deeper model required to exceed the performance of the original model. So it's not a crazy idea.
It seems to be embedding from 262k possible vocab tokens down to 256 dims. 262,144 matches the vocab size of the existing Gemma model, so it really does seem to be an embedding of the input token directly, fed into each layer.
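Back-of-the-envelope, the shapes would look something like this (the layer count is a guess and I don't know how the table is actually laid out, so treat this as a sketch of the dimensions rather than the real implementation):

    import torch
    import torch.nn as nn

    VOCAB, N_LAYERS, PER_LAYER_DIM = 262144, 24, 256  # layer count is a guess

    # One flat table covering every (token, layer) pair. Note that at these
    # sizes this table alone is ~1.6B parameters (~6 GB in fp32), which is why
    # keeping each per-layer vector down at 256 dims matters.
    per_layer_embed = nn.Embedding(VOCAB, N_LAYERS * PER_LAYER_DIM)

    token_ids = torch.randint(0, VOCAB, (1, 16))    # (batch, seq)
    e = per_layer_embed(token_ids)                  # (1, 16, 24 * 256)
    e = e.view(1, 16, N_LAYERS, PER_LAYER_DIM)      # (1, 16, 24, 256)
    # e[:, :, k] is the 256-dim vector handed to block k, computed straight
    # from the raw token ids, with no dependence on any hidden state.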
I guess intuitively it might help the model somewhat for later layers to have direct access to the input tokens without needing to encode them in the residual stream, freeing that capacity up for something else. I'm kind of surprised no one tried this before, if the idea is that simple? It reminds me of ResNet, where the skip connections let later layers access earlier inputs directly.
Edit: As for what exactly the embedding is used for, it could still be something more clever than induction-head-type stuff. Responses in [1] suggest it might be a low-rank, data/token-dependent signal that can be "factored out"/precomputed. Another clever suggestion was that it's a per-layer, input-token-derived control/steering vector.
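If the steering-vector reading is right, I'd picture each block doing something roughly like the following, where the 256-dim per-layer embedding is projected up to the hidden size and added to the residual stream. This is pure speculation on my part, not how Gemma 3n necessarily wires it up:

    import torch
    import torch.nn as nn

    D_MODEL, PER_LAYER_DIM = 2048, 256  # hidden size is a placeholder

    class PerLayerInjection(nn.Module):
        # Speculative: project the per-layer token embedding up to the hidden
        # size and add it as a gated, token-dependent "steering" term. It's
        # low-rank in the sense that it factors through only 256 dims.
        def __init__(self):
            super().__init__()
            self.up = nn.Linear(PER_LAYER_DIM, D_MODEL, bias=False)
            self.gate = nn.Linear(PER_LAYER_DIM, D_MODEL, bias=False)

        def forward(self, hidden, per_layer_emb):
            # hidden: (batch, seq, d_model) residual stream entering this block
            # per_layer_emb: (batch, seq, 256), precomputed from raw token ids
            steer = self.up(per_layer_emb) * torch.sigmoid(self.gate(per_layer_emb))
            return hidden + steer

    inj = PerLayerInjection()
    h = torch.randn(1, 16, D_MODEL)
    pe = torch.randn(1, 16, PER_LAYER_DIM)
    h = inj(h, pe)  # the added term never looks at h, so it could be precomputed per (token, layer)

The nice property is that nothing in the added term depends on the hidden state, which is what would make the "factor it out / precompute it" interpretation work.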