I wonder if they've trained the model to operate with a shallower stack; e.g., the full model may be composed of 24 transformer blocks, but they've also trained it to accept embeddings at layer 8, so it can be run with just 16 transformer blocks on lower-resourced devices.
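Something like this toy sketch is what I have in mind (purely illustrative; the layer count, the entry point, and every name here are my own guesses, not anything from Gemma):

    import torch
    import torch.nn as nn

    N_LAYERS, D_MODEL, VOCAB = 24, 512, 262144
    EARLY_ENTRY_LAYER = 8  # hypothetical: model also trained to accept embeddings here

    class ToyBlock(nn.Module):
        def __init__(self):
            super().__init__()
            self.norm = nn.LayerNorm(D_MODEL)
            self.mlp = nn.Sequential(
                nn.Linear(D_MODEL, 4 * D_MODEL), nn.GELU(),
                nn.Linear(4 * D_MODEL, D_MODEL))

        def forward(self, x):
            # attention omitted to keep the sketch short; just a residual MLP
            return x + self.mlp(self.norm(x))

    class ToyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, D_MODEL)
            self.blocks = nn.ModuleList([ToyBlock() for _ in range(N_LAYERS)])

        def forward(self, token_ids, start_layer=0):
            # start_layer=0 runs all 24 blocks; start_layer=8 skips the first
            # 8 and runs only 16 -- the hypothetical low-resource mode.
            x = self.embed(token_ids)
            for block in self.blocks[start_layer:]:
                x = block(x)
            return x

    model = ToyModel()
    ids = torch.randint(0, VOCAB, (1, 16))
    full = model(ids)                                    # full 24-block run
    shallow = model(ids, start_layer=EARLY_ENTRY_LAYER)  # 16 blocks only

The catch, of course, is that the embeddings fed in at layer 8 would have to be trained to look like "layer 8 activations", which is exactly the kind of thing you'd need a dedicated per-layer embedding table for.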
Experimenters in the open source tinkering community have done the opposite (copy/pasting layers in existing models to make them deeper) and it seems to work... fine, with minimal post-training on the new, deeper model required to exceed the performance of the original model. So it's not a crazy idea.
It seems to be embedding from 262k possible vocab tokens down to 256 dims. 262,144 matches the vocab size of the existing Gemma model, so it really does seem to be an embedding of the input token directly, fed into each layer.
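Back-of-the-envelope, the shapes would look something like this (the layer count is a guess and I don't know how the table is actually laid out, so treat this as a sketch of the dimensions rather than the real implementation):

    import torch
    import torch.nn as nn

    VOCAB, N_LAYERS, PER_LAYER_DIM = 262144, 24, 256  # layer count is a guess

    # One flat table covering every (token, layer) pair. Note that at these
    # sizes this table alone is ~1.6B parameters (~6 GB in fp32), which is why
    # keeping each per-layer vector down at 256 dims matters.
    per_layer_embed = nn.Embedding(VOCAB, N_LAYERS * PER_LAYER_DIM)

    token_ids = torch.randint(0, VOCAB, (1, 16))    # (batch, seq)
    e = per_layer_embed(token_ids)                  # (1, 16, 24 * 256)
    e = e.view(1, 16, N_LAYERS, PER_LAYER_DIM)      # (1, 16, 24, 256)
    # e[:, :, k] is the 256-dim vector handed to block k, computed straight
    # from the raw token ids, with no dependence on any hidden state.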
I guess intuitively it might help the model somewhat for later layers to have direct access to the input tokens without needing to encode them in the residual stream, freeing that capacity up for something else. I'm kind of surprised no one tried this before, if the idea is that simple? It reminds me of ResNet, where the skip connections let later layers access earlier inputs directly.
Edit: As for what exactly the embedding is used for, it could still be something more clever than induction-head-type stuff. Responses in [1] suggest it might be a low-rank, data/token-dependent signal that can be "factored out"/precomputed. Another clever suggestion was that it's a per-layer, input-token-derived control/steering vector.
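If the steering-vector reading is right, I'd picture each block doing something roughly like the following, where the 256-dim per-layer embedding is projected up to the hidden size and added to the residual stream. This is pure speculation on my part, not how Gemma 3n necessarily wires it up:

    import torch
    import torch.nn as nn

    D_MODEL, PER_LAYER_DIM = 2048, 256  # hidden size is a placeholder

    class PerLayerInjection(nn.Module):
        # Speculative: project the per-layer token embedding up to the hidden
        # size and add it as a gated, token-dependent "steering" term. It's
        # low-rank in the sense that it factors through only 256 dims.
        def __init__(self):
            super().__init__()
            self.up = nn.Linear(PER_LAYER_DIM, D_MODEL, bias=False)
            self.gate = nn.Linear(PER_LAYER_DIM, D_MODEL, bias=False)

        def forward(self, hidden, per_layer_emb):
            # hidden: (batch, seq, d_model) residual stream entering this block
            # per_layer_emb: (batch, seq, 256), precomputed from raw token ids
            steer = self.up(per_layer_emb) * torch.sigmoid(self.gate(per_layer_emb))
            return hidden + steer

    inj = PerLayerInjection()
    h = torch.randn(1, 16, D_MODEL)
    pe = torch.randn(1, 16, PER_LAYER_DIM)
    h = inj(h, pe)  # the added term never looks at h, so it could be precomputed per (token, layer)

The nice property is that nothing in the added term depends on the hidden state, which is what would make the "factor it out / precompute it" interpretation work.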