
So the GLM-4.5 series omits the embedding layer and the output layer when counting both the total parameters and the active parameters:

> When counting parameters, for GLM-4.5 and GLM-4.5-Air, we include the parameters of MTP layers but not word embeddings and the output layer.

This matches the calculation I did for GLM-4.5 (355B A32B):

    In [14]: 356732107008 - (775946240 * 2) # token_embd / output are 775946240 each. assume omitted
    Out[14]: 355180214528

    In [15]: 356732107008 - 339738624000 - (775946240 * 2) # parameters that are always active
    Out[15]: 15441590528

    In [16]: 339738624000 * 8 / 160 # parameters from activated experts
    Out[16]: 16986931200.0
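Summing those two lands right around the advertised 32B active parameters (a quick sanity check along the same lines):

    In [17]: 15441590528 + 16986931200  # always-active + activated experts
    Out[17]: 32428521728
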
Meanwhile, the GPT-OSS series includes both the embedding layer and the output layer when counting the total parameters, but only includes the output layer when counting the active parameters:

> We refer to the models as “120b” and “20b” for simplicity, though they technically have 116.8B and 20.9B parameters, respectively. Unembedding parameters are counted towards active, but not embeddings.

And the Qwen3 series includes both the embedding layer and the output layer when counting both the total parameters and the active parameters.

Why is there no standard way of counting? Which approach is more accurate?



I'd say it depends. For the total parameter count, you should just count all parameters, since that's what matters for memory requirements.

For activated parameters: all unembedding parameters are used in every inference step during token generation, but only one embedding vector per token is looked up (if done right). So count accordingly, since that's what matters for memory bandwidth and therefore latency.
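
To make that counting rule concrete, here's a minimal sketch with a made-up MoE config (all names and numbers below are placeholders, not any particular model's values):

    # Hypothetical MoE config; every value here is illustrative only.
    hidden_size   = 4096
    vocab_size    = 150_000
    dense_params  = 15_000_000_000    # attention, norms, dense/shared FFN: always active
    expert_params = 320_000_000_000   # all routed-expert weights combined
    n_experts     = 160
    top_k         = 8

    embedding   = vocab_size * hidden_size  # full input lookup table
    unembedding = vocab_size * hidden_size  # output projection (assuming no weight tying)

    # Total: everything that has to sit in memory.
    total = embedding + dense_params + expert_params + unembedding

    # Active per generated token: one embedding row, all always-active weights,
    # only the top-k routed experts, plus the full unembedding matrix.
    active = hidden_size + dense_params + expert_params * top_k // n_experts + unembedding

    print(f"total:  {total / 1e9:.1f}B")
    print(f"active: {active / 1e9:.1f}B")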


That makes sense, thanks for the info. Here's a quick recap of recent MoE models under that criterion (a toy comparison of the three conventions follows the list).

correct activated params:

  * DeepSeek V3/R1 series
  * Kimi K2
  * GPT-OSS series

undercount activated params:

  * GLM-4.5 series

overcount activated params:

  * DeepSeek V2 series
  * Qwen3 series
  * Ernie 4.5 series
  * Hunyuan A13B
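
Roughly what the three conventions do to the reported "active" number, on the same made-up config as the sketch above (illustrative values only):

    # Same hypothetical config as before; values are illustrative, not a real model's.
    hidden_size, vocab_size = 4096, 150_000
    dense_params, expert_params = 15_000_000_000, 320_000_000_000
    n_experts, top_k = 160, 8

    embed   = vocab_size * hidden_size            # input embedding table
    unembed = vocab_size * hidden_size            # output projection
    routed  = expert_params * top_k // n_experts  # weights of the experts actually used

    # "Correct": one embedding row per token + full unembedding counted as active
    correct = hidden_size + dense_params + routed + unembed
    # Undercount: neither embedding nor unembedding counted as active
    under = dense_params + routed
    # Overcount: full embedding table and unembedding both counted as active
    over = embed + dense_params + routed + unembed

    for name, n in [("correct", correct), ("undercount", under), ("overcount", over)]:
        print(f"{name:>10}: {n / 1e9:.2f}B")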



