
I'd say it depends. For the total parameter count, you should just count all parameters, since that's what matters for memory requirements.
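To make the memory point concrete, here's a back-of-the-envelope sketch; the model size and byte widths are illustrative assumptions, not numbers from this thread:

  # Rough weight-memory estimate from total parameter count.
  # All concrete numbers below are illustrative assumptions.
  def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
      """GB needed just to hold the weights (ignores KV cache and activations)."""
      return total_params * bytes_per_param / 1e9

  print(weight_memory_gb(671e9, 1.0))  # ~671 GB for a 671B-param model in FP8
  print(weight_memory_gb(671e9, 2.0))  # ~1342 GB for the same model in BF16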

For activated parameters: every unembedding parameter is used at each decoding step (the full matrix is needed to compute the logits), but only a single embedding vector is looked up per token (if done right). So count accordingly, since that's what matters for memory bandwidth and therefore latency.
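A minimal sketch of that counting convention for an MoE decoder, assuming generic shape names (d_model, vocab, per-expert sizes, etc.) rather than any particular model's config:

  # Per-token activated parameters during autoregressive decoding.
  # The split into attention / routed-expert / shared-expert blocks is an
  # assumption for illustration, not a specific model's layout.
  def activated_params(vocab: int, d_model: int, n_layers: int,
                       attn_params_per_layer: int,
                       params_per_expert: int, experts_per_token: int,
                       shared_expert_params: int = 0) -> int:
      per_layer = (attn_params_per_layer
                   + experts_per_token * params_per_expert  # only the routed experts actually hit
                   + shared_expert_params)                  # always-on shared expert(s), if any
      return (d_model                  # one embedding vector looked up for the new token
              + n_layers * per_layer
              + d_model * vocab)       # full unembedding matrix used for the logits

Counted this way, the embedding contributes only d_model parameters per step while the unembedding contributes d_model * vocab, which is the asymmetry described above.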



That makes sense, thanks for the info. Here's a quick recap of recent MoE models based on that criterion:

correct activated params:

  * DeepSeek V3/R1 series
  * Kimi K2
  * GPT-OSS series

undercount activated params:

  * GLM-4.5 series

overcount activated params:

  * DeepSeek V2 series
  * Qwen3 series
  * Ernie 4.5 series
  * Hunyuan A13B



