I've always had trouble understanding algorithms from descriptions like this; even pseudocode I find hard to follow. What I usually do is search for an implementation, and even one in a language I'm not familiar with is still better. When you have code you can run it, test it, and debug it; not so much with descriptions and pseudocode.
Not OP, but I think "doesn't play nicely" means it doesn't work, so you have to do it in other ways. This has been my experience as well, though it was a couple of years ago.
As LLM benchmarks go, this is not a bad take at all.
One interesting point about this approach is that it's self-balancing, so when more powerful models come out, there is no need to change it.
Author here - yes, I'm regularly adding new models to this and other TrueSkill-based benchmarks and it works well. One thing to keep in mind is the need to run multiple passes of TrueSkill with randomly ordered games, because both TrueSkill and Elo are designed to be order-sensitive, as people's skills change over time.
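For anyone curious what that looks like in practice, here's a minimal sketch using the Python trueskill package. The player names, game records, and pass count are invented for illustration, and I'm assuming simple 1v1 win/loss outcomes rather than whatever the benchmark actually records:

    import random
    import trueskill

    def rate_with_random_passes(players, games, passes=100):
        # A single sequential pass is order-sensitive, so run many
        # passes over randomly shuffled games and average the
        # resulting posterior means per player.
        totals = {p: 0.0 for p in players}
        for _ in range(passes):
            ratings = {p: trueskill.Rating() for p in players}
            shuffled = games[:]
            random.shuffle(shuffled)
            for winner, loser in shuffled:
                ratings[winner], ratings[loser] = trueskill.rate_1vs1(
                    ratings[winner], ratings[loser]
                )
            for p in players:
                totals[p] += ratings[p].mu
        return {p: totals[p] / passes for p in players}

    # Hypothetical (winner, loser) records:
    games = [("model_a", "model_b"), ("model_a", "model_c"),
             ("model_b", "model_c")]
    print(rate_with_random_passes(["model_a", "model_b", "model_c"],
                                  games, passes=50))

Averaging across shuffles washes out the order dependence that both systems deliberately bake in to track skill drift over time.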
Still can't get past "Gleam doesn't have loops". I did try to stick with it as the docs suggest, but I guess it's too steep a transition for me. I like everything else about the language.
The number of people is irrelevant. What is relevant is what each one did. If they did something illegal that is punishable by prison time, they go to prison.
Thank you. I try to keep it neat and tidy, despite the ads. I spent more time on the drag-and-drop aesthetics than anything else before I first released it. The little things really do matter, after all.