
Glitch tokens should be tokenizer-specific. Gemini uses a different tokenizer from the OpenAI models.

The origins of the OpenAI glitch tokens are pretty interesting: they trained an early tokenizer on common strings in their early training data, but it turns out popular subreddits caused some weird strings to be common enough to get assigned an integer, like davidjl, a frequent poster in the https://reddit.com/r/counting subreddit. More on that here: https://simonwillison.net/2023/Jun/8/gpt-tokenizers/#glitch-...
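A rough way to picture that mechanism is the toy sketch below. It is not OpenAI's tokenizer or real BPE: build_vocab, both corpora, and the min_count threshold are made up for illustration, and the idea that the counting posts were missing from the later model training data is an assumption, not something stated above.

```python
from collections import Counter

def build_vocab(corpus, min_count):
    """Toy stand-in for tokenizer training (hypothetical, not real BPE):
    every whitespace-separated string seen at least min_count times
    gets its own integer ID."""
    counts = Counter(word for line in corpus for word in line.split())
    frequent = sorted(w for w, c in counts.items() if c >= min_count)
    return {word: idx for idx, word in enumerate(frequent)}

# Hypothetical tokenizer corpus: includes counting-thread-style posts.
tokenizer_corpus = [
    "davidjl 1001", "davidjl 1002", "davidjl 1003",
    "the cat sat", "the dog sat", "the bird sat",
]

# Hypothetical model training corpus: assumes the counting posts are absent.
training_corpus = ["the cat sat", "the dog sat", "the bird sat"]

vocab = build_vocab(tokenizer_corpus, min_count=3)

# Strings that have a token ID but never show up in the model's training text.
seen_in_training = {w for line in training_corpus for w in line.split()}
undertrained = sorted(w for w in vocab if w not in seen_in_training)

print(vocab)
print(undertrained)
```

In this toy setup the username ends up with an integer ID purely from frequency in the tokenizer corpus, yet the model would never see it during training, which is roughly the situation a glitch token is in. A different tokenizer corpus would produce a different vocabulary, so the same string need not be a glitch token elsewhere.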


