

wolverine876 on Feb 9, 2022 | on: How UTF-8 Works

> nothing in ASCII can appear anywhere else in UTF-8, and more generally that no UTF-8 character can appear as a substring of another character’s encoding

How is that defined and enforced? Very narrowly, it seems to me:

* ASCII Hyphen-minus (U+002D) has similar functions and appearance to Small Hyphen-minus (U+FE63), Fullwidth Hyphen-minus (U+FF0D), Hyphen (U+2010), Minus Sign (U+2212), Heavy Minus Sign (U+2796), En Dash (U+2013), Em Dash (U+2014), Small Em Dash (U+FE58), Horizontal Bar (U+2015), Figure Dash (U+2012). (I'm probably missing a few!)

* There are separate delta symbols for Greek and for mathematics (sorry, no more time for looking up code points).

* Very many other characters have appearances so similar that nobody could tell them apart.

So many characters apparently have, to users and to almost anyone not looking at the actual code points, identical functions and appearances.

dahfizz on Feb 9, 2022

OP (and the article) is talking about the UTF-8 encoding, i.e. how code points are turned into bytes, not about Unicode characters in general.

ASCII is itself valid UTF-8, because ASCII is a subset of UTF-8. But a multi-byte encoded code point in UTF-8 cannot be confused with ASCII, because the highest bit is set in every octet of the sequence, whereas every ASCII byte has it clear.
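To make that concrete, here is a rough Python sketch that encodes the dashes listed in the parent comment and checks their bytes. The helper names (`utf8_hex`, `all_high_bits_set`) are made up for illustration and are not from the article.

```python
# Illustrative sketch only: helper names are hypothetical.
ASCII_HYPHEN = "\u002d"

# The lookalike dashes listed in the parent comment.
LOOKALIKES = [
    "\ufe63", "\uff0d", "\u2010", "\u2212", "\u2796",
    "\u2013", "\u2014", "\ufe58", "\u2015", "\u2012",
]


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated hex."""
    return " ".join(f"{byte:02x}" for byte in ch.encode("utf-8"))


def all_high_bits_set(ch: str) -> bool:
    """True if every byte of ch's UTF-8 encoding has bit 7 set."""
    return all(byte & 0x80 for byte in ch.encode("utf-8"))


# Print the code point, its UTF-8 bytes, and the high-bit check.
for ch in [ASCII_HYPHEN] + LOOKALIKES:
    print(f"U+{ord(ch):04X}  {utf8_hex(ch):<9}  {all_high_bits_set(ch)}")
```

The ASCII hyphen encodes as the single byte `2d`, while each lookalike encodes as three bytes that are all `0x80` or above. A byte-level search for `-` therefore cannot match inside any of them, however similar they look on screen.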
