Troublesome characters: After PUA
Multi-byte characters after PUA that cause some amount of trouble.
note: U+F900 - U+FFFF
note: Standardized (non-PUA) (displayable in all fonts)
Table of contents
Codepoint | Description | WL |
---|---|---|
U+FDD0 - U+FDEF | Noncharacters | |
U+FE00 - U+FE0F | Variation Selectors | |
U+FEFF | ZERO WIDTH NO-BREAK SPACE | |
U+FF01 - U+FF60 | Fullwidth | |
U+FF61 - U+FFEE | Halfwidth | |
U+FFFD | REPLACEMENT CHARACTER | \[UnknownGlyph] |
U+FFFE - U+FFFF | Noncharacters | |
beyond U+FFFF | non-BMP | |
Noncharacters
BMP non-characters: U+FDD0 - U+FDEF, U+FFFE, U+FFFF
https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Non-characters
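A minimal sketch of a noncharacter test, for reference. It assumes the standard set of 66 noncharacters: the BMP range U+FDD0 - U+FDEF plus the last two code points of every plane (so it also covers the non-BMP ones). The function name `is_noncharacter` is made up for this note.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # Contiguous BMP range U+FDD0..U+FDEF (32 code points).
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Last two code points of each of the 17 planes: U+xxFFFE and U+xxFFFF.
    return (cp & 0xFFFE) == 0xFFFE


if __name__ == "__main__":
    for cp in (0xFDD0, 0xFFFD, 0xFFFE, 0xFFFF, 0x1FFFE, 0x10FFFF):
        print(f"U+{cp:04X}", is_noncharacter(cp))
```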
Variation Selectors
U+FE00 - U+FE0F
tonsky?
XXX
U+FEFF ZERO WIDTH NO-BREAK SPACE
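U+FEFF is used as the byte order mark (BOM) at the start of a text stream. Anywhere else it is a ZERO WIDTH NO-BREAK SPACE: invisible in rendered text, which makes a stray one hard to spot.
https://cmake.org/Bug/view.php?id=15493
https://bugreports.qt.io/browse/QTBUG-34182
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56549
https://eslint.org/docs/rules/no-irregular-whitespace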
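A small sketch of how the invisible U+FEFF could be handled when reading text. It relies on Python's standard `utf-8-sig` codec, which drops one leading BOM; the helper name `find_stray_feff` is an assumption made for this example.

```python
import codecs

BOM = "\ufeff"


def find_stray_feff(text: str) -> list[int]:
    """Return the indexes of every U+FEFF that is not at position 0."""
    return [i for i, ch in enumerate(text) if ch == BOM and i > 0]


# A UTF-8 file that starts with a BOM and also has one in the middle.
raw = codecs.BOM_UTF8 + "a\ufeffb".encode("utf-8")

decoded_plain = raw.decode("utf-8")      # keeps the leading BOM as U+FEFF
decoded_sig = raw.decode("utf-8-sig")    # drops the leading BOM only

if __name__ == "__main__":
    print(ascii(decoded_plain))
    print(ascii(decoded_sig))
    print(find_stray_feff(decoded_sig))
```

Decoding with `utf-8-sig` only takes care of the BOM at the very start; a U+FEFF in the middle of the text still has to be found and dealt with separately.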
Fullwidth
U+FF01 - U+FF60
Fullwidth (U+FF01 - U+FF60) and halfwidth (U+FF61 - U+FFEE) characters were used in 2007 to bypass security checks. Examples with Unicode normalization:
U+FF0E is normalized to . (U+002E) in NFKC
U+FF0F is normalized to / (U+002F) in NFKC
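The two normalizations above can be checked with Python's standard `unicodedata` module. This is only an illustration: the variable names are arbitrary, and the point is that the compatibility forms (NFKC, NFKD) fold the fullwidth characters to ASCII while NFC and NFD leave them alone.

```python
import unicodedata

# FULLWIDTH FULL STOP twice, then FULLWIDTH SOLIDUS.
fullwidth = "\uff0e\uff0e\uff0f"

nfc = unicodedata.normalize("NFC", fullwidth)
nfkc = unicodedata.normalize("NFKC", fullwidth)

if __name__ == "__main__":
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        print(form, ascii(unicodedata.normalize(form, fullwidth)))
```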
U+FF0C FULLWIDTH COMMA
\:ff0c Fullwidth Comma
sometimes added accidentally in Japanese translations
related to confusables:
https://www.unicode.org/Public/security/8.0.0/confusables.txt
http://www.unicode.org/reports/tr39/#Confusable_Detection
http://www.unicode.org/Public/security/latest/confusables.txt
https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)
U+FF0E FULLWIDTH FULL STOP
U+FF0F FULLWIDTH SOLIDUS
U+FF3C FULLWIDTH REVERSE SOLIDUS
Some important characters also have “alternatives” in Unicode:
Windows directory separator, \ (U+005C): U+20E5, U+FF3C
UNIX directory separator, / (U+002F): U+2215, U+FF0F
Parent directory, .. (U+002E, U+002E): U+FF0E
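A hedged sketch of a path check that accounts for the alternatives listed above. NFKC folds the fullwidth forms (U+FF0E, U+FF0F, U+FF3C) to ASCII, but it does not touch U+2215 or U+20E5, so those are mapped by hand here. The names `EXTRA_SEPARATORS`, `fold_path_lookalikes` and `looks_like_traversal` are invented for this example, and treating U+20E5 (a combining mark) as a backslash is a deliberately conservative assumption.

```python
import unicodedata

# Lookalikes from the list above that NFKC leaves unchanged.
EXTRA_SEPARATORS = {
    "\u2215": "/",   # DIVISION SLASH
    "\u20e5": "\\",  # COMBINING REVERSE SOLIDUS OVERLAY
}


def fold_path_lookalikes(path: str) -> str:
    """Fold fullwidth forms via NFKC, then map the remaining lookalikes."""
    folded = unicodedata.normalize("NFKC", path)
    return "".join(EXTRA_SEPARATORS.get(ch, ch) for ch in folded)


def looks_like_traversal(path: str) -> bool:
    """Return True if any path component is '..' after folding."""
    folded = fold_path_lookalikes(path).replace("\\", "/")
    return ".." in folded.split("/")


if __name__ == "__main__":
    sneaky = "\uff0e\uff0e\uff0fetc"
    print(".." in sneaky)                # a naive check sees nothing
    print(looks_like_traversal(sneaky))  # the folded check does
```

The check runs on the folded string, so whatever is passed on to the file system afterwards should be that same folded string, not the original input.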
Halfwidth
U+FF61 - U+FFEE
XXX
U+FFFD REPLACEMENT CHARACTER
\[UnknownGlyph]
//
// REPLACEMENT CHARACTER
//
// This can be the result of badly encoded UTF-8
//
case 0xfffd:
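To show where U+FFFD typically comes from, here is a short sketch using Python's built-in decoder. The sample bytes are an assumption chosen for the example: "café" encoded as Latin-1, which is not valid UTF-8.

```python
REPLACEMENT = "\ufffd"

# Latin-1 bytes for "café"; a lone 0xE9 is not a valid UTF-8 sequence.
bad = b"caf\xe9"

# errors="replace" substitutes U+FFFD instead of raising an exception.
text = bad.decode("utf-8", errors="replace")

try:
    bad.decode("utf-8")  # the default, errors="strict", raises instead
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

if __name__ == "__main__":
    print(ascii(text), REPLACEMENT in text, strict_failed)
```

Finding U+FFFD in input therefore usually means some data was already lost at an earlier decoding step.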
non-BMP
note: describe the \|XXXXXX syntax
note: after U+FFFF
non-BMP non-characters
https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Non-characters
non-BMP PUA
no long names here yet
non-BMP Standardized (non-PUA)
emojis
joining characters actually cause a lot of trouble
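One illustration of the trouble with joining characters: an emoji ZWJ sequence that renders as a single glyph but consists of several code points. This sketch uses the sequence MAN, ZERO WIDTH JOINER, WOMAN, ZERO WIDTH JOINER, GIRL; the variable names are arbitrary.

```python
# U+1F468 MAN, U+200D ZWJ, U+1F469 WOMAN, U+200D ZWJ, U+1F467 GIRL
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

code_points = len(family)                           # Python counts code points
utf16_units = len(family.encode("utf-16-le")) // 2  # what UTF-16 APIs count
utf8_bytes = len(family.encode("utf-8"))

# Slicing by code point can cut the sequence after a joiner.
naive_cut = family[:2]

if __name__ == "__main__":
    print(code_points, utf16_units, utf8_bytes)
    print(ascii(naive_cut))
```

Length limits, truncation and cursor movement all give different answers depending on which of these units they count, and none of them matches the one glyph the user sees.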