Troublesome characters: After PUA

Multi-byte characters after the Private Use Area (PUA) that cause some amount of trouble.

note: U+F900 - U+FFFF

note: Standardized (non-PUA) (displayable in all fonts)

Codepoint	Description	WL
U+FDD0 - U+FDEF	Noncharacters
U+FE00 - U+FE0F	Variation Selectors
U+FEFF	ZERO WIDTH NO-BREAK SPACE
U+FF01 - U+FF60	Fullwidth
U+FF61 - U+FFEE	Halfwidth
U+FFFD	REPLACEMENT CHARACTER	`\[UnknownGlyph]`
U+FFFE - U+FFFF	Noncharacters
beyond U+FFFF	non-BMP

Noncharacters

BMP non-characters

https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Non-characters

U+FFFE U+FFFF
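
As a rough illustration of which code points count as noncharacters, here is a minimal Python sketch. The function name `is_noncharacter` is made up for this example. It relies on the Unicode definition of the 66 noncharacters: the block U+FDD0 - U+FDEF plus the last two code points of every plane.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of Unicode's 66 noncharacters (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # Contiguous run of 32 noncharacters inside the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Last two code points of each of the 17 planes: U+FFFE, U+FFFF, U+1FFFE, ...
    return (cp & 0xFFFE) == 0xFFFE


for cp in (0xFDD0, 0xFFFD, 0xFFFE, 0xFFFF, 0x1FFFE):
    print(f"U+{cp:04X}", is_noncharacter(cp))
```

The same check covers the non-BMP noncharacters, since the mask only looks at the low 16 bits.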

Variation Selectors

U+FE00 - U+FE0F

tonsky?

XXX

U+FEFF ZERO WIDTH NO-BREAK SPACE

U+FEFF

BOM

Zero Width No-Break Space

https://cmake.org/Bug/view.php?id=15493

https://bugreports.qt.io/browse/QTBUG-34182

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56549

https://eslint.org/docs/rules/no-irregular-whitespace

invisible
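
To make the BOM problem concrete, the following Python sketch shows how a leading U+FEFF survives a plain UTF-8 decode and how it can be dropped. The sample content and the helper name `strip_bom` are assumptions for illustration only. `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs

# A UTF-8 file that starts with a byte order mark (EF BB BF).
data = codecs.BOM_UTF8 + "x = 1\n".encode("utf-8")

plain = data.decode("utf-8")         # BOM is kept as an invisible U+FEFF
stripped = data.decode("utf-8-sig")  # codec consumes a leading BOM if present


def strip_bom(text: str) -> str:
    """Drop a single leading U+FEFF from already-decoded text (hypothetical helper)."""
    return text[1:] if text.startswith("\ufeff") else text


print(repr(plain))     # '\ufeffx = 1\n'
print(repr(stripped))  # 'x = 1\n'
```

A U+FEFF that appears in the middle of a file is not touched by either approach and would need a separate scan.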

Fullwidth

U+FF01—U+FF60

Fullwidth (U+FF01 - U+FF60) and halfwidth (U+FF61 - U+FFEE) characters were used in 2007 to bypass security checks. Examples of how Unicode normalization maps them:

    U+FF0E is normalized to . (U+002E) in NFKC
    U+FF0F is normalized to / (U+002F) in NFKC
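
The normalization above can be reproduced with Python's standard `unicodedata` module. This is only a sketch of the general bypass pattern, not a reconstruction of the 2007 incident. The sample path and the naive `"../" in ...` check are assumptions for illustration.

```python
import unicodedata

# Fullwidth full stop (U+FF0E) and fullwidth solidus (U+FF0F) spelling "../etc/passwd".
suspicious = "\uff0e\uff0e\uff0fetc\uff0fpasswd"

normalized = unicodedata.normalize("NFKC", suspicious)

# A check run before normalization sees no ASCII "../" ...
found_before = "../" in suspicious
# ... but the same check after NFKC normalization does.
found_after = "../" in normalized

print(found_before, found_after, normalized)  # False True ../etc/passwd
```

The point of the sketch is ordering: a filter that runs before normalization can be bypassed if a later stage normalizes the string.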

U+FF0C FULLWIDTH COMMA

\:ff0c Fullwidth Comma

Sometimes added accidentally in Japanese translations.

related to confusables:

https://www.unicode.org/Public/security/8.0.0/confusables.txt

http://www.unicode.org/reports/tr39/#Confusable_Detection

http://www.unicode.org/Public/security/latest/confusables.txt

https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)

U+FF0E

U+FF0F

U+FF3C

Some important characters also have “alternatives” in Unicode:

    Windows directory separator, \ (U+005C): U+20E5, U+FF3C
    UNIX directory separator, / (U+002F): U+2215, U+FF0F
    Parent directory, .. (U+002E, U+002E): U+FF0E
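
One way to act on the list above is to scan input for exactly those code points. The table `LOOKALIKES` and the function `find_lookalikes` below are hypothetical names for this sketch. The table contains only the code points listed above and is not a complete confusables list.

```python
# Map each listed look-alike to the ASCII character it imitates.
LOOKALIKES = {
    "\u20e5": "\\",  # COMBINING REVERSE SOLIDUS OVERLAY
    "\uff3c": "\\",  # FULLWIDTH REVERSE SOLIDUS
    "\u2215": "/",   # DIVISION SLASH
    "\uff0f": "/",   # FULLWIDTH SOLIDUS
    "\uff0e": ".",   # FULLWIDTH FULL STOP
}


def find_lookalikes(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point label, imitated ASCII char) for each hit."""
    return [
        (i, f"U+{ord(ch):04X}", LOOKALIKES[ch])
        for i, ch in enumerate(text)
        if ch in LOOKALIKES
    ]


print(find_lookalikes("a\uff0fb"))    # [(1, 'U+FF0F', '/')]
print(find_lookalikes("plain/path"))  # []
```

An explicit table is used here because not every entry in the list is a fullwidth form.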

Halfwidth

U+FF61 - U+FFEE

XXX

U+FFFD REPLACEMENT CHARACTER \[UnknownGlyph]

    //
    // REPLACEMENT CHARACTER
    //
    // This can be the result of badly encoded UTF-8
    //
    case 0xfffd:
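
A small Python sketch of how badly encoded UTF-8 ends up as U+FFFD. The sample bytes are an assumption for illustration: they are the Latin-1 encoding of "café", which is not valid UTF-8.

```python
# 0xE9 is "é" in Latin-1 but an incomplete multi-byte lead byte in UTF-8.
bad = b"caf\xe9"

# A lenient decoder substitutes U+FFFD for the undecodable byte.
text = bad.decode("utf-8", errors="replace")
has_replacement = "\ufffd" in text

# A strict decoder raises instead of substituting.
try:
    bad.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(text), has_replacement, strict_failed)  # 'caf\ufffd' True True
```

Once the substitution has happened the original byte is lost, so seeing U+FFFD in input usually points to an earlier decoding step that used the wrong encoding.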

non-BMP

note: describe \|XXXXXX syntax

note: after U+FFFF

non-BMP non-characters

https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Non-characters

non-BMP PUA

no long names here yet

non-BMP Standardized (non-PUA)

emojis

joining characters actually cause a lot of trouble
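
A minimal Python sketch of why joined emoji are awkward to handle. The chosen sequence (man, woman, girl joined by U+200D ZERO WIDTH JOINER) is an example picked for illustration. Fonts that support it render it as one glyph, but it is five code points, all but the joiners outside the BMP.

```python
# MAN + ZWJ + WOMAN + ZWJ + GIRL
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

code_points = len(family)                           # Python counts code points
utf16_units = len(family.encode("utf-16-le")) // 2  # each non-BMP char needs a surrogate pair
utf8_bytes = len(family.encode("utf-8"))            # 4 bytes per emoji, 3 per ZWJ

print(code_points, utf16_units, utf8_bytes)  # 5 8 18

# Naive slicing by code point splits the sequence into separate emoji.
first = family[:1]
print(f"U+{ord(first):04X}")  # U+1F468
```

None of the three lengths matches the single glyph a user sees, which is why truncation, cursor movement, and width calculation tend to go wrong around such sequences.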
