Multi-byte characters before PUA that cause some amount of trouble.

note: the range covered here is U+0100 - U+DFFF; the BMP Private Use Area (PUA) itself is U+E000 - U+F8FF

note: Standardized (non-PUA) (displayable in all fonts)

Table of contents

Codepoint Description WL
U+034F COMBINING GRAPHEME JOINER  
U+0430 CYRILLIC SMALL LETTER A  
U+1680 Ogham Space Mark  
U+180E Mongolian Vowel Separator  
U+2000 En Quad  
U+2001 Em Quad  
U+2002 En Space  
U+2003 Em Space  
U+2004 Three-Per-Em  
U+2005 FOUR-PER-EM SPACE \[ThickSpace]
U+2006 Six-Per-Em  
U+2007 Figure Space  
U+2008 Punctuation Space  
U+2009 THIN SPACE \[ThinSpace]
U+200A HAIR SPACE \[VeryThinSpace]
U+200B ZERO WIDTH SPACE  
U+200C ZERO WIDTH NON-JOINER  
U+200D ZERO WIDTH JOINER  
U+200E LEFT-TO-RIGHT MARK  
U+2010 HYPHEN \[Hyphen]
U+2013 EN DASH \[Dash]
U+2014 EM DASH \[LongDash]
U+2018 Open Curly Quote \[OpenCurlyQuote]
U+2019 Close Curly Quote \[CloseCurlyQuote]
U+201C Open Curly Double Quote \[OpenCurlyDoubleQuote]
U+201D Close Curly Double Quote \[CloseCurlyDoubleQuote]
U+2028 LINE SEPARATOR  
U+2029 Paragraph Separator  
U+202A LEFT-TO-RIGHT EMBEDDING  
U+202B RIGHT-TO-LEFT EMBEDDING  
U+202C POP DIRECTIONAL FORMATTING  
U+202D LEFT-TO-RIGHT OVERRIDE  
U+202E RIGHT-TO-LEFT OVERRIDE  
U+202F Narrow No-Break Space  
U+2043 HYPHEN BULLET \[SkeletonIndicator]
U+205F MEDIUM MATHEMATICAL SPACE \[MediumSpace]
U+2060 WORD JOINER \[NoBreak]
U+2061 FUNCTION APPLICATION  
U+2062 INVISIBLE TIMES \[InvisibleTimes]
U+2063 INVISIBLE SEPARATOR  
U+2064 INVISIBLE PLUS  
U+20E5 COMBINING REVERSE SOLIDUS OVERLAY  
U+2192 RIGHTWARDS ARROW  
U+2215 DIVISION SLASH  
U+2423 OPEN BOX \[SpaceIndicator]
U+29F4 RULE-DELAYED  
U+3000 IDEOGRAPHIC SPACE  
U+3001 IDEOGRAPHIC COMMA  
U+D800 - U+DBFF High surrogates  
U+DC00 - U+DFFF Low surrogates  

U+034F COMBINING GRAPHEME JOINER

No visible glyph

https://twitter.com/wilbowma/status/1383910966748803075

invisible

U+0430 CYRILLIC SMALL LETTER A

https://bugs.wolfram.com/show?number=401640
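The Cyrillic letter looks the same as Latin a in most fonts but is a different code point, so comparisons and symbol lookups fail with no visible reason. A minimal Python sketch for surfacing it; the helper name `non_ascii_report` is hypothetical and is not part of any tool mentioned in these notes:

```python
import unicodedata

def non_ascii_report(text):
    """Return (index, 'U+XXXX', name) for every non-ASCII character in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]

latin = "data"
mixed = "d\u0430ta"  # second letter is CYRILLIC SMALL LETTER A

print(latin == mixed)           # the two strings look alike but differ
print(non_ascii_report(mixed))  # points at index 1
```

This only reports non-ASCII characters; it makes no attempt at real confusable detection as described in UTS #39.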

U+1680

\u1680 - Ogham Space Mark

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+180E

\u180E - Mongolian Vowel Separator

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2000

\u2000 - En Quad

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2001

\u2001 - Em Quad

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2002

\u2002 - En Space

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2003

\u2003 - Em Space

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2004

\u2004 - Three-Per-Em

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2005 FOUR-PER-EM SPACE \[ThickSpace]

U+2005: \[ThickSpace]

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2006

\u2006 - Six-Per-Em

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2007

\u2007 - Figure Space

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2008

\u2008 - Punctuation Space

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2009 THIN SPACE \[ThinSpace]

U+2009: \[ThinSpace]

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+200A HAIR SPACE \[VeryThinSpace]

U+200a: \[VeryThinSpace]

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+200B ZERO WIDTH SPACE

U+200b: ZERO WIDTH SPACE

not the same thing as U+F360 \[InvisibleSpace]

https://eslint.org/docs/rules/no-irregular-whitespace

invisible
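The entries in these notes that cite the ESLint no-irregular-whitespace rule can be checked mechanically. The following is a rough Python sketch, not ESLint's implementation; the character class covers only the code points listed in these notes, and `find_irregular_whitespace` is a hypothetical name:

```python
import re

# Code points from the entries in these notes that cite the ESLint rule.
IRREGULAR_WS = re.compile(
    "[\u1680\u180E\u2000-\u200B\u2028\u2029\u202F\u205F\u3000]"
)

def find_irregular_whitespace(source):
    """Return (line, column, 'U+XXXX') for each hit; line and column are 1-based."""
    hits = []
    # Split on "\n" only, so U+2028 / U+2029 stay inside a line and get reported.
    for lineno, line in enumerate(source.split("\n"), start=1):
        for match in IRREGULAR_WS.finditer(line):
            hits.append((lineno, match.start() + 1, f"U+{ord(match.group()):04X}"))
    return hits

sample = "x = 1\u2009+ 2\ny\u3000= 3"
print(find_irregular_whitespace(sample))
```

Reporting the code point rather than the character itself matters here, since printing the character would again show nothing.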

U+200C ZERO WIDTH NON-JOINER

U+200c: ZERO WIDTH NON-JOINER

invisible

This character is behind a lot of the iOS crashing bugs, such as the “Telugu text bomb”:

https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/

The crashing sequence is U+0C1C U+0C4D U+0C1E U+200C U+0C3E, which is a sequence of Telugu characters: the consonant ja (జ), a virama ( ్ ), the consonant nya (ఞ), a zero-width non-joiner, and the vowel aa ( ా).
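To see where the ZERO WIDTH NON-JOINER sits in that sequence, the code points can be listed by name. This is only an inspection sketch in Python using the standard `unicodedata` module; it does not reproduce the crash:

```python
import unicodedata

# The five code points quoted from the blog post above.
crash_sequence = "\u0C1C\u0C4D\u0C1E\u200C\u0C3E"

names = [unicodedata.name(ch) for ch in crash_sequence]
for ch, name in zip(crash_sequence, names):
    print(f"U+{ord(ch):04X}  {name}")

# Every code point here encodes to three bytes in UTF-8.
utf8_length = len(crash_sequence.encode("utf-8"))
print(len(crash_sequence), utf8_length)
```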

U+200D ZERO WIDTH JOINER

U+200d: ZERO WIDTH JOINER

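ZERO WIDTH JOINER is really its own thing: it joins neighboring characters into a single rendered glyph, which is how many multi-person emoji are built.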

https://tonsky.me/blog/emoji/

invisible
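The emoji case from the linked post can be illustrated by taking a joined sequence apart. A small Python sketch, assuming the man / woman / girl family sequence as the example (chosen here for illustration, not taken from these notes):

```python
ZWJ = "\u200D"

# MAN + ZWJ + WOMAN + ZWJ + GIRL: usually rendered as one "family" glyph.
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"

parts = family.split(ZWJ)

print(len(family))                             # code points, joiners included
print(len(parts))                              # visible emoji once the joiners are removed
print([f"U+{ord(p):04X}" for p in parts])
```

Stripping U+200D as if it were stray whitespace therefore changes what the user sees, which is why it cannot be treated like the other zero-width characters.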

U+200E LEFT-TO-RIGHT MARK

Has the semantics of an invisible character of zero width.

invisible

U+2018, U+2019, U+201C, U+201D Curly Quotes

THESE ARE VERY TROUBLESOME!!

"OpenCurlyQuote" -> {PunctuationCharacter, 16^^2018, <| "ASCIIReplacements" -> {"'"} |>},
"CloseCurlyQuote" -> {PunctuationCharacter, 16^^2019, <| "ASCIIReplacements" -> {"'"} |>},
"OpenCurlyDoubleQuote" -> {PunctuationCharacter, 16^^201c, <| "ASCIIReplacements" -> {"\""} |>},
"CloseCurlyDoubleQuote" -> {PunctuationCharacter, 16^^201d, <| "ASCIIReplacements" -> {"\""} |>},

Usually introduced by copying and pasting from Word and other text editors that auto-convert straight quotes.
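The ASCII replacements listed above amount to a four-entry translation table. A minimal Python sketch of the same mapping; `straighten_quotes` is a hypothetical helper and not how the WL tooling applies "ASCIIReplacements":

```python
# Same four mappings as the ASCIIReplacements entries above.
CURLY_TO_ASCII = str.maketrans({
    "\u2018": "'",   # open curly quote
    "\u2019": "'",   # close curly quote
    "\u201C": '"',   # open curly double quote
    "\u201D": '"',   # close curly double quote
})

def straighten_quotes(text):
    """Replace curly quotes with their ASCII equivalents."""
    return text.translate(CURLY_TO_ASCII)

pasted = "Print[\u201Chello\u201D]"
print(straighten_quotes(pasted))
```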

U+2028 LINE SEPARATOR

//
// LINE SEPARATOR
//
// case 0x2028:
//     return true;

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2029

\u2029 - Paragraph Separator

https://eslint.org/docs/rules/no-irregular-whitespace

invisible
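One concrete way U+2028 and U+2029 cause trouble is that tools disagree on whether they end a line. Python itself shows the disagreement: splitting on "\n" ignores them, while `str.splitlines` treats both as line boundaries. A short sketch:

```python
text = "first\u2028second\u2029third"

by_newline = text.split("\n")      # only "\n" counts as a line break
by_splitlines = text.splitlines()  # U+2028 and U+2029 also count

print(len(by_newline))     # one "line"
print(len(by_splitlines))  # three lines
```

Line numbers reported by two tools can therefore drift apart after the first such character in a file.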

U+202A

https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html

U+202B

https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html

U+202C

https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html

U+202D

https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html

U+202E

https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html
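The five characters U+202A to U+202E are the embedding and override controls named in the table at the top of these notes. A minimal Python sketch of a scanner for them; `find_bidi_controls` is a hypothetical name and this is not the lint that rustc ships:

```python
# Names as listed in the table at the top of these notes.
BIDI_CONTROLS = {
    "\u202A": "LEFT-TO-RIGHT EMBEDDING",
    "\u202B": "RIGHT-TO-LEFT EMBEDDING",
    "\u202C": "POP DIRECTIONAL FORMATTING",
    "\u202D": "LEFT-TO-RIGHT OVERRIDE",
    "\u202E": "RIGHT-TO-LEFT OVERRIDE",
}

def find_bidi_controls(source):
    """Return (index, 'U+XXXX', name) for each bidi control found in source."""
    return [
        (i, f"U+{ord(ch):04X}", BIDI_CONTROLS[ch])
        for i, ch in enumerate(source)
        if ch in BIDI_CONTROLS
    ]

suspicious = 'x = "a\u202Eb"'
print(find_bidi_controls(suspicious))
```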

U+202F

\u202F - Narrow No-Break Space

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2043 HYPHEN BULLET

"SkeletonIndicator" -> {UninterpretableCharacter, 16^^2043, <| "ASCIIReplacements" -> {"-"} |>},

U+205F MEDIUM MATHEMATICAL SPACE

\[MediumSpace] U+205f

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2060 WORD JOINER

\[NoBreak] U+2060

//
// WORD JOINER
//
// This is the character that is recommended to use for ZERO WIDTH NON-BREAKING SPACE
// https://unicode.org/faq/utf_bom.html#bom6
//
// case 0x2060:
//     return true;

invisible

U+2061 FUNCTION APPLICATION

FUNCTION APPLICATION U+2061

invisible

not the same as U+F76D \[InvisibleApplication]

where does U+2061 (actual Function Application character) come from?

&ApplyFunction;

Peter Fleck customer

https://tex.stackexchange.com/questions/552692/unicode-character-u2061-inputenc-not-set-up-for-use-with-latex

it comes from MathML:

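In[2]:= ExportString[f[x], "MathML"]
Out[2]= <math xmlns='http://www.w3.org/1998/Math/MathML'>
 <mrow>
  <mi>f</mi>
  <mo>&#8289;</mo>
  <mrow>
   <mo>(</mo>
   <mi>x</mi>
   <mo>)</mo>
  </mrow>
 </mrow>
</math>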

10^^8289 === 16^^2061

and AssembleFunctionCall in TypesetInit.m
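The link between the MathML entity and the code point can be checked outside the Wolfram system. A small Python sketch using the standard `html` module, which knows the HTML5 named entities; it only decodes the two references and says nothing about how the FE or kernel emit them:

```python
import html

from_numeric = html.unescape("&#8289;")        # decimal character reference
from_named = html.unescape("&ApplyFunction;")  # named entity

print(f"U+{ord(from_numeric):04X}")  # code point behind the numeric reference
print(from_numeric == from_named)    # both references decode to the same character
print(int("2061", 16))               # same arithmetic as 10^^8289 === 16^^2061
```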

U+2062 INVISIBLE TIMES

INVISIBLE TIMES

\[InvisibleTimes] U+2062

invisible

U+2063 INVISIBLE SEPARATOR

INVISIBLE SEPARATOR

invisible

not the same thing as U+F765 \[InvisibleComma]

U+2064 INVISIBLE PLUS

INVISIBLE PLUS

invisible

not the same thing as U+F39E \[ImplicitPlus]
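The four invisible operators U+2061 to U+2064 all have Unicode general category Cf (format), which gives a cheap way to surface them. A minimal Python sketch; `format_characters` is a hypothetical helper, and the Cf test also catches other characters from these notes such as U+200B and U+2060, but not U+034F, which is a combining mark (Mn):

```python
import unicodedata

def format_characters(text):
    """Return (index, 'U+XXXX', name) for every character of category Cf."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

looks_like_call = "f\u2061(x)"  # FUNCTION APPLICATION hidden between f and (
print(format_characters(looks_like_call))
```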

U+20E5

Some important characters have also “alternatives” in Unicode:

    Windows directory separator, \ (U+005C): U+20E5, U+FF3C
    UNIX directory separator, / (U+002F): U+2215, U+FF0F
    Parent directory, .. (U+002E, U+002E): U+FF0E
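The alternatives listed above can be flagged with a simple lookup. A Python sketch under the assumption that only these five code points matter; `find_separator_lookalikes` is a hypothetical name. The last lines show that NFKC normalization folds the fullwidth forms to ASCII but leaves U+2215 alone, so normalization by itself is not a complete defense:

```python
import unicodedata

# Lookalike -> the ASCII character it imitates, as listed above.
LOOKALIKES = {
    "\u20E5": "\\",
    "\uFF3C": "\\",
    "\u2215": "/",
    "\uFF0F": "/",
    "\uFF0E": ".",
}

def find_separator_lookalikes(path):
    """Return (index, 'U+XXXX', ascii_equivalent) for each lookalike in path."""
    return [
        (i, f"U+{ord(ch):04X}", LOOKALIKES[ch])
        for i, ch in enumerate(path)
        if ch in LOOKALIKES
    ]

print(find_separator_lookalikes("..\uFF0Fetc"))

fullwidth_folded = unicodedata.normalize("NFKC", "\uFF0F\uFF3C\uFF0E")
division_slash_kept = unicodedata.normalize("NFKC", "\u2215")
print(fullwidth_folded, division_slash_kept == "\u2215")
```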

U+2192 RIGHTWARDS ARROW

confusable with \[Rule]

U+2215

Some important characters have also “alternatives” in Unicode:

    Windows directory separator, \ (U+005C): U+20E5, U+FF3C
    UNIX directory separator, / (U+002F): U+2215, U+FF0F
    Parent directory, .. (U+002E, U+002E): U+FF0E

U+2423 OPEN BOX

\[SpaceIndicator] U+2423

U+29F4 RULE DELAYED

confusable with \[RuleDelayed]

U+3000 IDEOGRAPHIC SPACE

"COMPATIBILITYKanjiSpace" -> {UnsupportedCharacter, 16^^3000, <||>},

\:3000 Ideographic Space

sometimes added accidentally by Japanese translations

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+3001 IDEOGRAPHIC COMMA

\:3001 Ideographic Comma

sometimes added accidentally by Japanese translations

related to confusables:

https://www.unicode.org/Public/security/8.0.0/confusables.txt

http://www.unicode.org/reports/tr39/#Confusable_Detection

http://www.unicode.org/Public/security/latest/confusables.txt

Surrogates

Encoding problems:

stray surrogates: U+D800 - U+DBFF (high), U+DC00 - U+DFFF (low)

invisible?

https://unicodebook.readthedocs.io/issues.html#strict-utf8-decoder

Surrogate characters are also invalid in UTF-8: code points in U+D800 - U+DFFF have to be rejected.
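Python's built-in UTF-8 codec is strict in this sense, which makes it a convenient way to see the rule in action. A minimal sketch; `is_strictly_valid_utf8` is a hypothetical helper name:

```python
def is_strictly_valid_utf8(data):
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# U+D800 written out in the three-byte UTF-8 pattern: not valid UTF-8.
lone_high_surrogate = b"\xED\xA0\x80"
print(is_strictly_valid_utf8(lone_high_surrogate))

# Encoding is rejected as well: a str holding a stray surrogate cannot be written out.
try:
    "\ud800".encode("utf-8")
    encode_rejected = False
except UnicodeEncodeError:
    encode_rejected = True
print(encode_rejected)
```

A character outside the BMP, by contrast, is encoded directly as four bytes and never goes through surrogates in UTF-8.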

FE - Kernel difference

troublesome characters

https://bugs.wolfram.com/show?number=172258

Ran into this while fuzz testing. These are the characters that are documented as letter-like, yet result in a RowBox[{"a", "xxx", "b"}] when typed into the FE:

\[Dash]

\[LongDash]

\[Hyphen]

For example, typing a\[Dash]b into the FE results in RowBox[{"a", "\[Dash]", "b"}]. Expected: "a\[Dash]b"