Troublesome characters: Before PUA

Multi-byte characters before PUA that cause some amount of trouble.

note: the range covered here, before the PUA, is U+0100 - U+DFFF (the BMP Private Use Area itself is U+E000 - U+F8FF)

note: Standardized (non-PUA) (displayable in all fonts)

Codepoint	Description	WL
U+034F	COMBINING GRAPHEME JOINER
U+0430	CYRILLIC SMALL LETTER A
U+1680	Ogham Space Mark
U+180E	Mongolian Vowel Separator
U+2000	En Quad
U+2001	Em Quad
U+2002	En Space
U+2003	Em Space
U+2004	Three-Per-Em
U+2005	FOUR-PER-EM-SPACE	`\[ThickSpace]`
U+2006	Six-Per-Em
U+2007	Figure Space
U+2008	Punctuation Space
U+2009	THIN SPACE	`\[ThinSpace]`
U+200A	HAIR SPACE	`\[VeryThinSpace]`
U+200B	ZERO WIDTH SPACE
U+200C	ZERO WIDTH NON-JOINER
U+200D	ZERO WIDTH JOINER
U+200E	LEFT-TO-RIGHT-MARK
U+2010	HYPHEN	`\[Hyphen]`
U+2013	EN DASH	`\[Dash]`
U+2014	EM DASH	`\[LongDash]`
U+2018	Open Curly Quote	`\[OpenCurlyQuote]`
U+2019	Close Curly Quote	`\[CloseCurlyQuote]`
U+201C	Open Curly Double Quote	`\[OpenCurlyDoubleQuote]`
U+201D	Close Curly Double Quote	`\[CloseCurlyDoubleQuote]`
U+2028	LINE SEPARATOR
U+2029	Paragraph Separator
U+202A	LEFT-TO-RIGHT EMBEDDING
U+202B	RIGHT-TO-LEFT EMBEDDING
U+202C	POP DIRECTIONAL FORMATTING
U+202D	LEFT-TO-RIGHT OVERRIDE
U+202E	RIGHT-TO-LEFT OVERRIDE
U+202F	Narrow No-Break Space
U+2043	HYPHEN BULLET	`\[SkeletonIndicator]`
U+205F	MEDIUM MATHEMATICAL SPACE	`\[MediumSpace]`
U+2060	WORD JOINER	`\[NoBreak]`
U+2061	FUNCTION APPLICATION
U+2062	INVISIBLE TIMES	`\[InvisibleTimes]`
U+2063	INVISIBLE SEPARATOR
U+2064	INVISIBLE PLUS
U+20E5	COMBINING REVERSE SOLIDUS OVERLAY
U+2192	RIGHTWARDS ARROW
U+2215	DIVISION SLASH
U+2423	OPEN BOX	`\[SpaceIndicator]`
U+29F4	RULE DELAYED
U+3000	IDEOGRAPHIC SPACE
U+3001	IDEOGRAPHIC COMMA
U+D800 - U+DBFF	High surrogates
U+DC00 - U+DFFF	Low surrogates
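The table above can be turned into a quick scanner. What follows is a minimal sketch in Python (chosen only for illustration, it is not how the Wolfram tooling does it); `TROUBLESOME` and `find_troublesome` are hypothetical names, and the set simply mirrors the code points listed in the table.

```python
import unicodedata

# Code points copied from the table above (hypothetical helper, extend as needed).
TROUBLESOME = (
    {0x034F, 0x0430, 0x1680, 0x180E, 0x2010, 0x2013, 0x2014,
     0x2018, 0x2019, 0x201C, 0x201D, 0x2043, 0x20E5, 0x2192,
     0x2215, 0x2423, 0x29F4, 0x3000, 0x3001}
    | set(range(0x2000, 0x200F))   # U+2000 .. U+200E
    | set(range(0x2028, 0x2030))   # U+2028 .. U+202F
    | set(range(0x205F, 0x2065))   # U+205F .. U+2064
    | set(range(0xD800, 0xE000))   # surrogates
)


def find_troublesome(text):
    """Return (index, 'U+XXXX', name) for every listed character in text."""
    hits = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if cp in TROUBLESOME:
            # Surrogates have no name, so fall back to a placeholder.
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((i, f"U+{cp:04X}", name))
    return hits


print(find_troublesome("a\u200bb"))
```

Reporting the index together with the code point makes invisible characters easy to locate in an editor.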

U+034F `COMBINING GRAPHEME JOINER`

No visible glyph

https://twitter.com/wilbowma/status/1383910966748803075

invisible

U+0430 `CYRILLIC SMALL LETTER A`

https://bugs.wolfram.com/show?number=401640

U+1680

\u1680 - Ogham Space Mark

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+180E

\u180E - Mongolian Vowel Separator

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2000

\u2000 - En Quad

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2001

\u2001 - Em Quad

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2002

\u2002 - En Space

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2003

\u2003 - Em Space

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2004

\u2004 - Three-Per-Em

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2005 `FOUR-PER-EM-SPACE` `\[ThickSpace]`

U+2005: \[ThickSpace]

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2006

\u2006 - Six-Per-Em

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2007

\u2007 - Figure Space

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2008

\u2008 - Punctuation Space

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2009 `THIN SPACE` `\[ThinSpace]`

U+2009: \[ThinSpace]

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+200A `HAIR SPACE` `\[VeryThinSpace]`

U+200a: \[VeryThinSpace]

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+200B `ZERO WIDTH SPACE`

U+200b: ZERO WIDTH SPACE

not the same thing as U+F360 \[InvisibleSpace]

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+200C `ZERO WIDTH NON-JOINER`

U+200c: ZERO WIDTH NON-JOINER

invisible

actually the cause of a lot of the iOS crashing bugs:

“Telugu Text bomb”

https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/

The crashing string is U+0C1C U+0C4D U+0C1E U+200C U+0C3E, which is a sequence of Telugu characters: the consonant ja (జ), a virama ( ్ ), the consonant nya (ఞ), a zero-width non-joiner, and the vowel aa ( ా ).
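To make the Telugu sequence concrete, here is a small Python sketch that spells out each code point with its Unicode name. The variable names are illustrative only; the sequence is the one quoted above.

```python
import unicodedata

# The five code points quoted above, in order.
telugu_bomb = "\u0C1C\u0C4D\u0C1E\u200C\u0C3E"

for ch in telugu_bomb:
    # Category "Cf" marks the invisible format character in the middle.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# Position of the zero-width non-joiner inside the sequence.
zwnj_index = telugu_bomb.index("\u200C")
```

Only the fourth code point is invisible on its own; the others are ordinary Telugu letters and signs.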

U+200D `ZERO WIDTH JOINER`

U+200d: ZERO WIDTH JOINER

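ZERO WIDTH JOINER is really its own thing: besides requesting joined forms in scripts, it glues several emoji into a single displayed glyph (emoji ZWJ sequences).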

https://tonsky.me/blog/emoji/

invisible
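As an illustration of why ZERO WIDTH JOINER needs separate handling, this Python sketch builds an emoji ZWJ sequence (man, woman, girl) and counts its code points. It is an example chosen for these notes, not something taken from the linked article.

```python
ZWJ = "\u200D"

# MAN + ZWJ + WOMAN + ZWJ + GIRL: usually rendered as one family glyph.
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"

code_points = len(family)          # counts code points, not displayed glyphs
members = family.split(ZWJ)        # the individual emoji that were joined

print(code_points, len(members))
```

Stripping U+200D as if it were stray whitespace would change what is displayed, which is the difference from the other invisible characters in this list.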

U+200E `LEFT-TO-RIGHT-MARK`

Has the semantics of an invisible character of zero width.

invisible

U+2018, U+2019, U+201C, U+201D Curly Quotes

THESE ARE VERY TROUBLESOME!!

"OpenCurlyQuote" -> {PunctuationCharacter, 16^^2018, <|"ASCIIReplacements" -> {"'"}|>},
"CloseCurlyQuote" -> {PunctuationCharacter, 16^^2019, <|"ASCIIReplacements" -> {"'"}|>},
"OpenCurlyDoubleQuote" -> {PunctuationCharacter, 16^^201c, <|"ASCIIReplacements" -> {"\""}|>},
"CloseCurlyDoubleQuote" -> {PunctuationCharacter, 16^^201d, <|"ASCIIReplacements" -> {"\""}|>},

These usually arrive by copying and pasting from Word and other text editors.
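The ASCII replacements listed above can be applied mechanically. A minimal Python sketch, assuming the same four mappings; `ASCII_REPLACEMENTS` and `straighten_quotes` are hypothetical names for this example.

```python
# Same four mappings as the ASCIIReplacements entries above.
ASCII_REPLACEMENTS = {
    0x2018: "'",   # open curly quote
    0x2019: "'",   # close curly quote
    0x201C: '"',   # open curly double quote
    0x201D: '"',   # close curly double quote
}


def straighten_quotes(text):
    """Replace curly quotes with their ASCII equivalents."""
    return text.translate(ASCII_REPLACEMENTS)


print(straighten_quotes("\u201cdon\u2019t\u201d"))
```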

U+2028 `LINE SEPARATOR`

    //
    // LINE SEPARATOR
    //
    // case 0x2028:
    //     return true;

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2029

\u2029 - Paragraph Separator

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+202A

https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html

U+202B

https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html

U+202C

https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html

U+202D

https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html

U+202E

https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html
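The five directional formatting characters above (U+202A through U+202E) are easy to flag in source text. This is a rough Python sketch of such a check; it covers only the code points listed in these notes, and the function name is hypothetical.

```python
# U+202A .. U+202E: embeddings, pop, and overrides listed above.
BIDI_CONTROLS = {chr(cp) for cp in range(0x202A, 0x202F)}


def bidi_control_positions(source):
    """Return (index, 'U+XXXX') for each directional formatting character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(source)
        if ch in BIDI_CONTROLS
    ]


# A string literal hiding a right-to-left override and its pop.
suspicious = 'access = "user\u202E admin\u202C"'
print(bidi_control_positions(suspicious))
```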

U+202F

\u202F - Narrow No-Break Space

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2043 `HYPHEN BULLET`

"SkeletonIndicator" -> {UninterpretableCharacter, 16^^2043, <|"ASCIIReplacements" -> {"-"}|>},

U+205F `MEDIUM MATHEMATICAL SPACE`

\[MediumSpace] U+205f

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+2060 `WORD JOINER`

\[NoBreak] U+2060

    //
    // WORD JOINER
    //
    // This is the character that is recommended to use for ZERO WIDTH NON-BREAKING SPACE
    // https://unicode.org/faq/utf_bom.html#bom6
    //
    // case 0x2060:
    //     return true;

invisible

U+2061 `FUNCTION APPLICATION`

FUNCTION APPLICATION U+2061

invisible

not the same as U+F76D \[InvisibleApplication]

where does U+2061 (actual Function Application character) come from?

⁡

Peter Fleck customer

https://tex.stackexchange.com/questions/552692/unicode-character-u2061-inputenc-not-set-up-for-use-with-latex

it comes from MathML:

In[2]:= ExportString[f[x], "MathML"]

Out[2]= MathML for f ⁡ ( x ), with U+2061 between f and the opening parenthesis written as the numeric entity &#8289;

10^^8289 === 16^^2061

and AssembleFunctionCall in TypesetInit.m
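The decimal/hex relationship noted above can be checked outside the kernel. A small Python sketch, assuming the MathML writes the character as the numeric entity `&#8289;`:

```python
import html

entity = "&#8289;"                 # decimal numeric character reference
decoded = html.unescape(entity)    # resolves the reference to one character

print(f"U+{ord(decoded):04X}")     # the code point in the usual hex notation
```

So any MathML consumer that resolves numeric entities ends up with a literal U+2061 in its text.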

U+2062 `INVISIBLE TIMES`

INVISIBLE TIMES

\[InvisibleTimes] U+2062

invisible

U+2063 `INVISIBLE SEPARATOR`

INVISIBLE SEPARATOR

invisible

not the same thing as U+F765 \[InvisibleComma]

U+2064 `INVISIBLE PLUS`

INVISIBLE PLUS

invisible

not the same thing as U+F39E \[ImplicitPlus]

U+20E5

Some important characters also have “alternatives” in Unicode:

    Windows directory separator, \ (U+005C): U+20E5, U+FF3C
    UNIX directory separator, / (U+002F): U+2215, U+FF0F
    Parent directory, .. (U+002E, U+002E): U+FF0E

U+2192 `RIGHTWARDS ARROW`

confusable with \[Rule]

U+2215

Some important characters also have “alternatives” in Unicode:

    Windows directory separator, \ (U+005C): U+20E5, U+FF3C
    UNIX directory separator, / (U+002F): U+2215, U+FF0F
    Parent directory, .. (U+002E, U+002E): U+FF0E
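One way to guard against these path-separator alternatives is an explicit lookup table. The Python sketch below uses only the code points listed above; `LOOKALIKES`, `contains_path_lookalike`, and `fold_lookalikes` are hypothetical names for this example.

```python
# Alternatives listed above, mapped to the ASCII character they imitate.
LOOKALIKES = {
    "\u20E5": "\\",   # alternative for the Windows separator
    "\uFF3C": "\\",
    "\u2215": "/",    # alternative for the UNIX separator
    "\uFF0F": "/",
    "\uFF0E": ".",    # alternative for the dot in ".."
}


def contains_path_lookalike(name):
    """True if name contains any of the listed alternatives."""
    return any(ch in LOOKALIKES for ch in name)


def fold_lookalikes(name):
    """Replace each listed alternative with its ASCII counterpart."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in name)


print(fold_lookalikes("\uFF0E\uFF0E\u2215etc"))
```

An explicit table is used here rather than Unicode normalization, since not every alternative in the list is guaranteed to normalize to its ASCII counterpart.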

U+2423 `OPEN BOX`

\[SpaceIndicator] U+2423

U+29F4 `RULE DELAYED`

confusable with \[RuleDelayed]

U+3000 `IDEOGRAPHIC SPACE`

"COMPATIBILITYKanjiSpace" -> {UnsupportedCharacter, 16^^3000, <||>},

\:3000 Ideographic Space

accidentally added by Japanese translations sometimes

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+3001 `IDEOGRAPHIC COMMA`

\:3001 Ideographic Comma

accidentally added by Japanese translations sometimes

related to confusables:

https://www.unicode.org/Public/security/8.0.0/confusables.txt

http://www.unicode.org/reports/tr39/#Confusable_Detection

http://www.unicode.org/Public/security/latest/confusables.txt
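As a rough illustration of the confusables problem, this Python sketch guesses the script of each letter from the first word of its Unicode name and flags mixed-script words. This is a crude heuristic assumed for the example, not the TR39 detection algorithm, and `scripts_used` is a hypothetical name.

```python
import unicodedata


def scripts_used(word):
    """Crude script guess: first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in word
        if ch.isalpha()
    }


latin_a = "a"
cyrillic_a = "\u0430"   # CYRILLIC SMALL LETTER A, looks the same in most fonts

print(latin_a == cyrillic_a)
print(scripts_used("v\u0430lue"))
```

A word that mixes scripts this way is a reasonable candidate for a warning, since the two letters compare unequal while looking identical.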

Surrogates

Encoding problems:

stray surrogates D800–DBFF DC00–DFFF

invisible?

https://unicodebook.readthedocs.io/issues.html#strict-utf8-decoder

Surrogate code points are also invalid in UTF-8: a strict decoder has to reject anything in U+D800-U+DFFF.
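Python's built-in UTF-8 codec is strict in this sense, so it makes a convenient demonstration. A short sketch; `is_strict_utf8` is a hypothetical helper name.

```python
def is_strict_utf8(data):
    """True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+D800 encoded as if it were an ordinary code point (three bytes).
encoded_surrogate = b"\xed\xa0\x80"

# Encoding a lone surrogate is refused as well.
try:
    "\ud800".encode("utf-8")
    encode_refused = False
except UnicodeEncodeError:
    encode_refused = True

print(is_strict_utf8(encoded_surrogate), encode_refused)
```

The `surrogatepass` error handler exists for the cases where such data has to be round-tripped anyway.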

FE - Kernel difference

troublesome characters

https://bugs.wolfram.com/show?number=172258

Ran into this while fuzz testing. These are the characters that are documented as letter-like, yet result in a RowBox[{"a","xxx","b"}] when typed into the FE:

\[Dash]

\[LongDash]

\[Hyphen]

For example, typing a\[Dash]b into the FE results in RowBox[{"a", "\[Dash]", "b"}]. Expected: "a\[Dash]b"
