Troublesome characters: Before PUA
Multi-byte characters before PUA that cause some amount of trouble.
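note: "before PUA" here means U+0100 - U+DFFF (the BMP Private Use Area itself is U+E000 - U+F8FF)
note: these are standardized (non-PUA) characters, so in principle any font can display them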
Table of contents
Codepoint | Description | WL |
---|---|---|
U+034F | COMBINING GRAPHEME JOINER | |
U+0430 | CYRILLIC SMALL LETTER A | |
U+1680 | OGHAM SPACE MARK | |
U+180E | MONGOLIAN VOWEL SEPARATOR | |
U+2000 | EN QUAD | |
U+2001 | EM QUAD | |
U+2002 | EN SPACE | |
U+2003 | EM SPACE | |
U+2004 | THREE-PER-EM SPACE | |
U+2005 | FOUR-PER-EM SPACE | \[ThickSpace] |
U+2006 | SIX-PER-EM SPACE | |
U+2007 | FIGURE SPACE | |
U+2008 | PUNCTUATION SPACE | |
U+2009 | THIN SPACE | \[ThinSpace] |
U+200A | HAIR SPACE | \[VeryThinSpace] |
U+200B | ZERO WIDTH SPACE | |
U+200C | ZERO WIDTH NON-JOINER | |
U+200D | ZERO WIDTH JOINER | |
U+200E | LEFT-TO-RIGHT MARK | |
U+2010 | HYPHEN | \[Hyphen] |
U+2013 | EN DASH | \[Dash] |
U+2014 | EM DASH | \[LongDash] |
U+2018 | LEFT SINGLE QUOTATION MARK | \[OpenCurlyQuote] |
U+2019 | RIGHT SINGLE QUOTATION MARK | \[CloseCurlyQuote] |
U+201C | LEFT DOUBLE QUOTATION MARK | \[OpenCurlyDoubleQuote] |
U+201D | RIGHT DOUBLE QUOTATION MARK | \[CloseCurlyDoubleQuote] |
U+2028 | LINE SEPARATOR | |
U+2029 | PARAGRAPH SEPARATOR | |
U+202A | LEFT-TO-RIGHT EMBEDDING | |
U+202B | RIGHT-TO-LEFT EMBEDDING | |
U+202C | POP DIRECTIONAL FORMATTING | |
U+202D | LEFT-TO-RIGHT OVERRIDE | |
U+202E | RIGHT-TO-LEFT OVERRIDE | |
U+202F | NARROW NO-BREAK SPACE | |
U+2043 | HYPHEN BULLET | \[SkeletonIndicator] |
U+205F | MEDIUM MATHEMATICAL SPACE | \[MediumSpace] |
U+2060 | WORD JOINER | \[NoBreak] |
U+2061 | FUNCTION APPLICATION | |
U+2062 | INVISIBLE TIMES | \[InvisibleTimes] |
U+2063 | INVISIBLE SEPARATOR | |
U+2064 | INVISIBLE PLUS | |
U+20E5 | COMBINING REVERSE SOLIDUS OVERLAY | |
U+2192 | RIGHTWARDS ARROW | |
U+2215 | DIVISION SLASH | |
U+2423 | OPEN BOX | \[SpaceIndicator] |
U+29F4 | RULE-DELAYED | |
U+3000 | IDEOGRAPHIC SPACE | |
U+3001 | IDEOGRAPHIC COMMA | |
U+D800 - U+DBFF | High surrogates | |
U+DC00 - U+DFFF | Low surrogates | |
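Since most of the entries in the table are invisible, a small scanner is a practical way to find them in source text. The following is a minimal Python sketch, not an existing tool: the name `find_troublesome` and the decision to lump the invisible and space-like code points from the table into one set are assumptions made for illustration.

```python
import unicodedata

# Invisible / space-like code points taken from the table above.
# Grouping them into a single set is an assumption of this sketch.
TROUBLESOME = {
    0x034F, 0x1680, 0x180E,
    *range(0x2000, 0x200F),   # U+2000 .. U+200E
    0x2028, 0x2029,
    *range(0x202A, 0x2030),   # U+202A .. U+202F
    0x205F,
    *range(0x2060, 0x2065),   # U+2060 .. U+2064
    0x3000,
}

def find_troublesome(text):
    """Return (line, column, 'U+XXXX', name) for every listed character."""
    hits = []
    # split("\n") is used on purpose: str.splitlines() would also split on
    # U+2028 and U+2029 and silently drop the characters we are looking for.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) in TROUBLESOME:
                name = unicodedata.name(ch, "<unnamed>")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits

if __name__ == "__main__":
    for hit in find_troublesome("a\u200bb\nx = 1\u3000"):
        print(hit)
```

Line and column numbers are 1-based so that they can be pasted into an editor's "go to" dialog.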
U+034F COMBINING GRAPHEME JOINER
No visible glyph
https://twitter.com/wilbowma/status/1383910966748803075
invisible
U+0430 CYRILLIC SMALL LETTER A
https://bugs.wolfram.com/show?number=401640
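The trouble with U+0430 is that it renders like Latin "a" but compares unequal to it. A minimal Python sketch of how to make the difference visible; `non_ascii_letters` is a hypothetical helper name, and flagging every non-ASCII letter is a deliberately crude assumption (it would also flag legitimate non-English identifiers).

```python
import unicodedata

def describe(ch):
    """Format a character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

def non_ascii_letters(identifier):
    """List the letters in an identifier that are outside ASCII."""
    return [describe(ch) for ch in identifier if ord(ch) > 0x7F and ch.isalpha()]

if __name__ == "__main__":
    latin = "data"
    mixed = "d\u0430ta"          # second letter is CYRILLIC SMALL LETTER A
    print(latin == mixed)        # the two strings look alike but differ
    print(non_ascii_letters(mixed))
```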
U+1680
\u1680 - Ogham Space Mark
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+180E
\u180E - Mongolian Vowel Separator
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+2000
\u2000 - En Quad
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+2001
\u2001 - Em Quad
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+2002
\u2002 - En Space
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+2003
\u2003 - Em Space
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+2004
\u2004 - Three-Per-Em
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+2005 FOUR-PER-EM SPACE
\[ThickSpace]
U+2005: \[ThickSpace]
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+2006
\u2006 - Six-Per-Em
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+2007
\u2007 - Figure Space
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+2008
\u2008 - Punctuation Space
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+2009 THIN SPACE
\[ThinSpace]
U+2009: \[ThinSpace]
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+200A HAIR SPACE
\[VeryThinSpace]
U+200a: \[VeryThinSpace]
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+200B ZERO WIDTH SPACE
U+200b: ZERO WIDTH SPACE
not the same thing as U+F360 \[InvisibleSpace]
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+200C ZERO WIDTH NON-JOINER
U+200c: ZERO WIDTH NON-JOINER
invisible
actually the cause of a lot of the iOS crashing bugs ("Telugu text bomb"):
https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/
The crashing string is U+0C1C U+0C4D U+0C1E U+200C U+0C3E, which is a sequence of Telugu characters: the consonant ja (జ), a virama ( ్ ), the consonant nya (ఞ), a zero-width non-joiner, and the vowel sign aa ( ా ).
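Because the zero-width non-joiner in that sequence is invisible, it helps to dump the string code point by code point. A minimal Python sketch; `code_points` is a hypothetical helper name, and the string is built from escapes so the invisible character cannot get lost in copy/paste.

```python
import unicodedata

# The sequence quoted above, written with escapes rather than literal characters.
SEQUENCE = "\u0C1C\u0C4D\u0C1E\u200C\u0C3E"

def code_points(text):
    """Return ('U+XXXX', name, general category) for each code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
    ]

if __name__ == "__main__":
    for entry in code_points(SEQUENCE):
        print(entry)
```

The general category is included because format characters such as U+200C report `Cf`, which is a quick way to spot them in a dump.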
U+200D ZERO WIDTH JOINER
U+200d: ZERO WIDTH JOINER
ZERO WIDTH JOINER is really its own thing: among other uses, it glues several emoji into a single ZWJ sequence
https://tonsky.me/blog/emoji/
invisible
U+200E LEFT-TO-RIGHT MARK
has the semantics of an invisible character of zero width
invisible
U+2018, U+2019, U+201C, U+201D Curly Quotes
THESE ARE VERY TROUBLESOME!!
"OpenCurlyQuote" -> {PunctuationCharacter, 16^^2018, <|"ASCIIReplacements" -> {"'"}|>},
"CloseCurlyQuote" -> {PunctuationCharacter, 16^^2019, <|"ASCIIReplacements" -> {"'"}|>},
"OpenCurlyDoubleQuote" -> {PunctuationCharacter, 16^^201c, <|"ASCIIReplacements" -> {"\""}|>},
"CloseCurlyDoubleQuote" -> {PunctuationCharacter, 16^^201d, <|"ASCIIReplacements" -> {"\""}|>},
They usually sneak in when copy/pasting from Word and from text editors.
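The ASCIIReplacements entries above amount to a four-entry translation table. A minimal Python sketch of the same mapping, useful for cleaning text pasted from a word processor; the name `straighten_quotes` is an assumption, and the sketch is not how the Wolfram tooling itself applies the replacements.

```python
# Same four mappings as the ASCIIReplacements entries listed above.
ASCII_REPLACEMENTS = str.maketrans({
    "\u2018": "'",   # \[OpenCurlyQuote]
    "\u2019": "'",   # \[CloseCurlyQuote]
    "\u201C": '"',   # \[OpenCurlyDoubleQuote]
    "\u201D": '"',   # \[CloseCurlyDoubleQuote]
})

def straighten_quotes(text):
    """Replace curly quotes with their ASCII equivalents."""
    return text.translate(ASCII_REPLACEMENTS)

if __name__ == "__main__":
    pasted = "\u201CASCIIReplacements\u201D -> {\u2018x\u2019}"
    print(straighten_quotes(pasted))
```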
U+2028 LINE SEPARATOR
//
// LINE SEPARATOR
//
// case 0x2028:
//   return true;
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+2029
\u2029 - Paragraph Separator
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+202A
https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html
U+202B
https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html
U+202C
https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html
U+202D
https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html
U+202E
https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html
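The five characters U+202A through U+202E can reorder how source code is displayed without changing what the compiler sees, which is the problem described in the linked advisory. A minimal Python sketch of a check for them; `find_bidi_controls` is a hypothetical name, and the set is limited to the five code points listed in this document (other directional controls exist and are not covered here).

```python
# The five directional formatting characters listed above, with their
# usual abbreviations.
BIDI_CONTROLS = {
    0x202A: "LRE",  # LEFT-TO-RIGHT EMBEDDING
    0x202B: "RLE",  # RIGHT-TO-LEFT EMBEDDING
    0x202C: "PDF",  # POP DIRECTIONAL FORMATTING
    0x202D: "LRO",  # LEFT-TO-RIGHT OVERRIDE
    0x202E: "RLO",  # RIGHT-TO-LEFT OVERRIDE
}

def find_bidi_controls(text):
    """Return (index, abbreviation) for each directional control in text."""
    return [
        (i, BIDI_CONTROLS[ord(ch)])
        for i, ch in enumerate(text)
        if ord(ch) in BIDI_CONTROLS
    ]

if __name__ == "__main__":
    suspicious = "if admin\u202E { // ok"
    print(find_bidi_controls(suspicious))
```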
U+202F
\u202F - Narrow No-Break Space
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+2043 HYPHEN BULLET
"SkeletonIndicator" -> {UninterpretableCharacter, 16^^2043, <|"ASCIIReplacements" -> {"-"}|>},
U+205F MEDIUM MATHEMATICAL SPACE
\[MediumSpace] U+205f
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+2060 WORD JOINER
\[NoBreak] U+2060
//
// WORD JOINER
//
// This is the character that is recommended to use for ZERO WIDTH NON-BREAKING SPACE
// https://unicode.org/faq/utf_bom.html#bom6
//
// case 0x2060:
//   return true;
invisible
U+2061 FUNCTION APPLICATION
FUNCTION APPLICATION U+2061
invisible
not the same as U+F76D \[InvisibleApplication]
Where does U+2061 (the actual FUNCTION APPLICATION character) come from?
⁡
Peter Fleck customer
https://tex.stackexchange.com/questions/552692/unicode-character-u2061-inputenc-not-set-up-for-use-with-latex
It comes from MathML:
In[2]:= ExportString[f[x], "MathML"]
Out[2]= <math ...> ... <mo>&#8289;</mo> ... </math>
10^^8289 === 16^^2061
and from AssembleFunctionCall in TypesetInit.m
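The decimal/hexadecimal identity above is easy to confirm outside the Wolfram Language. A small Python sketch that decodes the MathML numeric character reference and checks that it is U+2061; only standard-library calls are used.

```python
import html
import unicodedata

# Decode the numeric character reference used in MathML output.
ch = html.unescape("&#8289;")

code_point = ord(ch)                 # 8289 decimal
name = unicodedata.name(ch)          # official Unicode name

if __name__ == "__main__":
    print(f"U+{code_point:04X}", name, code_point == 0x2061)
```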
U+2062 INVISIBLE TIMES
INVISIBLE TIMES
\[InvisibleTimes] U+2062
invisible
U+2063 INVISIBLE SEPARATOR
INVISIBLE SEPARATOR
invisible
not the same thing as U+F765 \[InvisibleComma]
U+2064 INVISIBLE PLUS
INVISIBLE PLUS
invisible
not the same thing as U+F39E \[ImplicitPlus]
U+20E5
Some important characters also have "alternatives" in Unicode:
Windows directory separator, \ (U+005C): U+20E5, U+FF3C
UNIX directory separator, / (U+002F): U+2215, U+FF0F
Parent directory, .. (U+002E, U+002E): U+FF0E, U+FF0E
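One way to defend against these look-alikes is to fold them to ASCII before validating a path. The following Python sketch is an illustration under stated assumptions: `fold_path_lookalikes` is a hypothetical name, NFKC normalization is relied on for the fullwidth forms (U+FF3C, U+FF0F, U+FF0E), and U+2215 and U+20E5 are mapped explicitly because this sketch does not rely on NFKC to change them.

```python
import unicodedata

# Explicit mappings for the two look-alikes handled outside of NFKC.
# Treating the combining overlay U+20E5 as a backslash follows the list
# above and is an assumption of this sketch.
EXTRA = str.maketrans({
    "\u2215": "/",    # DIVISION SLASH
    "\u20E5": "\\",   # COMBINING REVERSE SOLIDUS OVERLAY
})

def fold_path_lookalikes(path):
    """Fold fullwidth and look-alike separators to their ASCII forms."""
    return unicodedata.normalize("NFKC", path).translate(EXTRA)

if __name__ == "__main__":
    sneaky = "\uFF0E\uFF0E\uFF0Fetc"
    print(fold_path_lookalikes(sneaky))
```

Folding before checking for `..` or a separator means the check sees the same string that a lenient downstream component might end up interpreting.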
U+2192 RIGHTWARDS ARROW
confusable with \[Rule]
U+2215
Some important characters also have "alternatives" in Unicode:
Windows directory separator, \ (U+005C): U+20E5, U+FF3C
UNIX directory separator, / (U+002F): U+2215, U+FF0F
Parent directory, .. (U+002E, U+002E): U+FF0E, U+FF0E
U+2423 OPEN BOX
\[SpaceIndicator] U+2423
U+29F4 RULE-DELAYED
confusable with \[RuleDelayed]
U+3000 IDEOGRAPHIC SPACE
"COMPATIBILITYKanjiSpace" -> {UnsupportedCharacter, 16^^3000, <||>},
\:3000 Ideographic Space
accidentally added by Japanese translations sometimes
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+3001 IDEOGRAPHIC COMMA
\:3001 Ideographic Comma
accidentally added by Japanese translations sometimes
related to confusables:
https://www.unicode.org/Public/security/8.0.0/confusables.txt
http://www.unicode.org/reports/tr39/#Confusable_Detection
http://www.unicode.org/Public/security/latest/confusables.txt
Surrogates
Encoding problems:
stray surrogates: U+D800 - U+DBFF (high), U+DC00 - U+DFFF (low)
invisible?
https://unicodebook.readthedocs.io/issues.html#strict-utf8-decoder
Surrogate code points are also invalid in UTF-8: anything in U+D800 - U+DFFF has to be rejected.
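Python's built-in UTF-8 codec is strict in this sense, which makes it a convenient way to see the rule in action. A minimal sketch; the helper names `is_strict_utf8` and `has_lone_surrogate` are assumptions for illustration.

```python
def is_strict_utf8(data: bytes) -> bool:
    """True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def has_lone_surrogate(text: str) -> bool:
    """True if text contains any code point in U+D800 .. U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)

if __name__ == "__main__":
    # ED A0 80 is what U+D800 would look like if it were encoded naively.
    print(is_strict_utf8(b"\xed\xa0\x80"))
    print(has_lone_surrogate("a\ud800"))
```

In the other direction, encoding a string that contains a lone surrogate with the default error handler raises `UnicodeEncodeError`, so the stray surrogate cannot silently end up in UTF-8 output.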
FE - Kernel difference
troublesome characters
https://bugs.wolfram.com/show?number=172258
Ran into this while fuzz testing. These are the characters that are documented as letter-like, yet result in a RowBox[{"a", "xxx", "b"}] when typed into the FE:
\[Dash]
\[LongDash]
\[Hyphen]
For example, typing a\[Dash]b into the FE results in RowBox[{"a", "\[Dash]", "b"}]. Expected: "a\[Dash]b"