Troublesome characters: 8-bit
8-bit characters that cause some amount of trouble.
Table of contents
Codepoint | Description | WL |
---|---|---|
U+0000 - U+001F | ASCII control (C0) | |
U+0020 | Space | |
U+0027 | ' | |
U+002E | . | |
U+007F | DEL | |
U+0080 - U+009F | C1 control | |
U+00A0 - U+00FF | | |
ASCII control (and DEL)
note: U+0000 - U+001F, U+007F
LF CR TAB
\r is slightly stranger than \n
classic Mac OS (pre-OS X) line ending; modern macOS uses \n
CRLF is a grapheme cluster https://unicode.org/reports/tr29/#Table_Combining_Char_Sequences_and_Grapheme_Clusters
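A minimal sketch of line splitting that treats CRLF as one unit, in line with the grapheme-cluster note above. The helper name `split_lines` is hypothetical, and Python is used here only for illustration:

```python
import re

# Try CRLF first so "\r\n" counts as one terminator, then lone CR, then lone LF.
_NEWLINE = re.compile(r"\r\n|\r|\n")

def split_lines(text):
    """Split on CRLF, CR, or LF, treating CRLF as a single line ending."""
    return _NEWLINE.split(text)

print(split_lines("a\r\nb\rc\nd"))
```

Listing \r\n before \r in the alternation is what keeps a CRLF pair from being counted as two line endings.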
“strange” ASCII characters C0
Most C0 characters are letterlike (except \.07)
must be careful that the test for control characters is not just 0x00 to 0x1F
the test for control characters should also include DEL (0x7F)
C iscntrl() returns true for DEL
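A small sketch of the pitfall, assuming an ASCII-only check. The function names `naive_is_control` and `is_ascii_control` are made up for this example:

```python
import unicodedata

def naive_is_control(ch):
    """The buggy test: only covers 0x00 to 0x1F and misses DEL."""
    return ord(ch) <= 0x1F

def is_ascii_control(ch):
    """C0 controls (0x00 to 0x1F) plus DEL (0x7F)."""
    cp = ord(ch)
    return cp <= 0x1F or cp == 0x7F

DEL = "\x7f"
print(naive_is_control(DEL), is_ascii_control(DEL))
# Unicode also puts DEL in general category Cc (control).
print(unicodedata.category(DEL))
```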
Unicode only specifies semantics for U+0009 - U+000D, U+001C - U+001F, and U+0085. The rest of the control codes are transparent to Unicode and their meanings are left to higher-level protocols.
U+0000 \.00
big headache
it is letterlike
ML encoding
U+0007 \.07
BEL
Not interpretable
U+0008 \b
BACKSPACE U+0008 is pretty troublesome in notebooks https://bugs.wolfram.com/show?number=379004
invisible?
screenshot of Paclet Development notebook on cloud
U+000B \v
\u000B - Line Tabulation (\v)
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+000C \f
\u000C - Form Feed (\f)
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
U+001B \[RawEscape]
used with ANSI escape codes
RawEscape
invisible?
U+007F
DEL
often handled with C0, but not technically C0
invisible?
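One possible way to make the invisible characters above visible is to map them to the Unicode Control Pictures block (U+2400 onward). This is only a sketch; `show_controls` is a hypothetical helper and it also replaces TAB, LF, and CR:

```python
def show_controls(text):
    """Replace C0 controls and DEL with their Control Pictures symbols."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0x1F:
            out.append(chr(0x2400 + cp))   # U+2400..U+241F mirror U+0000..U+001F
        elif cp == 0x7F:
            out.append("\u2421")           # SYMBOL FOR DELETE
        else:
            out.append(ch)
    return "".join(out)

print(show_controls("a\x08b \x1b[0m \x7f"))
```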
plain ASCII
plain ASCII characters
what does isgraph() return?
SPACE
Causes the Least Trouble
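As a rough answer to the isgraph() question, here is a sketch of what C's isgraph() and isprint() accept in the default "C" locale, written as plain range checks. The Python names `is_graph` and `is_print` are made up here and are not library functions:

```python
def is_graph(ch):
    """Like C isgraph() in the "C" locale: visible ASCII, excluding SPACE."""
    return 0x21 <= ord(ch) <= 0x7E

def is_print(ch):
    """Like C isprint() in the "C" locale: visible ASCII, including SPACE."""
    return 0x20 <= ord(ch) <= 0x7E

print(is_graph(" "), is_print(" "))
print(sum(is_graph(chr(i)) for i in range(128)))
```

SPACE is printable but not graphical, and DEL is neither.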
graphical ASCII characters
punctuation
. is troublesome because of typos (intended ,)
find bug where there was a . instead of ,
' is troublesome because of typos (intended ;)
. and '
. is next to , on the keyboard
' is next to ; on the keyboard
accidentally doing:
f[a . b]
instead of
f[a, b]
accidentally doing
f[ g[]' h[] ]
instead of doing:
f[ g[]; h[] ]
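these are valid syntax!
a . b parses as Dot[a, b]
g[]' h[] parses as Derivative[1][g[]] times h[] (implicit multiplication)
so neither typo is reported as a syntax error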
numbers
letters
C1
post #3
non-ASCII control C1 U+0080 - U+009F
all invisible? how should C1 be rendered?
where is this specified in Unicode spec?
Unicode only specifies semantics for U+0009 - U+000D, U+001C - U+001F, and U+0085. The rest of the control codes are transparent to Unicode and their meanings are left to higher-level protocols.
There are these characters: https://en.wikipedia.org/wiki/Control_Pictures (symbols for the C0 controls, space, and DEL; none for C1)
{#, IntegerString[ToCharacterCode[#], 16, 4]}& /@ CharacterRange[ToExpression["\"\\.80\""], ToExpression["\"\\.9f\""]] // Column
some of C1 have glyphs:
U+0080
has a glyph
looks like Euro
http://archives.miloush.net/michkap/archive/2005/10/26/484481.html
Euro in Windows-1252
U+0081
has a glyph
CenterDot?
it’s a little square
U+0082
has a glyph
little hyphen?
U+0084
has a glyph
long underscore?
U+008E
has a glyph
CapitalZHacek
CapitalZHacek in Windows-1252
U+0093
has a glyph
OpenCurlyDoubleQuote
OpenCurlyDoubleQuote in Windows-1252
U+0094
has a glyph
CloseCurlyDoubleQuote
CloseCurlyDoubleQuote in Windows-1252
U+0097
has a glyph
some kind of space?
U+009E
has a glyph
ZHacek
ZHacek in Windows-1252
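Several of the glyphs noted above match what the same byte values mean in Windows-1252. A sketch for comparing them, using Python's built-in cp1252 codec (the helper name `cp1252_meaning` is hypothetical; this does not show what any particular font or front end actually renders):

```python
import unicodedata

def cp1252_meaning(byte):
    """Return the character Windows-1252 assigns to a byte, or None if unassigned."""
    try:
        return bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        return None  # Python's cp1252 codec leaves this byte undefined

for b in (0x80, 0x81, 0x82, 0x84, 0x8E, 0x93, 0x94, 0x97, 0x9E):
    ch = cp1252_meaning(b)
    label = "unassigned" if ch is None else f"U+{ord(ch):04X} {unicodedata.name(ch)}"
    print(f"0x{b:02X}: {label}")
```

In this codec 0x81 has no assignment, which would be consistent with the little square seen for U+0081.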
U+0085
C1
\u0085 - Next Line
defined to have semantics
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
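A quick illustration of U+0085 having real semantics in at least one language runtime. This is Python behavior only and says nothing about how WL treats it:

```python
import unicodedata

NEL = "\u0085"

print(unicodedata.category(NEL))    # general category of NEXT LINE
print(NEL.isspace())                # counted as whitespace by str.isspace
print("a\u0085b".splitlines())      # str.splitlines treats it as a line boundary
print("a\u0085b".split("\n"))       # splitting on LF alone does not
```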
U+0083, U+0086, U+0087, U+0088, U+0089, U+008A, U+008B, U+008C, U+008D, U+008F, U+0090, U+0091, U+0092, U+0095, U+0096, U+0098, U+0099, U+009A, U+009B, U+009C, U+009D, U+009F
C1
no semantics
no glyphs
include my notes
include screenshots
Misc
major source of mojibake
not C1
note: U+00A0 - U+00FF
8-bit
Target receipt: different code pages produce mojibake
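A minimal sketch of how a code page mismatch turns U+00A0 - U+00FF characters into mojibake: text is encoded as UTF-8 and then decoded as Windows-1252. The function name `mojibake` and the choice of these two encodings are assumptions for illustration:

```python
def mojibake(text, wrong_encoding="cp1252"):
    """Encode as UTF-8, then decode as if the bytes were a single-byte code page."""
    return text.encode("utf-8").decode(wrong_encoding, errors="replace")

garbled = mojibake("caf\u00e9")
print(garbled)

# When no byte was lost, the damage can be undone by reversing the two steps.
print(garbled.encode("cp1252").decode("utf-8"))
```

Each character in U+0080 - U+00FF takes two bytes in UTF-8, so it shows up as two characters after the wrong decode.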
U+00A0
\[NonBreakingSpace] U+00A0
https://eslint.org/docs/rules/no-irregular-whitespace
invisible
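A sketch of a checker in the spirit of the ESLint rule linked above, limited to the four invisible characters these notes mention. It is not the rule's full list, and `find_irregular_whitespace` is a hypothetical name:

```python
# Only the characters discussed in these notes; an assumption, not ESLint's full set.
IRREGULAR = {
    "\u000b": "LINE TABULATION",
    "\u000c": "FORM FEED",
    "\u0085": "NEXT LINE",
    "\u00a0": "NO-BREAK SPACE",
}

def find_irregular_whitespace(text):
    """Return (index, codepoint, name) for each irregular whitespace character."""
    return [
        (i, f"U+{ord(ch):04X}", IRREGULAR[ch])
        for i, ch in enumerate(text)
        if ch in IRREGULAR
    ]

print(find_irregular_whitespace("a\u00a0b\u000bc"))
```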