Troublesome characters: 8-bit

8-bit characters that cause some amount of trouble.

Codepoint	Description	WL
U+0000 - U+001F	ASCII control (C0)
U+0020	Space
U+0027	'
U+002E	.
U+007F	DEL
U+0080 - U+009F	C1 control
U+00A0 - U+00FF	Misc (not C1)

ASCII control (and DEL)

note: U+0000 - U+001F, U+007F

LF CR TAB

\r is slightly stranger than \n

\r alone was the line ending on classic Mac OS (version 9 and earlier); modern macOS uses \n

CRLF is a grapheme cluster https://unicode.org/reports/tr29/#Table_Combining_Char_Sequences_and_Grapheme_Clusters
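
A minimal Python sketch of why \r needs care when splitting lines. The sample string is made up for illustration; `str.splitlines()` treats \n, \r, and \r\n each as one line boundary, while splitting on \n alone leaves stray \r characters behind.

```python
text = "one\ntwo\rthree\r\nfour"

# str.splitlines() recognizes \n, \r, and \r\n (CRLF counts as one boundary).
lines = text.splitlines()

# Splitting on "\n" alone does not split at the bare \r and keeps the \r of CRLF.
naive = text.split("\n")

print(lines)
print(naive)
```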

“strange” ASCII characters C0

Most C0 characters are letterlike (except `\.07`)

must be careful that the test for control characters is not just 0x00 to 0x1F

testing for control should also include DEL (0x7F)

C's iscntrl() returns true for DEL
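
The control-character test noted above could be sketched in Python like this. Both helper names are hypothetical; `is_ascii_control` is meant to mirror what C's iscntrl() reports for ASCII input in the default "C" locale, and `naive_is_control` is the incomplete range check that forgets DEL.

```python
def is_ascii_control(ch: str) -> bool:
    """Return True for C0 controls (U+0000 - U+001F) and DEL (U+007F)."""
    code = ord(ch)
    return code <= 0x1F or code == 0x7F


def naive_is_control(ch: str) -> bool:
    """Incomplete test: covers only 0x00 to 0x1F and misses DEL."""
    return ord(ch) <= 0x1F


# DEL is a control character, but the naive range check does not catch it.
print(is_ascii_control("\x7f"), naive_is_control("\x7f"))
```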

Unicode only specifies semantics for U+0009 - U+000D, U+001C - U+001F, and U+0085. The rest of the control codes are transparent to Unicode, and their meanings are left to higher-level protocols.

U+0000 `\.00`

big headache

it is letterlike

ML encoding

U+0007 `\.07`

BEL

Not interpretable

U+0008 `\b`

BACKSPACE U+0008 is pretty troublesome in notebooks https://bugs.wolfram.com/show?number=379004

invisible?

screenshot of Paclet Development notebook on cloud

U+000B

Line Tabulation (\v)

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+000C `\f`

\u000C - Form Feed (\f)

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+001B `\[RawEscape]`

used with ANSI escape codes

RawEscape

invisible?
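
Since ESC mostly shows up as the start of ANSI escape codes, one way to deal with it is to strip those sequences before further processing. This is only a sketch: the name `strip_csi` is hypothetical, and the pattern handles CSI sequences (ESC followed by `[`) only, not every escape sequence a terminal understands.

```python
import re

# CSI sequence: ESC, "[", parameter bytes (0x30-0x3F),
# intermediate bytes (0x20-0x2F), then one final byte (0x40-0x7E).
CSI_PATTERN = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_csi(text: str) -> str:
    """Remove ANSI CSI escape sequences such as color codes."""
    return CSI_PATTERN.sub("", text)


colored = "\x1b[31mred\x1b[0m"
print(repr(strip_csi(colored)))
```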

U+007F

DEL

often handled with C0, but not technically C0

invisible?

plain ASCII

plain ASCII characters

what does isgraph() return?

SPACE

Causes the Least Trouble
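
On the isgraph() question: in C's default "C" locale, isgraph() is true for 0x21 to 0x7E and false for SPACE, while isprint() is true for SPACE as well. A small Python sketch of that behavior follows; `is_graph` and `is_print` are hypothetical stand-ins for the C functions and only model ASCII input.

```python
def is_graph(ch: str) -> bool:
    """ASCII characters with a visible glyph: 0x21 ('!') to 0x7E ('~')."""
    return 0x21 <= ord(ch) <= 0x7E


def is_print(ch: str) -> bool:
    """Graphic characters plus SPACE."""
    return ch == " " or is_graph(ch)


# SPACE is printable but not graphic; DEL is neither.
print(is_graph(" "), is_print(" "), is_graph("\x7f"))
```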

graphical ASCII characters

punctuation

. is troublesome because of typos (intended ,)

find bug where there was a . instead of ,

' is troublesome because of typos (intended ;)

. and '

. is next to , on the keyboard

' is next to ; on the keyboard

accidentally doing:

f[a . b]

instead of

f[a, b]

accidentally doing:

f[ g[]' h[] ]

instead of doing:

f[ g[]; h[] ]

these are valid syntax! a . b parses as Dot[a, b], and g[]' h[] parses as Derivative[1][g[]] multiplied by h[], so neither typo produces a syntax error

numbers

letters

C1

post #3

non-ASCII control C1 U+0080 - U+009F

all invisible? how should C1 be rendered?

where is this specified in Unicode spec?

Unicode only specifies semantics for U+0009 - U+000D, U+001C - U+001F, and U+0085. The rest of the control codes are transparent to Unicode, and their meanings are left to higher-level protocols.

There are these characters: https://en.wikipedia.org/wiki/Control_Pictures
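
The Control Pictures block gives visible stand-ins for the C0 controls (U+2400 to U+241F) and DEL (U+2421), but it has no pictures for C1. A minimal Python sketch of using them to make invisible characters visible; the name `show_controls` is hypothetical, and leaving C1 characters untouched is a choice made for this example, not something Unicode prescribes.

```python
def show_controls(text: str) -> str:
    """Replace C0 controls and DEL with their Control Pictures symbols."""
    out = []
    for ch in text:
        code = ord(ch)
        if code <= 0x1F:
            # U+2400 - U+241F line up one-to-one with U+0000 - U+001F.
            out.append(chr(0x2400 + code))
        elif code == 0x7F:
            # SYMBOL FOR DELETE
            out.append("\u2421")
        else:
            # Everything else, including C1, passes through unchanged.
            out.append(ch)
    return "".join(out)


print(show_controls("a\x00b\x07c\x7f"))
```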

{#, IntegerString[First[ToCharacterCode[#]], 16, 4]}& /@ CharacterRange[ToExpression["\"\\.80\""], ToExpression["\"\\.9f\""]] // Column

some of C1 have glyphs:

U+0080

has a glyph

looks like Euro

http://archives.miloush.net/michkap/archive/2005/10/26/484481.html

Euro in Windows-1252

U+0081

has a glyph

CenterDot?

it’s a little square

U+0082

has a glyph

little hyphen?

U+0084

has a glyph

long underscore?

U+008e

has a glyph

CapitalZHacek

CapitalZHacek in Windows-1252

U+0093

has a glyph

OpenCurlyDoubleQuote

OpenCurlyDoubleQuote in Windows-1252

U+0094

has a glyph

CloseCurlyDoubleQuote

CloseCurlyDoubleQuote in Windows-1252

U+0097

has a glyph

some kind of space?

U+009e

has a glyph

ZHacek

ZHacek in Windows-1252

U+0085

\u0085 - Next Line

defined to have semantics

https://eslint.org/docs/rules/no-irregular-whitespace

invisible

U+0083, U+0086, U+0087, U+0088, U+0089, U+008a, U+008b, U+008c, U+008d, U+008f, U+0090, U+0091, U+0092, U+0095, U+0096, U+0098, U+0099, U+009a, U+009b, U+009c, U+009d, U+009f

no semantics

no glyphs
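
Several of the glyphs noted above match what Windows-1252 assigns to the same byte values. A Python sketch for checking that mapping follows; `cp1252_char` is a hypothetical helper built on Python's standard `cp1252` codec. Whether the glyphs observed in the notebook actually come from a Windows-1252 fallback is an assumption, not something this code can confirm.

```python
from typing import Optional


def cp1252_char(byte_value: int) -> Optional[str]:
    """Decode one byte as Windows-1252, or return None if it is unassigned."""
    try:
        return bytes([byte_value]).decode("cp1252")
    except UnicodeDecodeError:
        return None


# Byte values in 0x80 - 0x9F that Python's cp1252 codec leaves undefined.
undefined = [b for b in range(0x80, 0xA0) if cp1252_char(b) is None]

for b in range(0x80, 0xA0):
    ch = cp1252_char(b)
    label = "undefined" if ch is None else f"U+{ord(ch):04X}"
    print(f"0x{b:02X} -> {label}")
```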

include my notes

include screenshots

Misc

major source of mojibake

not C1

note: U+00A0 - U+00FF

8-bit

Target receipt: different code pages produce mojibake
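
A minimal Python sketch of how a code page mismatch produces mojibake in this range. The sample word is made up for illustration: text encoded as UTF-8 is decoded as Windows-1252, so each non-ASCII character turns into two characters from U+0080 - U+00FF. Reversing the two steps recovers the original here, though that is not guaranteed for every input.

```python
original = "caf\u00e9"  # "café"

# Writer uses UTF-8, reader assumes Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")

# Undo the mismatch by reversing both steps.
recovered = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(recovered)
```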

U+00A0

\[NonBreakingSpace] U+00A0

https://eslint.org/docs/rules/no-irregular-whitespace

invisible
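
Because these characters are invisible, it can help to scan for them explicitly. The sketch below covers only the four 8-bit characters that these notes tag with the ESLint link (the ESLint rule itself lists more); the names `IRREGULAR_WHITESPACE` and `find_irregular_whitespace` are hypothetical.

```python
from typing import List, Tuple

# Only the 8-bit characters flagged in these notes, keyed to their Unicode names.
IRREGULAR_WHITESPACE = {
    "\u000b": "LINE TABULATION",
    "\u000c": "FORM FEED",
    "\u0085": "NEXT LINE",
    "\u00a0": "NO-BREAK SPACE",
}


def find_irregular_whitespace(text: str) -> List[Tuple[int, str]]:
    """Return (index, name) for each irregular whitespace character found."""
    return [
        (index, IRREGULAR_WHITESPACE[ch])
        for index, ch in enumerate(text)
        if ch in IRREGULAR_WHITESPACE
    ]


print(find_irregular_whitespace("a\u00a0b\u000cc"))
```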
