Architecture of CodeParser
Overview
CodeParser is a parser for Wolfram Language code.
A lot of work has gone into making it bullet-proof.
It has graceful error handling, and some value is returned for every input string.
This allows the parser to be used in editors as people type interactively.
I was motivated to write this parser because I think there has been a superstition that writing a WL parser would be hard, for several reasons:
- so, so many operators
- crazy tokenization rules
- crazy character escape rules
- the belief that you would have to learn some specific parser-generator technology like YACC.
The parser is separated into several well-defined layers:
ByteDecoder (bytes -> SourceCharacters)
↓
CharacterDecoder (SourceCharacters -> WLCharacters)
↓
Tokenizer (WLCharacters -> Tokens)
↓
Parser (Tokens -> Concrete Syntax)
↓
Aggregate (Concrete Syntax -> Aggregate Syntax)
↓
Abstract (Aggregate Syntax -> Abstract Syntax)
Theory
CodeParser uses a design based on Pratt parsing. I was inspired by articles such as Pratt Parsers: Expression Parsing Made Easy, which show that parsing really can be easy.
Two large tables of prefix parselets and infix parselets are maintained, and at any step in the parsing there is a unique parselet to use, i.e., the parser is “predictive”.
Wolfram Language is essentially an LL(1) language, and tools such as YACC are technically far more powerful than necessary, except for a few bits involving syntax such as:
;;
a : b : c
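To make the parselet idea concrete, here is a toy Pratt-style loop in WL. This is only a sketch; none of these names (parseExpr, parsePrefix, parseInfix, prec) appear in CodeParser itself.

tokens = {"a", "+", "b", "*", "c"}; pos = 1;
peek[] := If[pos <= Length[tokens], tokens[[pos]], None];
next[] := With[{t = peek[]}, pos++; t];
prec = <|"+" -> 10, "*" -> 20|>;  (* infix binding powers *)
parsePrefix[] := next[];  (* prefix parselet: treat the token as a leaf *)
parseInfix[left_, op_] := {op, left, parseExpr[prec[op]]};
parseExpr[p_] := Module[{left = parsePrefix[]},
  (* keep consuming infix operators while they bind tighter than p *)
  While[KeyExistsQ[prec, peek[]] && prec[peek[]] > p,
    left = parseInfix[left, next[]]];
  left];
parseExpr[0]  (* {"+", "a", {"*", "b", "c"}} *)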
If a node is not something else, then it is a leaf.
All nodes have a uniform structure:
Node[tag or operator, contents or children, data]
data is just an Association that stores information such as source location; custom data may also be added. Adding custom data is useful for tagging nodes with types or simply attaching temporary metadata for another pass.
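For example (the Source values here are illustrative; "MyType" is a made-up piece of custom data, not anything CodeParser defines):

Needs["CodeParser`"]
node = InfixNode[Plus,
  {LeafNode[Symbol, "a", <|Source -> {{1, 1}, {1, 2}}|>],
   LeafNode[Token`Plus, "+", <|Source -> {{1, 2}, {1, 3}}|>],
   LeafNode[Symbol, "b", <|Source -> {{1, 3}, {1, 4}}|>]},
  <|Source -> {{1, 1}, {1, 4}}|>];
(* attach custom data to the data Association for a later pass *)
typed = MapAt[Append["MyType" -> "arith"], node, {3}]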
We take advantage of the symbolic nature of WL and use the function symbols themselves for tags:
a+b
is parsed as:
InfixNode[Plus, {LeafNode[Symbol, "a", <||>], LeafNode[Token`Plus, "+", <||>], LeafNode[Symbol, "b", <||>]}, <||>]
a::b
is parsed as:
InfixNode[MessageName, {LeafNode[Symbol, "a", <||>], LeafNode[Token`ColonColon, "::", <||>], LeafNode[String, "b", <||>]}, <||>]
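This can be reproduced directly (output abbreviated; the real nodes carry Source data in the Associations):

Needs["CodeParser`"]
CodeConcreteParse["a+b"]
(* ContainerNode[String, {InfixNode[Plus, {LeafNode[Symbol, "a", _], LeafNode[Token`Plus, "+", _], LeafNode[Symbol, "b", _]}, _]}, _] *)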
Building
I use CMake for building.
There are several tables of data that are used at build time to generate WL and C++ code.
ByteDecoder
Bytes -> SourceCharacters
One interesting choice that I made was to treat \r\n as a single SourceCharacter.
This is handled when UTF-8 bytes are decoded into characters and greatly simplifies logic in later stages.
In the Unicode standard, \r\n is a grapheme cluster.
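A conceptual sketch of the merge (the real decoder works on UTF-8 bytes in C++; this is just the idea):

(* merge the two-character sequence \r\n into a single SourceCharacter *)
decodeNewlines[chars_List] := SequenceReplace[chars, {"\r", "\n"} -> "\r\n"]
decodeNewlines[Characters["a\r\nb"]]  (* {"a", "\r\n", "b"} *)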
Character Encodings
UTF-8 input is assumed everywhere.
There is an API function SafeString that accepts an array of bytes and returns a “safe” string, i.e., the string decoded from the bytes assuming UTF-8 input, with these changes:
- Any invalid byte sequences are converted into \[UnknownGlyph]
- Any high or low surrogates are converted into \[UnknownGlyph]
- The BOM character U+FEFF is converted into U+E001, to allow transferring through MathLink.
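For example, assuming SafeString is given a list of integer byte values (0xFF can never occur in valid UTF-8):

Needs["CodeParser`"]
SafeString[{104, 105, 255}]  (* "hi" followed by \[UnknownGlyph] *)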
CharacterDecoder
SourceCharacters -> WLCharacters
If a WLCharacter is not something else, then it is letterlike.
This sometimes leads to confusing parses; e.g., all of these characters are letterlike:
\b
\[RawEscape]
\[AliasDelimiter]
Escaped characters
All \ sequences are separate WL characters:
\b
\f
\n
\r
\t
\"
\\
This leads to a clean design.
Raw
WLCharacters like \[RawReturn] are a way of escaping that character. These are poorly understood and perhaps essentially unused.
A good philosophy that I follow is to treat the Raw characters as escaped versions of their normal characters.
The single codepoint U+0009 is not the same as the sequence of SourceCharacters \t.
The single codepoint U+000A is not the same as the sequence of SourceCharacters \n.
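The distinction is easy to see at the plain-string level:

ToCharacterCode["\t"]  (* {9}: the single codepoint U+0009 *)
Characters["\\t"]      (* {"\\", "t"}: the two SourceCharacters backslash and t *)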
Private Use Area
No attempt will be made to define or describe characters in the PUA.
The notebook front-end defines a number of PUA characters for its own internal use.
This is not a binding contract, and usage, values, behavior, and stability are subject to change at any moment.
Categories
Each WLCharacter belongs to a category:
RawCharacter
LetterlikeCharacter
WhitespaceCharacter
NewlineCharacter
PunctuationCharacter
UninterpretableCharacter
UnsupportedCharacter
The term PunctuationCharacter is used here instead of Operator because the term Operator implies knowledge at the Token level.
A Character is Punctuation.
A Token is an Operator.
UnsupportedCharacters are LongNames that are in an intermediate stage of support within WL.
Perhaps they have been deprecated, perhaps the Front End understands the character but the kernel doesn’t, etc.
ASCII Replacements are derived from internal Kernel sources and various other sources such as symbolTable, with some modifications and additions.
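A hypothetical sketch of the category idea (this is not the real implementation; the real tables are generated at build time), showing the default-to-letterlike rule described above:

category[c_String] := Which[
  c === "\n" || c === "\r", "NewlineCharacter",
  c === " " || c === "\t", "WhitespaceCharacter",
  StringContainsQ["+-*/^=,;<>()[]{}", c], "PunctuationCharacter",
  True, "LetterlikeCharacter"]
category /@ {"+", " ", "a"}  (* {"PunctuationCharacter", "WhitespaceCharacter", "LetterlikeCharacter"} *)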
Aside: Line continuations
Conceptually, line continuations are handled in their own layer:
ByteDecoder (bytes -> SourceCharacters)
↓
ContinuationMerger (SourceCharacters -> MergedSourceCharacters)
↓
CharacterDecoder (MergedSourceCharacters -> WLCharacters)
Tokenizer
WLCharacters -> Tokens
Tokenizing WL is more complicated than actual parsing!
If a token is not something else, then it is prefix.
All LongName operators are a separate token, so this means that there are a lot of tokens!
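For example (output abbreviated; the token names are my best understanding of CodeParser's scheme):

Needs["CodeParser`"]
CodeTokenize["a\[Rule]b"]
(* roughly: {LeafNode[Symbol, "a", _], LeafNode[Token`LongName`Rule, "\[Rule]", _], LeafNode[Symbol, "b", _]} *)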
Concrete Syntax
Grammar productions are not included; this is similar to hidden rules in tree-sitter.
Type information is added.
Type information is the wrapper, like InfixNode[Plus, ...].
Implicit tokens are added.
Aside: What are Implicit tokens?
When parsing ;;, it is convenient to remember the implicit tokens 1 ;; All.
When parsing a; ;, it is convenient to remember the implicit tokens a ; Null ; Null.
When parsing a b, it is convenient to remember the implicit tokens a * b.
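For example (output abbreviated; the Token`Fake` names are my best understanding of how CodeParser marks implicit tokens):

Needs["CodeParser`"]
CodeConcreteParse["a;;"]
(* roughly: ContainerNode[String, {BinaryNode[Span, {LeafNode[Symbol, "a", _], LeafNode[Token`SemiSemi, ";;", _], LeafNode[Token`Fake`ImplicitAll, "", _]}, _]}, _] *)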
Aggregate Syntax
The next stage is removing trivia from Concrete Syntax and calling the result Aggregate Syntax.
Type information is kept and implicit tokens are kept.
Aside: What is trivia?
The term “trivia” is taken from the Roslyn docs.
Trivia is comments, whitespace, and newlines.
Trivia is only ever RIFFLED between tokens, never at the beginning or end.
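For example, the concrete parse of a + b riffles whitespace trivia between the three tokens (output abbreviated):

Needs["CodeParser`"]
CodeConcreteParse["a + b"]
(* roughly: InfixNode[Plus, {LeafNode[Symbol, "a", _], LeafNode[Whitespace, " ", _], LeafNode[Token`Plus, "+", _], LeafNode[Whitespace, " ", _], LeafNode[Symbol, "b", _]}, _] inside a ContainerNode; aggregating drops the Whitespace leaves *)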
Abstract Syntax
The final stage is converting aggregate syntax into abstract syntax.
After abstracting, everything is a CallNode.
Type information is lost because everything is a CallNode.
Implicit tokens are converted to actual tokens.
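For example (output abbreviated):

Needs["CodeParser`"]
CodeParse["a+b"]
(* roughly: ContainerNode[String, {CallNode[LeafNode[Symbol, "Plus", _], {LeafNode[Symbol, "a", _], LeafNode[Symbol, "b", _]}, _]}, _] *)
(* the InfixNode[Plus, ...] wrapper from the concrete stage is gone *)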
Aside: Boxes
The notebook front-end box language can also be handled with the CodeConcreteParseBox function:
RowBox[{"1", "+", RowBox[{"(*", "*)"}], "a"}]
RowBox[{"1", "+", RowBox[{"(*", "*)"}], SqrtBox["a"]}]
The tree structure of the input is preserved and additional information is added, such as implicit tokens.
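For example:

Needs["CodeParser`"]
CodeConcreteParseBox[RowBox[{"1", "+", "a"}]]
(* returns a ContainerNode whose children mirror the box structure, with Source entries that index into the original boxes; this description is my best understanding *)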
Performance
Performance is not an explicit goal. I have focused on correctness and extensibility over performance.
There is an open issue for addressing performance.
Testing
There is an extensive test suite that is used for regression testing.
I have also used AFL extensively for fuzz testing, and I have found and fixed several bugs triggered by fuzzing!