Architecture of CodeParser
Overview
CodeParser is a parser for Wolfram Language code.
A lot of work has gone into making it bullet-proof.
It has graceful error handling, and some value is returned for every input string.
This allows the parser to be used in editors as people type interactively.
I was motivated to write this parser because I think there has been a superstition that writing a WL parser would be hard, for several reasons:
- so, so many operators
- crazy tokenization rules
- crazy character escape rules
- the belief that you would have to learn some specific parser-generator technology like YACC.
The parser is separated into several well-defined layers:
ByteDecoder (bytes -> SourceCharacters)
↓
CharacterDecoder (SourceCharacters -> WLCharacters)
↓
Tokenizer (WLCharacters -> Tokens)
↓
Parser (Tokens -> Concrete Syntax)
↓
Aggregate (Concrete Syntax -> Aggregate Syntax)
↓
Abstract (Aggregate Syntax -> Abstract Syntax)
Theory
CodeParser uses a design based on Pratt parsing. I was inspired by articles such as Pratt Parsers: Expression Parsing Made Easy, which show that parsing really can be easy.
Two large tables of prefix parselets and infix parselets are maintained, and at any step in the parsing there is a unique parselet to use, i.e., the parser is “predictive”.
Wolfram Language is essentially an LL(1) language, and tools such as YACC are technically far more powerful than necessary, except for a few bits involving syntax such as:
;;
a : b : c
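To make the parselet idea concrete, here is a toy Pratt-style loop in WL. This is only a sketch; none of these names (parseExpr, parsePrefix, parseInfix, prec) appear in CodeParser itself.

tokens = {"a", "+", "b", "*", "c"}; pos = 1;
peek[] := If[pos <= Length[tokens], tokens[[pos]], None];
next[] := With[{t = peek[]}, pos++; t];
prec = <|"+" -> 10, "*" -> 20|>;  (* infix binding powers *)
parsePrefix[] := next[];  (* prefix parselet: treat the token as a leaf *)
parseInfix[left_, op_] := {op, left, parseExpr[prec[op]]};
parseExpr[p_] := Module[{left = parsePrefix[]},
  (* keep consuming infix operators while they bind tighter than p *)
  While[KeyExistsQ[prec, peek[]] && prec[peek[]] > p,
    left = parseInfix[left, next[]]];
  left];
parseExpr[0]  (* {"+", "a", {"*", "b", "c"}} *)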
If a node is not something else, then it is a leaf.
All nodes have a uniform structure:
Node[tag or operator, contents or children, data]
data is just an Association that stores information such as source location; custom data may also be added. Adding custom data is useful for tagging nodes with types or simply attaching temporary metadata for another pass.
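For example (the Source values here are illustrative; "MyType" is a made-up piece of custom data, not anything CodeParser defines):

Needs["CodeParser`"]
node = InfixNode[Plus,
  {LeafNode[Symbol, "a", <|Source -> {{1, 1}, {1, 2}}|>],
   LeafNode[Token`Plus, "+", <|Source -> {{1, 2}, {1, 3}}|>],
   LeafNode[Symbol, "b", <|Source -> {{1, 3}, {1, 4}}|>]},
  <|Source -> {{1, 1}, {1, 4}}|>];
(* attach custom data to the data Association for a later pass *)
typed = MapAt[Append["MyType" -> "arith"], node, {3}]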
We take advantage of the symbolic nature of WL and use the function symbols themselves for tags:
a+b
is parsed as:
InfixNode[Plus, {LeafNode[Symbol, "a", <||>], LeafNode[Token`Plus, "+", <||>], LeafNode[Symbol, "b", <||>]}, <||>]
a::b
is parsed as:
InfixNode[MessageName, {LeafNode[Symbol, "a", <||>], LeafNode[Token`ColonColon, "::", <||>], LeafNode[String, "b", <||>]}, <||>]
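This can be reproduced directly (output abbreviated; the real nodes carry Source data in the Associations):

Needs["CodeParser`"]
CodeConcreteParse["a+b"]
(* ContainerNode[String, {InfixNode[Plus, {LeafNode[Symbol, "a", _], LeafNode[Token`Plus, "+", _], LeafNode[Symbol, "b", _]}, _]}, _] *)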
Building
I use CMake for building.
There are several tables of data that are used at build time to generate WL and C++ code.
ByteDecoder
Bytes -> SourceCharacters
One interesting choice that I made was to treat \r\n as a single SourceCharacter.
This is handled when UTF-8 bytes are decoded into characters and greatly simplifies logic in later stages.
In the Unicode standard, \r\n is a grapheme cluster.
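A conceptual sketch of the merge (the real decoder works on UTF-8 bytes in C++; this is just the idea):

(* merge the two-character sequence \r\n into a single SourceCharacter *)
decodeNewlines[chars_List] := SequenceReplace[chars, {"\r", "\n"} -> "\r\n"]
decodeNewlines[Characters["a\r\nb"]]  (* {"a", "\r\n", "b"} *)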
Character Encodings
UTF-8 input is assumed everywhere.
There is an API function SafeString that accepts an array of bytes and returns a “safe” string, i.e., the string decoded from the bytes assuming UTF-8 input, with these changes:
- Any invalid byte sequences are converted into \[UnknownGlyph]
- Any high or low surrogates are converted into \[UnknownGlyph]
- The BOM character U+FEFF is converted into U+E001, to allow transferring through MathLink.
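For example, assuming SafeString is given a list of integer byte values (0xFF can never occur in valid UTF-8):

Needs["CodeParser`"]
SafeString[{104, 105, 255}]  (* "hi" followed by \[UnknownGlyph] *)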
CharacterDecoder
SourceCharacters -> WLCharacters
If a WLCharacter is not something else, then it is letterlike.
This sometimes leads to confusing parses; e.g., all of these characters are letterlike:
\b
\[RawEscape]
\[AliasDelimiter]
Escaped characters
All \ sequences are separate WL characters:
\b
\f
\n
\r
\t
\"
\\
This leads to a clean design.
Raw
WLCharacters like \[RawReturn] are a way of escaping that character. These are poorly understood and perhaps essentially unused.
A good philosophy that I follow is to treat the Raw characters as escaped versions of their normal characters.
The single codepoint U+0009 is not the same as the sequence of SourceCharacters \t.
The single codepoint U+000A is not the same as the sequence of SourceCharacters \n.
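The distinction is easy to see at the plain-string level:

ToCharacterCode["\t"]  (* {9}: the single codepoint U+0009 *)
Characters["\\t"]      (* {"\\", "t"}: the two SourceCharacters backslash and t *)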
Private Use Area
No attempt will be made to define or describe characters in the PUA.
The notebook front-end defines a number of PUA characters for its own internal use.
This is not a binding contract, and usage, values, behavior, and stability are subject to change at any moment.
Categories
Each WLCharacter belongs to a category:
RawCharacter
LetterlikeCharacter
WhitespaceCharacter
NewlineCharacter
PunctuationCharacter
UninterpretableCharacter
UnsupportedCharacter
The term PunctuationCharacter is used here instead of Operator because the term Operator implies knowledge at the Token level.
A Character is Punctuation.
A Token is an Operator.
UnsupportedCharacters are LongNames that are in an intermediate stage of support within WL.
Perhaps they have been deprecated, perhaps the Front End understands the character but the kernel doesn’t, etc.
ASCII Replacements are derived from internal Kernel sources and various other sources such as symbolTable, with some modifications and additions.
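A hypothetical sketch of the category idea (this is not the real implementation; the real tables are generated at build time), showing the default-to-letterlike rule described above:

category[c_String] := Which[
  c === "\n" || c === "\r", "NewlineCharacter",
  c === " " || c === "\t", "WhitespaceCharacter",
  StringContainsQ["+-*/^=,;<>()[]{}", c], "PunctuationCharacter",
  True, "LetterlikeCharacter"]
category /@ {"+", " ", "a"}  (* {"PunctuationCharacter", "WhitespaceCharacter", "LetterlikeCharacter"} *)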
Aside: Line continuations
Conceptually, line continuations are handled in their own layer:
ByteDecoder (bytes -> SourceCharacters)
↓
ContinuationMerger (SourceCharacters -> MergedSourceCharacters)
↓
CharacterDecoder (MergedSourceCharacters -> WLCharacters)
Tokenizer
WLCharacters -> Tokens
Tokenizing WL is more complicated than actual parsing!
If a token is not something else, then it is prefix.
All LongName operators are a separate token, so this means that there are a lot of tokens!
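For example (output abbreviated; the token names are my best understanding of CodeParser's scheme):

Needs["CodeParser`"]
CodeTokenize["a\[Rule]b"]
(* roughly: {LeafNode[Symbol, "a", _], LeafNode[Token`LongName`Rule, "\[Rule]", _], LeafNode[Symbol, "b", _]} *)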
Concrete Syntax
Grammar productions are not included; this is similar to hidden rules in tree-sitter.
Type information is added.
Type information is the wrapper, like InfixNode[Plus, ...].
Implicit tokens are added.
Aside: What are Implicit tokens?
When parsing ;;, it is convenient to remember the implicit tokens 1 ;; All.
When parsing a; ;, it is convenient to remember the implicit tokens a ; Null ; Null.
When parsing a b, it is convenient to remember the implicit tokens a * b.
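For example (output abbreviated; the Token`Fake` names are my best understanding of how CodeParser marks implicit tokens):

Needs["CodeParser`"]
CodeConcreteParse["a;;"]
(* roughly: ContainerNode[String, {BinaryNode[Span, {LeafNode[Symbol, "a", _], LeafNode[Token`SemiSemi, ";;", _], LeafNode[Token`Fake`ImplicitAll, "", _]}, _]}, _] *)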
Aggregate Syntax
The next stage is removing trivia from Concrete Syntax and calling the result Aggregate Syntax.
Type information is kept and implicit tokens are kept.
Aside: What is trivia?
The term “trivia” is taken from the Roslyn docs.
Trivia is comments, whitespace, and newlines.
Trivia is only ever RIFFLED between tokens, never at the beginning or end.
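For example, the concrete parse of a + b riffles whitespace trivia between the three tokens (output abbreviated):

Needs["CodeParser`"]
CodeConcreteParse["a + b"]
(* roughly: InfixNode[Plus, {LeafNode[Symbol, "a", _], LeafNode[Whitespace, " ", _], LeafNode[Token`Plus, "+", _], LeafNode[Whitespace, " ", _], LeafNode[Symbol, "b", _]}, _] inside a ContainerNode; aggregating drops the Whitespace leaves *)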
Abstract Syntax
The final stage is converting aggregate syntax into abstract syntax.
After abstracting, everything is a CallNode.
Type information is lost because everything is a CallNode.
Implicit tokens are converted to actual tokens.
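For example (output abbreviated):

Needs["CodeParser`"]
CodeParse["a+b"]
(* roughly: ContainerNode[String, {CallNode[LeafNode[Symbol, "Plus", _], {LeafNode[Symbol, "a", _], LeafNode[Symbol, "b", _]}, _]}, _] *)
(* the InfixNode[Plus, ...] wrapper from the concrete stage is gone *)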
Aside: Boxes
The notebook front-end box language can also be handled with the CodeConcreteParseBox function:
RowBox[{"1", "+", RowBox[{"(*", "*)"}], "a"}]
RowBox[{"1", "+", RowBox[{"(*", "*)"}], SqrtBox["a"]}]
The tree structure of the input is preserved and additional information is added, such as implicit tokens.
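For example:

Needs["CodeParser`"]
CodeConcreteParseBox[RowBox[{"1", "+", "a"}]]
(* returns a ContainerNode whose children mirror the box structure, with Source entries that index into the original boxes; this description is my best understanding *)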
Performance
Performance is not an explicit goal. I have focused on correctness and extensibility over performance.
There is an open issue for addressing performance.
Testing
There is an extensive test suite that is used for regression testing.
I have also used AFL extensively for fuzz testing, and I have found and fixed several bugs triggered by fuzzing!