Creating a Wolfram Language grammar for Tree-sitter
Yet another Wolfram Language parser!
I am a big fan of having multiple, independent parsers available for the Wolfram Language. There are two separate official parsers in the Mathematica product: one for the kernel and one for the notebook front end. Other products sold by Wolfram, such as Workbench, have their own parsers. Open-source projects such as Mathics and the Wolfram Language IntelliJ Plugin also have their own parsers. I also created my own parser, CodeParser, which underpins my work on Wolfram Language static analysis, formatting, and LSP implementation.
Well, Tree-sitter is a parser generator tool that is growing in popularity and I have started to write a Wolfram Language grammar for Tree-sitter as well.
Who uses Tree-sitter?
The recently announced vscode.dev project is a lightweight version of VS Code running fully in the browser. vscode.dev uses Tree-sitter to provide additional experiences such as Outline/Go to Symbol and Symbol Search.
The Semantic Code team at GitHub uses Tree-sitter to power its static analysis.
Atom and Semgrep use Tree-sitter for their language support.
I am going to largely follow the same structure and design decisions as CodeParser when designing the grammar. This will allow better comparisons between the two parsers and keep things consistent.
There is a large TODO list and these are the things that immediately come to mind.
- flat infix operators
I would like a + b + c to parse as a single flat node:
(source_file
  (infix_expression
    (identifier)
    (identifier)
    (identifier)))
This may be achievable with hidden rules, but I have not yet figured out how.
Related Tree-sitter discussion
This is needed because there can be very large expressions with + or *, single expressions that span hundreds of lines.
- ternary operators
a ~ f ~ b
a : b : c
a /: b := c
- special operators
Operators such as ;; have a complicated syntax that must be handled specially.
- Real number syntax
Real number syntax in WL is very complicated. For example, this is a real number literal in WL:
-1.23``45*^-67
- implicit Times
Any juxtaposed expressions that do not otherwise parse are considered to be multiplied together. This is convenient for symbols: a b is equivalent to a * b, but this must work for all expressions.
- nested comments
- stringifying operators
Operators such as a::b and <<a actually parse their arguments as strings.
Tree-sitter has token.immediate, which matches a token only when no whitespace or other extras precede it, and that may work here.
- fill in long names
WL has syntax for characters such as \[Alpha] and these must be lexed correctly.
It may make sense to bring in an external tokenizer.
- fill in ErrorNodes
- fill in SyntaxErrorNodes
- fill in linear syntax
- fill in Integral syntax
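To make the flat infix item above more concrete, here is a minimal Python sketch of the shape I am after. It is not Tree-sitter code and it is not the hidden-rule approach mentioned above: it assumes a parser has already produced a left-nested binary tree, and it flattens that tree in a separate pass. The names Node, op, ident, flatten_infix, and to_sexp are all hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Tiny stand-in for a syntax tree node (hypothetical, not Tree-sitter's API)."""
    kind: str
    children: List["Node"] = field(default_factory=list)
    op: str = ""  # operator text for infix_expression nodes, e.g. "+"


def ident() -> Node:
    return Node("identifier")


def flatten_infix(node: Node) -> Node:
    """Collapse a left-nested chain of the same infix operator into one flat node.

    The left spine is walked with a loop rather than recursion, so a chain
    with thousands of operands does not hit Python's recursion limit.
    Only the left spine is flattened; right operands are left untouched.
    """
    operands = []
    current = node
    while (current.kind == "infix_expression"
           and current.op == node.op
           and len(current.children) == 2):
        left, right = current.children
        operands.append(right)
        current = left
    if not operands:
        return node  # not a binary infix node, nothing to flatten
    operands.append(current)  # leftmost operand
    operands.reverse()        # operands were collected right to left
    return Node("infix_expression", operands, op=node.op)


def to_sexp(node: Node) -> str:
    """Render a node in the same S-expression style as the tree shown earlier."""
    if not node.children:
        return f"({node.kind})"
    inner = " ".join(to_sexp(child) for child in node.children)
    return f"({node.kind} {inner})"


# a + b + c as a left-recursive rule would typically nest it: ((a + b) + c)
nested = Node(
    "infix_expression",
    [Node("infix_expression", [ident(), ident()], op="+"), ident()],
    op="+",
)
print(to_sexp(Node("source_file", [flatten_infix(nested)])))
# (source_file (infix_expression (identifier) (identifier) (identifier)))
```

The loop stops as soon as the operator changes, so (a * b) + c keeps its nested product as the first operand of the sum. A nested tree for an expression with hundreds of operands is hundreds of levels deep, which is exactly why the flat shape is worth having in the grammar itself rather than in a pass like this.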
Check out progress here: https://github.com/bostick/tree-sitter-wolfram