Yet another Wolfram Language parser!

I am a big fan of having multiple, independent parsers available for the Wolfram Language. There are two separate official parsers in the Mathematica product: one for the kernel and one for the notebook front end. Other products sold by Wolfram, such as Workbench, have their own parsers. Open-source projects such as Mathics and the Wolfram Language IntelliJ Plugin also have their own parsers. I also created my own parser, CodeParser, which underpins my work on Wolfram Language static analysis, formatting, and an LSP implementation.

Well, Tree-sitter is a parser generator tool that is growing in popularity, and I have started to write a Wolfram Language grammar for Tree-sitter as well.

Who uses Tree-sitter?

The recently announced vscode.dev project is a lightweight version of VS Code running fully in the browser. vscode.dev uses Tree-sitter to provide additional experiences such as Outline/Go to Symbol and Symbol Search.

The Semantic Code team at GitHub uses Tree-sitter to power its static analysis.

Atom and Semgrep use Tree-sitter for their language support.

In designing the grammar, I am going to largely follow the same structure and design decisions as CodeParser. This will make the two parsers easier to compare and keep things consistent.

There is a large TODO list. These are the items that immediately come to mind:

  • flat infix operators

I would like a + b + c to parse as a single flat node:

(source_file
  (infix_expression
    (identifier)
    (identifier)
    (identifier)))

This may be achievable with hidden rules, but I have not yet figured out how.

Related Tree-sitter discussion

This is needed because a single + or * expression can be very large, spanning hundreds of lines. Nesting one node per operator would make the tree correspondingly deep.
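
To make the goal concrete, here is a minimal Python sketch of the flattening as a post-processing pass over an already-parsed tree. It is only an illustration, not how Tree-sitter or CodeParser does it: the names Infix and flatten are hypothetical, leaves are plain strings, and it assumes parenthesized groups are kept as their own node type so they are not merged by accident.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Infix:
    """Hypothetical infix node: one operator applied to its operands."""
    op: str
    operands: List["Node"]


Node = Union[str, Infix]  # a leaf is just an identifier name


def flatten(node: Node) -> Node:
    """Collapse a chain of the same operator into one flat node.

    The chain is walked with an explicit stack rather than recursion,
    so a chain with thousands of operands does not hit the recursion
    limit. Recursion is only used when the operator changes.
    """
    if isinstance(node, str):
        return node
    flat: List[Node] = []
    stack: List[Node] = [node]
    while stack:
        current = stack.pop()
        if isinstance(current, Infix) and current.op == node.op:
            # Push children reversed so they pop in source order.
            stack.extend(reversed(current.operands))
        else:
            flat.append(flatten(current))
    return Infix(node.op, flat)
```

For example, flattening Infix("+", [Infix("+", ["a", "b"]), "c"]) gives a single "+" node with the three operands a, b, and c, which corresponds to the tree shown above.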

  • ternary operators

a ~ f ~ b

a : b : c

a /: b := c

  • special operators

Operators such as ;; have a complicated syntax that must be handled specially. For example, a ;; b, a ;;, ;; b, and a ;; b ;; c are all valid.

  • Real number syntax

Real number syntax in WL is very complicated. For example, this is a real number literal in WL:

-1.23``45*^-67
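
The pieces of such a literal can be spelled out with a regular expression. The Python sketch below is a simplification under stated assumptions: it covers base-10 literals only (no base^^digits form), and it accepts a leading minus because the example above has one, although a parser may instead treat the minus as a separate prefix operator. The names NUMBER and is_number are hypothetical.

```python
import re

# Simplified pattern for base-10 Wolfram Language number literals.
# Assumption: base^^digits forms are deliberately left out.
NUMBER = re.compile(
    r"""
    -?                                  # leading minus, as in the example above
    (?: \d+ (?: \. \d* )? | \. \d+ )    # digits with an optional fraction
    (?: ``[+-]?\d+(?:\.\d*)?            # two backticks, then an accuracy
      | `(?:[+-]?\d+(?:\.\d*)?)?        # one backtick, optional precision
    )?
    (?: \*\^ [+-]? \d+ )?               # *^ followed by a decimal exponent
    """,
    re.VERBOSE,
)


def is_number(text: str) -> bool:
    """Return True if the whole string matches the simplified pattern."""
    return NUMBER.fullmatch(text) is not None
```

With this pattern, is_number accepts the literal above as well as shorter forms such as 1.5` and 2*^10, and rejects strings such as 1.2.3.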

  • implicit Times

Any juxtaposed expressions that do not otherwise parse are considered to be multiplied together. This is convenient for symbols, where a b is equivalent to a * b, but it must work for all expressions.
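
A rough way to see the rule is as a pass over a token stream that inserts an explicit * wherever a token that can end an operand is directly followed by a token that can start one. The Python sketch below is deliberately naive and makes several assumptions: tokens are only identifiers, integers, and single punctuation characters, and prefix operators, postfix operators, and strings are ignored. The function names are hypothetical.

```python
import re
from typing import List

OPENERS = {"(", "{"}        # brackets that can begin an operand
CLOSERS = {")", "]", "}"}   # brackets that can end an operand


def tokenize(source: str) -> List[str]:
    """Split into identifiers, integers, and single punctuation characters."""
    return re.findall(r"[A-Za-z$][A-Za-z0-9$]*|\d+|\S", source)


def is_atom(token: str) -> bool:
    """True for identifier and integer tokens."""
    return token[0].isalnum() or token[0] == "$"


def insert_implicit_times(tokens: List[str]) -> List[str]:
    """Insert "*" between an operand end and an operand start."""
    out: List[str] = []
    for token in tokens:
        ends_operand = bool(out) and (is_atom(out[-1]) or out[-1] in CLOSERS)
        starts_operand = is_atom(token) or token in OPENERS
        if ends_operand and starts_operand:
            out.append("*")
        out.append(token)
    return out
```

Note that "[" is not treated as starting an operand, so f[x] stays a call, while f[x] y and 2 x (y + 1) each gain explicit * tokens.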

  • nested comments
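
WL comments are delimited by (* and *) and can nest, so a single regular expression is not enough to find where a comment ends; a depth counter is needed. The Python sketch below shows only that counting logic. The name comment_end is hypothetical, and the sketch ignores any special handling that string contents inside a comment might need.

```python
def comment_end(source: str, start: int) -> int:
    """Return the index just past the comment opening at `start`.

    Returns -1 if the comment is never closed.
    """
    if not source.startswith("(*", start):
        raise ValueError("no comment opens at this position")
    depth = 0
    i = start
    while i < len(source):
        if source.startswith("(*", i):
            depth += 1          # entering a (possibly nested) comment
            i += 2
        elif source.startswith("*)", i):
            depth -= 1          # leaving one level of nesting
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return -1
```

The same counting loop is the kind of logic that would live in a hand-written scanner rather than in the declarative part of a grammar.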

  • stringifying operators

Operators such as :: and << parse their arguments as strings: in a::b the b is a string, and in <<a the a is a string.

Tree-sitter has token.immediate, which matches a token only when no whitespace or other extras precede it, and that may work here.

  • fill in long names

WL has syntax for characters such as \[Alpha] and these must be lexed correctly.

It may make sense to bring in an external tokenizer.
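
To illustrate what lexing long names involves, here is a small Python sketch that recognizes the \[Name] form and maps it to a character. It is an assumption-laden toy: the LONG_NAMES table holds only three entries for illustration, unknown names are left untouched rather than reported as errors, and the names LONG_NAMES, LONG_NAME, and replace_long_names are hypothetical.

```python
import re

# Tiny illustrative subset; a real table has a great many more entries.
LONG_NAMES = {
    "Alpha": "\u03b1",
    "Beta": "\u03b2",
    "Gamma": "\u03b3",
}

# A backslash, an opening bracket, a name, and a closing bracket.
LONG_NAME = re.compile(r"\\\[([A-Za-z][A-Za-z0-9]*)\]")


def replace_long_names(source: str) -> str:
    """Replace known \\[Name] sequences with their characters."""

    def lookup(match):
        # Leave unknown names as written; a real lexer would flag them.
        return LONG_NAMES.get(match.group(1), match.group(0))

    return LONG_NAME.sub(lookup, source)
```

A table-driven lookup like this is one reason an external tokenizer is attractive: the table is large, and the handling of unknown names is easier to express in ordinary code.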

  • fill in ErrorNodes

  • fill in SyntaxErrorNodes

  • fill in linear syntax

  • fill in Integral syntax

Check out progress here: https://github.com/bostick/tree-sitter-wolfram
