I deliberately try not to be “clever” with code that I write, but sometimes terseness can be mistaken for cleverness. So I try to add comments and copious documentation recording what was going through my mind when I wrote it.

One example of code that I’m proud of is from my time at Wolfram Research. We were working on an update to webMathematica, the product that forms the core of Wolfram’s online product suite and the backbone of Wolfram|Alpha.

There was a layer of Java that handled the actual web traffic and converted it into Mathematica code to be fed into the server. The parser we were using was adapted from code generated by a parser tool (ANTLR, I believe) and would occasionally throw StackOverflowError when parsing strings that didn’t seem that complicated! We were in a mode of hardening this product, so we wanted to make sure this problem was completely fixed.

After a lot of investigation, I tracked down the problem: Java’s regex engine itself was doing far more work than it should and throwing errors on strings that were merely “long”, not pathological. I kept this comment in the repo as a warning that any attempt to use regexes here was doomed:

/*
 * cannot use java.util.regex for matching strings with
 * escape characters.
 * Java implements the matching methods recursively, even
 * when they don't need to,
 * so certain regexes (even with no backtracking) will
 * cause StackOverflowError
 * see:
 * http://blog.headius.com/2006/09/java-regex-broken-for-alternation-over.html
 */


Including references is always important when claiming that your programming language is broken. So, the challenge was to be able to parse strings (possibly with escaped quotes inside) without using Java’s built-in regex engine.
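For the skeptical reader, the failure mode is easy to reproduce. The sketch below is my reconstruction, not code from the product: it matches a long, plainly non-pathological input against a simple alternation and catches the resulting StackOverflowError. Whether (and at what input length) it overflows depends on your JVM version and thread stack size.

```java
import java.util.regex.Pattern;

public class RegexOverflowDemo {

    /**
     * Builds a string of alternating 'a' and 'b' characters of the given
     * length and matches it against "(a|b)*". The input trivially matches
     * and requires no backtracking, but java.util.regex walks alternation
     * inside a loop recursively, so a long enough input can blow the stack.
     *
     * @return true if matching threw StackOverflowError
     */
    static boolean overflows(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(i % 2 == 0 ? 'a' : 'b');
        }
        try {
            Pattern.compile("(a|b)*").matcher(sb).matches();
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("short input overflows: " + overflows(10));
        System.out.println("long input overflows:  " + overflows(1_000_000));
    }
}
```

The point is that the recursion depth scales with input length, not with any complexity of the pattern.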

I have elided some of the irrelevant surrounding code:

/**
 * Advances the state of the tokenizer to the next token
 * 
 * @throws ArrayIndexOutOfBoundsException if there is no next token
 */
public void next() {
  switch (chars[curPos]) {
  ...
  case '\"':
    for (int i = curPos+1; i < len; i++) {
      if (chars[i] == '"') {
        /*
         * j is initialized to i-1, and then decremented for
         * each backslash found.
         *
         * if an even number of backslashes are found,
         * then j stays congruent to i-1 mod 2,
         * meaning (i - j) % 2 == 1.
         *
         * if an odd number of backslashes are found,
         * then j is congruent to i mod 2,
         * meaning (i - j) % 2 == 0
         */
        int j = i-1;
        while (chars[j] == '\\') {
          j--;
        }
        if ((i - j) % 2 == 1) {
          /*
           * found even number of backslashes,
           * so the quote ends a string
           */
          curCode = TokenCode.STRING;
          curText = String.copyValueOf(chars, curPos, i+1-curPos);
          curPos = i+1;
          return;
        }
      }
    }
    throw new ExprParsingException(String.format("Unbalanced \" at " +
      "position %d. Source: %s", curPos, source));
  ...
  }
}

The idea behind this string tokenizing is simple parser theory, but it is easy to forget when you are using regexes to do the matching. When the tokenizer sees a " character for the first time, it enters the case statement above and scans forward until it finds another ". Naively, a string has now been found. But! That " does not necessarily end the string: a preceding \ may be escaping it. But! There may be another \ before that one, forming an escaped backslash and leaving the " free to end the string after all. Each additional \ alternates between escaping the " and escaping the \ after it. So it comes down to counting backwards through the preceding \ characters and deciding based on parity: an even number of backslashes means the quote ends the string, and an odd number means the quote is escaped.
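The parity check can be isolated into a tiny helper. This is a hypothetical restatement for illustration, not the production code (it adds a `j >= 0` bounds check that the original didn’t need, since `chars[curPos]` always holds the opening quote and stops the backward scan):

```java
public class EscapeScan {

    /**
     * Returns true if the character at position i in chars is escaped,
     * i.e. preceded by an odd number of consecutive backslashes.
     * Walks j backwards over backslashes and checks the parity of (i - j):
     * (i - j) is one more than the number of backslashes found, so an
     * even difference means an odd backslash count (escaped).
     */
    static boolean isEscaped(char[] chars, int i) {
        int j = i - 1;
        while (j >= 0 && chars[j] == '\\') {
            j--;
        }
        return (i - j) % 2 == 0;
    }

    public static void main(String[] args) {
        char[] a = "ab\\\"cd".toCharArray();   // contents: a b \ " c d
        System.out.println(isEscaped(a, 3));   // quote preceded by one backslash: true
        char[] b = "ab\\\\\"cd".toCharArray(); // contents: a b \ \ " c d
        System.out.println(isEscaped(b, 4));   // escaped backslash, quote free: false
    }
}
```

Note the loop never counts backslashes; it only needs their parity, which falls out of the distance `i - j`.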

The code allows strings to be parsed without the stack blowing up unnecessarily (in fact the stack does not grow at all). It is small and efficient, and I still think about how nice this code is, so many years later.

Updated: