Patch Parsers Properly
osh is a cool shell that's trying to become compatible with bash. The mdev-conf Alpine Linux package was failing to build with osh as the system shell. Andy and Bram from the Oils project traced the root cause to the following bug in osh's here doc parser:
$ cat bug.sh
# Don't understand this syntax? Me neither, at first. Keep reading and I'll cover it!
cat <<EOF
a \"quote\"
EOF
$ osh bug.sh
a "quote"
$ bash bug.sh
a \"quote\"
Interestingly enough, toysh, sush, and brush all fail in the same way. This seems to be the kind of fix you need to make your shell mature.
I think it's an interesting distinction, so let's run through how this bug was fixed & why it was done in a specific way.
Where doc?
A here document (commonly shortened to here doc) is like a multiline string from the olden days. Seriously, you're gonna say "oh yeah, that sounds like what they'd do back then."
A here doc starts with << followed by a single word, like <<MY_END_TOKEN. After that, every line of text up to a line containing only MY_END_TOKEN comprises one large string.
$ <<MY_FANCY_END_TOKEN
This is my cool essay where I can "quote things" without needing to escape them.
Oh the freedom...
MY_FANCY_END_TOKEN
The only purpose of a here doc is that, when placed after a command, its contents become the command's stdin. A little-known fact is that if cat is not given an argument, it will echo its stdin. Thus, the easiest way to see the contents of a here doc is to cat it. Of course, we can use a pipe to convert stdout into stdin.
$ echo "please cat this" | cat
please cat this
Now, we can display the contents of a here doc using cat:
$ cat <<MY_FANCY_END_TOKEN
This is my cool essay where I can "quote things" without needing to escape them.
Oh the freedom...
MY_FANCY_END_TOKEN
This is my cool essay where I can "quote things" without needing to escape them.
Oh the freedom...
Variables even get expanded in here docs:
$ x=10
$ y=$((x * 2))
$ cat <<END
$x * 2 = $y
END
10 * 2 = 20
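Backslashes, however, are pickier: with an unquoted delimiter, a backslash in a here doc only escapes $, `, and \ (plus a line-continuation newline). Before any other character, like ", it stays literal. That's exactly the rule osh got wrong. Here's a toy Python model of just that escape rule (my own sketch, not Oils code, and it deliberately ignores variable expansion):

```python
def heredoc_unescape(body):
    # Toy model of here doc backslash handling (unquoted delimiter):
    # a backslash escapes only $, ` and \; any other backslash stays literal.
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == '\\' and i + 1 < len(body):
            nxt = body[i + 1]
            if nxt in '$`\\':
                out.append(nxt)  # drop the backslash, keep the escaped char
                i += 2
                continue
        out.append(ch)  # \" and friends pass through untouched
        i += 1
    return ''.join(out)

print(heredoc_unescape(r'a \"quote\"'))  # a \"quote\"  (matches bash)
print(heredoc_unescape(r'cost: \$5'))    # cost: $5
```

Feed it the here doc body from bug.sh and you get bash's output, backslashes intact.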
Initial fix: just the parser
Now that we've learned some cursed bash knowledge, let's fix osh's parser. Game plan:
- Dive into osh's parser and figure out where it converts \" into ".
- Don't.
Seems simple enough?
I was quickly able to find where here docs are parsed. It mostly shares functionality with the parsing of double quote strings.
def ReadHereDocBody(self, parts):
    # type: (List[word_part_t]) -> None
    """
    A here doc is like a double quoted context, except " and \" aren't special.
    """
    self._ReadLikeDQ(None, False, parts)
We can then use osh's awesome utility osh --tool tokens to peek at the results of the tokenization step.
$ osh --tool tokens
cat <<END
hello! \" \$ \\
END
0 Lit_Chars 'cat'
1 WS_Space ' '
2 Redir_DLess '<<'
3 Lit_Chars 'END'
4 Op_Newline '\n'
5 Lit_Chars 'hello! '
6 Lit_EscapedChar '\\"'
7 Lit_Chars ' '
8 Lit_EscapedChar '\\$'
9 Lit_Chars ' '
10 Lit_EscapedChar '\\\\'
11 Lit_Chars '\n'
12 Eof_Real ''
13 Lit_CharsWithoutPrefix ''
14 Undefined_Tok 'END\n'
15 Eof_Real ''
(16 tokens)
\" is converted into a Lit_EscapedChar token, so we just need to add some logic to treat this differently. Sure enough _ReadLikeDQ has a clause predicated on Id.Lit_EscapedChar which extracts the second character in the token's string. We can simply replace the escaped char token with a token literal during parsing, and the problem is resolved!
  if self.token_type == Id.Lit_EscapedChar:
      ch = lexer.TokenSliceLeft(tok, 1)
+     if ch == '"' and is_here_doc:
+         tok.id = Id.Lit_Chars
+         part = tok
+     else:
-     part = word_part.EscapedLiteral(tok, ch)  # type: word_part_t
+         part = word_part.EscapedLiteral(tok, ch)  # type: word_part_t
Easy. Simple. A two-line change. All osh tests pass with flying colours.
Actually, it's worthwhile to mention how helpful the osh test suite is for making quick changes like these. I wasn't sure whether introducing sequential Lit_Chars tokens like we do above into the result of _ReadLikeDQ might break an implicit assumption somewhere else. However, these assumptions are either included in tests or considered bugs.
Second attempt: both parser & lexer
Nothing in life is easy. Any idea what's wrong with this code change? I'll give you 3 seconds to try to think up some weird reason... Okay, time's up!
This code is pretty good, but it breaks a contract between osh's parser and its lexer.
The source code passed into an interpreter will touch 3 main parts, in order:
- Lexer -> processing strings sucks. It's horrible. Turn strings into tokens right at the start and live a good life (and also keep the parser fast).
- Parser -> turn your stream of tokens into an AST. In our case we're a leaf node, so we get away with returning a list of "parts," where an escaped character is one part.
- Evaluator -> make your AST nodes do the thing they're supposed to. Evaluating a command? Get your arguments, look up the binary, then start that subprocess! Evaluating a here doc? Turn it into a string & pass it on.
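To make those three stages concrete, here's a deliberately tiny sketch of the pipeline handling one here doc line. The names and structure are hypothetical, nothing here is from the actual Oils codebase:

```python
# 1. Lexer: turn the raw string into tokens once, up front.
def lex(s):
    tokens, i = [], 0
    while i < len(s):
        # In a here doc, backslash only escapes $, ` and \.
        if s[i] == '\\' and i + 1 < len(s) and s[i + 1] in '$`\\':
            tokens.append(('Lit_EscapedChar', s[i:i + 2]))
            i += 2
        else:
            tokens.append(('Lit_Chars', s[i]))
            i += 1
    return tokens

# 2. Parser: turn the token stream into "parts" (leaf AST nodes).
def parse(tokens):
    parts = []
    for tok_id, val in tokens:
        if tok_id == 'Lit_EscapedChar':
            parts.append(('EscapedLiteral', val[1]))  # keep the char after '\'
        else:
            parts.append(('Literal', val))
    return parts

# 3. Evaluator: make the parts produce the final string.
def evaluate(parts):
    return ''.join(val for _, val in parts)

print(evaluate(parse(lex(r'hello \$world'))))  # hello $world
print(evaluate(parse(lex(r'say \"hi\"'))))     # say \"hi\"
```

The point is the shape: lex and parse each run once, and evaluate is the only stage that gets re-run.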
Now remember a moment ago when we said strings suck, so they shouldn't be in the parser? We check ch == '"' above. Why did we put strings in the parser?!
Instead, let's update our lexer's rules to turn \" into Lit_BackslashDoubleQuote. Then when it's time to convert, we compare tokens, not strings.
# DQ stands for "double quote." Here docs share the same lexer rules as double quoted strings.
LEXER_DEF[lex_mode_e.DQ] = [
    # R(...) stands for regex
-   R(r'\\[$`"\\]', Id.Lit_EscapedChar),
+   R(r'\\[$`\\]', Id.Lit_EscapedChar),
    # C(...) stands for constant
+   C('\\"', Id.Lit_BackslashDoubleQuote),
    C('\\\n', Id.Ignored_LineCont),
    C('\\', Id.Lit_BadBackslash),  # syntax error in YSH, but NOT in OSH
] + _LEFT_SUBS + _VARS + [
    R(r'[^$`"\0\\]+', Id.Lit_Chars),  # matches a line at most
    C('$', Id.Lit_Dollar),  # completion of var names relies on this
    # NOTE: When parsing a here doc line, this token doesn't end it.
    C('"', Id.Right_DoubleQuote),
]
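Rule order matters here: the lexer tries patterns top to bottom and takes the first match, so once " is removed from the escaped-char class, \" falls through to its own constant rule. Here's a rough sketch of that first-match dispatch (a simplification of mine, not the real R/C machinery):

```python
import re

# Simplified rule list mirroring the diff above; patterns are tried
# top to bottom and the first match wins.
DQ_RULES = [
    (re.compile(r'\\[$`\\]'), 'Lit_EscapedChar'),      # \$ \` \\
    (re.compile(r'\\"'), 'Lit_BackslashDoubleQuote'),  # \" gets its own token
    (re.compile(r'[^$`"\\]+'), 'Lit_Chars'),           # plain run of chars
]

def tokenize(s):
    tokens, i = [], 0
    while i < len(s):
        for pat, tok_id in DQ_RULES:
            m = pat.match(s, i)
            if m:
                tokens.append((tok_id, m.group()))
                i = m.end()
                break
        else:
            raise ValueError('no rule matched at position %d' % i)
    return tokens

print(tokenize(r'hi \" \$'))
# [('Lit_Chars', 'hi '), ('Lit_BackslashDoubleQuote', '\\"'),
#  ('Lit_Chars', ' '), ('Lit_EscapedChar', '\\$')]
```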
Then update the parser as well.
  if self.token_type == Id.Lit_EscapedChar:
      ch = lexer.TokenSliceLeft(tok, 1)
      part = word_part.EscapedLiteral(tok, ch)  # type: word_part_t

+ elif self.token_type == Id.Lit_BackslashDoubleQuote:
+     if left_token:
+         ch = lexer.TokenSliceLeft(tok, 1)
+         part = word_part.EscapedLiteral(tok, ch)
+     else:
+         # in here docs, \" should not be escaped; it stays as literal characters
+         tok.id = Id.Lit_Chars
+         part = tok
What not to do
While our new change looks good, there's a third way to do it. The wrong way. We could have implemented our fix by modifying only the evaluator, like so:
- Mark all EscapedLiteral parts that belong to here docs.
- During evaluation, replace any " with \", undoing the original behaviour of the parser.
This approach also solves the string escaping problem, but introduces a new seriously bad side-effect...
One might assume each interpreted line of code is lexed, parsed, and evaluated each time it's run. While this is possible, no serious interpreter works this way! Typically, lexing and parsing are performed only once, while evaluation is performed dynamically. We can see this distinction in a short example:
$ cat ./test.sh
#!./bin/osh
for i in $(seq 1 4); do
cat <<END
loop \"$i\"
END
done
I then added a few trace statements to osh so we can see what's going on inside. Running this code shows that we were partially right! ReadHereDocBody() only runs once, but the same parts are evaluated multiple times. Our hypothetical change would live in EscapedLiteral evaluation, so it really would run on each loop iteration.
$ ./test.sh
# This is the parsing stage for here docs
in ReadHereDocBody()
in _EvalWordPart(' loop ') @ Literal
in _EvalWordPart('\\"') @ EscapedLiteral
in _EvalWordPart('\\"') @ EscapedLiteral
in _EvalWordPart('\n') @ Literal
loop \"1\"
in _EvalWordPart(' loop ') @ Literal
# Look! Look! Our escaped literals got evaluated a second time
in _EvalWordPart('\\"') @ EscapedLiteral
in _EvalWordPart('\\"') @ EscapedLiteral
in _EvalWordPart('\n') @ Literal
loop \"2\"
in _EvalWordPart(' loop ') @ Literal
in _EvalWordPart('\\"') @ EscapedLiteral
in _EvalWordPart('\\"') @ EscapedLiteral
in _EvalWordPart('\n') @ Literal
loop \"3\"
in _EvalWordPart(' loop ') @ Literal
in _EvalWordPart('\\"') @ EscapedLiteral
in _EvalWordPart('\\"') @ EscapedLiteral
in _EvalWordPart('\n') @ Literal
loop \"4\"
The seriously bad side-effect from before is that evaluation can be performed over and over again if it's in a loop or a repeated function call. Of course, the performance cost of a few ops is almost nothing, but as a principle, it's great our fix didn't have to change the evaluator at all!
Nothing like a good feeling.