Inside Razor - Part 1 - Recursive Ping-Pong

This is the first of my blog posts about the parser for the new ASP.NET Razor syntax.  We've been working on this parser for a while now, and I want to share some of how it works with my readers!

The Razor parser is very different from the existing ASPX parser.  The ASPX parser is implemented almost entirely with regular expressions, because ASPX is a very simple language to parse.  The Razor parser, by contrast, is separated into three components: 1) a Markup parser which has a basic understanding of HTML syntax, 2) a Code parser which has a basic understanding of either C# or VB, and 3) a central orchestrator which understands how the two mix together.  Note that when I say "basic understanding" I mean basic; we're not talking about full-fledged C# and HTML parsers here.  I've joked with people on the team that we should call them "Markup Understander" or "Code Comprehender" instead :).

So the Razor parser has three "actors": the Core Parser (the central orchestrator), the Markup Parser and the Code Parser.  All three work together to parse a Razor document.  Now, let's take a Razor file and walk through the full parsing procedure using these actors.  We'll use the sample that I used last time:

<ul>
@foreach(var p in Model.Products) {
<li>@p.Name ($@p.Price)</li>
}
</ul>

OK, now we start at the top.  The Razor parser is essentially in one of three states at any time during parsing: Parsing a Markup Document, Parsing a Markup Block or Parsing a Code Block.  The first two are handled by the Markup Parser, and the last is handled by the Code Parser.  So, when the Core Parser is fired up for the first time, it calls into the Markup Parser and asks it to parse a Markup Document and return the result.  Now the parser is in the Markup Document state.  In this state, it simply scans forward to the next "@" character.  It doesn't care about tags or other HTML concepts, just "@".  When it reaches an "@", it makes a decision: "Is this a switch to code, or is it part of an email address?"  This decision is basically made by looking at the characters just before and just after the "@" to see if they are valid email characters.  If both are, the "@" is treated as part of an email address; otherwise it is a switch to code.  This is the default convention, but there are escape sequences to force an "@" to be treated as a switch to code.
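
To make the "@" decision concrete, here is a minimal Python sketch.  It is not the actual Razor code.  The function names are hypothetical, and treating only letters and digits as "valid email characters" is an assumption made for illustration; the real rule may accept more characters.

```python
def is_email_char(ch: str) -> bool:
    # Assumption: only letters and digits count as email characters here.
    # An empty string (start or end of input) is never an email character.
    return ch.isalnum()


def is_switch_to_code(text: str, at_index: int) -> bool:
    """Decide whether the '@' at at_index starts code (True) or looks like
    part of an email address (False)."""
    if text[at_index] != "@":
        raise ValueError("at_index must point at an '@' character")
    before = text[at_index - 1] if at_index > 0 else ""
    after = text[at_index + 1] if at_index + 1 < len(text) else ""
    looks_like_email = is_email_char(before) and is_email_char(after)
    return not looks_like_email


# The first '@' in the sample follows a newline, so it switches to code.
print(is_switch_to_code("<ul>\n@foreach", 5))
# An '@' with letters on both sides is left alone as an email address.
email = "mail bob@example.com"
print(is_switch_to_code(email, email.index("@")))
```

Note that the second "@" in the sample, in "($@p.Price)", is preceded by "$", so this check also treats it as a switch to code.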

In this case, when we see our first "@", it is preceded by whitespace, which is not valid in an email address.  So, we now know we are switching to code.  The Markup Parser calls into the Code Parser and asks it to parse a Code Block.  A Block, in terms of the Razor Parser, is basically a single chunk of Code or Markup with a clear start and end sequence.  So, the 'foreach' statement here is an example of a Code Block.  It starts at the "f" character and ends at the "}" character.  The Code Parser knows enough about C# to know this, so it starts parsing the code.  The Code Parser does some very simple tracking of C# statements, so when it gets to the "<li>" it knows it's at the start of a C# statement.  "<li>" is not something you can put at the start of a C# statement, so the Code Parser knows that this is the start of a nested Markup Block.  So, it calls back into the Markup Parser to have it parse a block of HTML.  This creates a sort of recursive ping-pong game between the Code and Markup parsers.  We start in Markup, then call into Code, then call into Markup and so on, before finally returning back up this whole chain.  At the moment, the call stack in the parser looks something like this:

  • HtmlMarkupParser.ParseDocument()
    • CSharpCodeParser.ParseBlock()
      • HtmlMarkupParser.ParseBlock()

(Obviously, I am leaving out a lot of helper methods :)).

This highlights a fundamental difference between ASPX and Razor.  In an ASPX file, you can think of Code and Markup as two parallel streams.  You write some Markup, then you jump over and write some code, then you jump back and write some Markup, and so on.  A Razor file is like a tree.  You write some Markup, and then put some Code inside that Markup, then put some Markup inside that Code, and so on.
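
One way to picture the tree is to write the sample out as nested nodes.  The sketch below is only an illustration in Python; the Node class, its fields and the exact split points are made up for this example and are not the parser's real data structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str                      # "markup" or "code"
    text: str = ""                 # literal text (leaf nodes only)
    children: List["Node"] = field(default_factory=list)


def depth(node: Node) -> int:
    """Number of levels from this node down to its deepest leaf."""
    return 1 + max((depth(c) for c in node.children), default=0)


def flatten(node: Node) -> str:
    """Rebuild the source text by reading the leaves left to right."""
    return node.text + "".join(flatten(c) for c in node.children)


# The sample file as a tree: Markup > Code > Markup > Code.
sample_tree = Node("markup", children=[
    Node("markup", "<ul>\n"),
    Node("code", children=[
        Node("code", "@foreach(var p in Model.Products) {\n"),
        Node("markup", children=[
            Node("markup", "<li>"),
            Node("code", "@p.Name"),
            Node("markup", " ($"),
            Node("code", "@p.Price"),
            Node("markup", ")</li>\n"),
        ]),
        Node("code", "}"),
    ]),
    Node("markup", "\n</ul>"),
])

print(depth(sample_tree))    # 4 levels of nesting
print(flatten(sample_tree))  # the original five-line sample
```

A "parallel streams" model would only need a flat list of alternating chunks.  Here each chunk can own children of the other kind, which is what the nesting expresses.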

So, we've just called into the Markup Parser to parse a block of Markup.  This block starts at "<li>" and ends at the matching "</li>".  Until we reach that matching "</li>", we won't consider the Markup Block finished.  So even if you had a "}" somewhere inside the "<li>", it wouldn't terminate the "foreach", because we haven't come far enough back up the stack yet.
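
Here is a rough Python sketch of that "read until the matching close tag" rule.  The function name and the regular expression are assumptions made for illustration, and the sketch ignores things a real parser has to handle (self-closing tags, comments, nested "@" switches and so on).

```python
import re

# Matches an opening or closing tag and captures the slash and the tag name.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def find_markup_block_end(text: str, start: int) -> int:
    """Return the index just past the close tag that matches the opening
    tag found at `start`."""
    first = TAG.match(text, start)
    if first is None or first.group(1) == "/":
        raise ValueError("expected an opening tag at start")
    name = first.group(2).lower()
    open_count = 0
    for m in TAG.finditer(text, start):
        if m.group(2).lower() != name:
            continue                      # some other tag, ignore it
        open_count += -1 if m.group(1) else 1
        if open_count == 0:
            return m.end()                # matching close tag found
    raise ValueError("unterminated markup block")


text = "<li>a } b</li>\n}"
end = find_markup_block_end(text, 0)
print(text[:end])   # the whole <li> block, including the inner "}"
```

Because the scan only looks for tags, the "}" inside the "<li>" is just ordinary markup text.  Only the "}" left over after the block is handed back to whoever called in.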

While parsing the "<li>", the Markup Parser sees more "@" characters, which means even more calls into the Code Parser. And so the call stack grows:

  • HtmlMarkupParser.ParseDocument()
    • CSharpCodeParser.ParseBlock()
      • HtmlMarkupParser.ParseBlock()
        • CSharpCodeParser.ParseBlock()

I'll go into detail on how these blocks are terminated later, because it is a little complicated, but eventually we finish these code blocks and we're back in the "<li>" block.  Then, we see "</li>" so we finish that block and pop back up to the "foreach" block.  The "}" terminates that block, so we pop back up to the top of our stack again: the Markup Document.  Then we read until the end of the file, not finding any more "@" characters.  And we're done!  We've parsed the entire file!
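
The whole walk can be imitated with a toy recursive parser.  The Python sketch below is not how Razor is implemented; it only mimics the ping-pong and records the call stack each time a parser is entered.  Several things in it are assumptions for illustration: the keyword list, the rule that an implicit expression is just identifier characters and dots, and tags without attributes.  It also does not cope with malformed input.

```python
KEYWORDS = {"foreach", "for", "if", "while"}   # assumption: a tiny subset


def parse_document(src: str) -> list:
    """Parse src and return a snapshot of the call stack at every entry."""
    stack, trace = [], []

    def enter(name):
        stack.append(name)
        trace.append(" > ".join(stack))

    def is_switch(i):
        # '@' switches to code unless both neighbours look like email chars.
        before = src[i - 1] if i > 0 else ""
        after = src[i + 1] if i + 1 < len(src) else ""
        return not (before.isalnum() and after.isalnum())

    def markup_document(i):
        enter("HtmlMarkupParser.ParseDocument")
        while i < len(src):
            i = code_block(i + 1) if src[i] == "@" and is_switch(i) else i + 1
        stack.pop()
        return i

    def markup_block(i):
        enter("HtmlMarkupParser.ParseBlock")
        name_end = src.index(">", i)
        close = "</" + src[i + 1:name_end] + ">"
        i = name_end + 1
        while not src.startswith(close, i):    # a "}" here is just markup
            i = code_block(i + 1) if src[i] == "@" and is_switch(i) else i + 1
        stack.pop()
        return i + len(close)

    def code_block(i):
        enter("CSharpCodeParser.ParseBlock")
        j = i
        while j < len(src) and (src[j].isalnum() or src[j] == "_"):
            j += 1
        if src[i:j] in KEYWORDS:               # statement block: "{ ... }"
            i = src.index("{", j) + 1
            depth, at_statement_start = 1, True
            while depth > 0:
                ch = src[i]
                if ch == "<" and at_statement_start:
                    i = markup_block(i)        # ping-pong back to markup
                    continue
                depth += (ch == "{") - (ch == "}")
                if not ch.isspace():
                    at_statement_start = ch in "{};"
                i += 1
        else:                                  # implicit expression
            while i < len(src) and (src[i].isalnum() or src[i] in "_."):
                i += 1
        stack.pop()
        return i

    markup_document(0)
    return trace


SAMPLE = "<ul>\n@foreach(var p in Model.Products) {\n<li>@p.Name ($@p.Price)</li>\n}\n</ul>"
for line in parse_document(SAMPLE):
    print(line)
```

Run against the sample, the deepest snapshots have four frames (document, code block, markup block, code block), which is the same shape as the call stack shown above.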

I hope that's made the general structure of the parsing algorithm a bit clearer.  The key take-away here is to avoid thinking of Code and Markup as separate streams and instead think of them as constructs you nest inside each other.  Our next topic will be Implicit Expressions, which is the logic that allows us to detect which parts of "@p.Name ($@p.Price)" are code and which are markup.  Here's a hint: we took some inspiration from PowerShell ;).

Please post any questions or comments in the comments section or email me at "andrew AT andrewnurse DOT net"!