A quick summary of important points:
- The parser is zero-copy doesn’t duplicate the source string in the generated AST.
- Parsing takes advantage of caching any successful sub-parses to minimize backtracking.
- Nodes are stored in an index-based arena (a
Vec) for more straightforward iteration and lifetime management
At a high level, the parsing strategy is to try to (almost) consecutively apply each potential item’s parse method, and determine if it returns a successful result. If the result is is not successful,
either move on to the next available item, or the default parser.
For elements, the default parser is Paragraph::parse and for objects, the default parser is parse_text().
We match based on the first character to decide which item’s parser to try. For example, if we match on #, we’d first try Block::parse(), then Keyword::parse() and so on. If we match on |, we’d first try Table::parse(), then move on to the default parser (Paragraph::parse()). This approach allows us to skip trying most of the parsing functions for a given input.
The typical transition is:
parse_org(): entry point to the parser, runsparse_elementin a loop.parse_element(): parses elementsParagraph::parse(): if no specific elements are found: handle the default paragraph element. Paragraphs contain objects. Runsparse_object()in a loop.parse_object(): parses objectsparse_text(): if no objects are recognized, interpret them as text
The parser was implemented manually without the use of parser combinator libraries to keep dependencies low and to have more flexibility with the implementation and performance.
Parseable is the trait that provides the parse() method for each element/object struct. It returns a NodeID. Corresponding to the element that was placed inside the NodePool.
NodePool is the index arena that Node’s are stored in. Using an arena helps simplify lifetimes and provides easy iteration over all elements in the AST. We pass a mutable reference to the arena to each within each parse() to fill it up when needed.
Each Node contains an Expr (which maps to an actual AST item) and additional metadata, which is useful during parsing / exporting.
We take in a &str and turn it into a byte array (&[u8]) with a Cursor. Cursor has some helpful utility functions implemented to make the parsing functions easier to write and more legible. We also avoid re-allocating the input this way.
The parsing function we attempt to use can make significant progress into parsing, even accumulating child nodes of its own before failing (such as in the case of improperly closed markup). So in theory, we’d be heavily backtracking and re-parsing elements we’ve already seen!
To avoid this, we try to cache the progress we’ve made within each parsing function (like a packrat parser).
Not all progress can be cached, especially in the case of “state changes”, like in a #+begin_src block where the contents aren’t org.
This isn’t a big deal for non cache-able elements since they’re quicker to parse.
- Helpful for understanding how a <<packrat>> parser works: https://blog.bruce-hill.com/packrat-parsing-from-scratch
- Motivation behind going for a flattened arena-based AST: https://www.cs.cornell.edu/~asampson/blog/flattening.html