Skip to content

quick parser combinator tutorial

Kai Gay edited this page Jan 10, 2020 · 5 revisions

Intro

Say I have this simple calculator language:

Data Expr = Add Expr Expr | Number Int

I want to read things like 2 + 2 + 3 + 4 and turn them into something like

Add (Add (Add 2 2) 3) 4

Or, the abstract syntax for my language. We could use a parser generator (Happy is the major one for Haskell, and is actually used to parse Haskell code), but there's another way to do this that is a little bit easier to maintain, I think: parser combinator libraries. The two major ones in Haskell are Parsec and Megaparsec. We are currently using Parsec but we might want to migrate to Megaparsec, it supposedly has better error messages.

Parser Combinators

Parser combinators let us write out larger parsers as sequences or combinations of smaller ones. In this way, we sort of build up to our actual parser by writing out the smaller bits piece by piece. Parsec gives us some very basic parsers to start out with:

  • digit parses a single digit
  • string x parses anything that matches x and then some combinators to produce more complex parsers out of the simpler ones:
  • many x will repeat the parse x zero or more times and give you back a list of the successful parses (remember, String == [Char])
  • many1 x will repeat the parse x one or more times ---
  • the operator x <|> y is how you express choice. Parsec will first try x and if x fails, it will then try y. (WARNING: Parsec parses consume their input! If a parser fails halfway through, everything it ate up to that point is gone!)
  • To deal with the caveat above, there's a combinator try x. The parse x will be attempted, and if it fails, it will backtrack, putting everything the parser x ate back.

Parsec is monadic, so we can use do notation to write out the ordering:

parseInt = do
  x <- many digit -- This gets a list of digits
  return $ read x -- read is a builtin Haskell function that tries to turn a string into an int (technically can turn a string into anything, its polymorphic) 

In do notation, the arrow (x <- digit) means "do this monadic computation, and store its unwrapped output in x".

QUICK DIVERSION ABOUT READING TYPES

Haskell types can look really gross. ParsecT is a monad transformer, which we don't need to know about at all, but it really messes up the type signature. Most of the functions you can look up that do monad things are labeled with m as the arbitrary monad, and a is the thing inside the monad's "box" (cf. return :: a -> m a). With monads, just read everything but the last type as that m. So if we have a type

Awful Bad Terrible Long Type Really Annoying Int

, Awful Bad Terrible Long Type Really Annoying is m, and Int is a, so return 3 will give us the type

Awful Bad Terrible Long Type Really Annoying Int

.

back to parsers

We can see from the type of digit:

digit :: ParsecT s u m Char

That digit is a Char wrapped up in a box. Now look at the type of many

many :: ParsecT s u m a -> ParsecT s u m [a]

many takes a thing inside of a parser box, and then gives us back a parser box with a list of that thing. so, many digit is then:

many digit :: ParsecT s u m [Char]

or

ParsecT s u m String -- String == [Char]

so then we can give the result x (which we unwrapped with <-) to the function read, which turns it into an Int for us, and then we put that normal, unboxed up Int into the datatype, and then we return that datatype to create the parser, which is of type ParsecT s u m Int! (Recall: return :: a -> m a, specialized to Int gives return Int -> m Int

Recursion

But most things aren't as simple as "just a number". In particular, we want to also parse

Clone this wiki locally