-
Notifications
You must be signed in to change notification settings - Fork 2
quick parser combinator tutorial
Say I have this simple calculator language:
Data Expr = Add Expr Expr | Number Int
I want to read things like 2 + 2 + 3 + 4 and turn them into something like
Add (Add (Add 2 2) 3) 4
Or, the abstract syntax for my language. We could use a parser generator (Happy is the major one for Haskell, and is actually used to parse Haskell code), but there's another way to do this that is a little bit easier to maintain, I think: parser combinator libraries. The two major ones in Haskell are Parsec and Megaparsec. We are currently using Parsec but we might want to migrate to Megaparsec, it supposedly has better error messages.
Parser combinators let us write out larger parsers as sequences or combinations of smaller ones. In this way, we sort of build up to our actual parser by writing out the smaller bits piece by piece. Parsec gives us some very basic parsers to start out with:
-
digitparses a single digit -
string xparses anything that matches x and then some combinators to produce more complex parsers out of the simpler ones: -
many xwill repeat the parse x zero or more times and give you back a list of the successful parses (remember,String==[Char]) -
many1 xwill repeat the parse x one or more times --- - the operator
x <|> yis how you express choice. Parsec will first tryxand ifxfails, it will then tryy. (WARNING: Parsec parses consume their input! If a parser fails halfway through, everything it ate up to that point is gone!) - To deal with the caveat above, there's a combinator
try x. The parsexwill be attempted, and if it fails, it will backtrack, putting everything the parserxate back.
Parsec is monadic, so we can use do notation to write out the ordering:
parseInt = do
x <- many digit -- This gets a list of digits
return $ read x -- read is a builtin Haskell function that tries to turn a string into an int (technically can turn a string into anything, its polymorphic)
In do notation, the arrow (x <- digit) means "do this monadic computation, and store its unwrapped output in x".
Haskell types can look really gross. ParsecT is a monad transformer, which we don't need to know about at all, but it really messes up the type signature. Most of the functions you can look up that do monad things are labeled with m as the arbitrary monad, and a is the thing inside the monad's "box" (cf. return :: a -> m a). With monads, just read everything but the last type as that m. So if we have a type
Awful Bad Terrible Long Type Really Annoying Int
, Awful Bad Terrible Long Type Really Annoying is m, and Int is a, so return 3 will give us the type
Awful Bad Terrible Long Type Really Annoying Int
.
We can see from the type of digit:
digit :: ParsecT s u m Char
That digit is a Char wrapped up in a box. Now look at the type of many
many :: ParsecT s u m a -> ParsecT s u m [a]
many takes a thing inside of a parser box, and then gives us back a parser box with a list of that thing.
so, many digit is then:
many digit :: ParsecT s u m [Char]
or
ParsecT s u m String -- String == [Char]
so then we can give the result x (which we unwrapped with <-) to the function read, which turns it into an Int for us, and then we put that normal, unboxed up Int into the datatype, and then we return that datatype to create the parser, which is of type ParsecT s u m Int! (Recall: return :: a -> m a, specialized to Int gives return Int -> m Int
But most things aren't as simple as "just a number". In particular, we want to also parse