Welcome to the beginning of the compiler proper! If you haven't yet, check out part 0 for a description of the project and help setting up the development environment.
A note about the presentation structure:
We will be going through each phase of the compiler mostly in its entirety before moving on to the next phase. Most other tutorials and books don't do this. Instead, they build up the language by implementing one feature at a time at each stage of the pipeline. When building your own compiler, I strongly recommend following that approach rather than trying to write the complete parser or the complete semantic analyzer in one go: having a working pipeline, however minimal, to test with is invaluable for debugging.
I've decided to present a fairly complete version of the language here at every pass for better code locality, and because I've already written the whole thing, and it would be a lot of work to split it back up into separate features. In future parts, we'll explore adding features to an already complete pipeline, but they'll be ones that I haven't implemented yet.
Defining the language
First, we'll encode the C abstract syntax tree in Haskell types.1 C expressions can be literal integers, floats, characters, strings (we'll say much more about strings later), booleans, null (I'm sorry), and other, more complicated forms.
module Microc.Ast where
import Data.Text ( Text )
import Data.Text.Prettyprint.Doc
import Data.Char ( chr )
-- Binary operators
data Op = Add
| Sub
| Mult
| Div
-- I've added an exponentiation operator ** for fun
| Power
| Equal
| Neq
| Less
| Leq
| Greater
| Geq
| And
| Or
| BitAnd
| BitOr
deriving (Show, Eq)
-- Unary operators
-- (no post/pre-fix increment or decrement for simplicity)
data Uop = Neg
| Not
deriving (Show, Eq)
data Expr = Literal Int
| StrLit Text
| CharLit Int -- ^ Chars get lowered to ints during codegen
| Fliteral Double
| BoolLit Bool -- ^ true | false
| Null
| Id Text -- ^ variables
| Binop Op Expr Expr -- ^ binary operators
| Unop Uop Expr -- ^ unary operators (just - and !)
| Call Text [Expr] -- ^ f(x, y, z)
| Cast Type Expr -- ^ (int *)malloc(100)
| Access Expr Expr -- ^ struct access
| Deref Expr -- ^ *pointer
| Addr Expr -- ^ &expression
| Assign Expr Expr
| Sizeof Type
| Noexpr -- ^ used for dangling if's
deriving (Show, Eq)
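To make the encoding concrete, here's a small example of my own (not from the linked source): the C expression a = b + 3 is represented as
Assign (Id "a") (Binop Add (Id "b") (Literal 3))
and p->x becomes Access (Deref (Id "p")) (Id "x").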
Statements can be naked expressions, blocks, returns, or control flow. Our language will only have if, for, and while. At some later time, I might experiment with adding goto, break, and continue, as they're quite difficult/interesting to generate llvm for, but for now we have enough to worry about.
data Statement = Expr Expr
| Block [Statement]
| Return Expr
| If Expr Statement Statement
| For Expr Expr Expr Statement
| While Expr Statement
deriving (Show, Eq)
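Again as a hand-rolled illustration, the loop while (i < 10) i = i + 1; corresponds to
While (Binop Less (Id "i") (Literal 10)) (Expr (Assign (Id "i") (Binop Add (Id "i") (Literal 1))))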
The notable omissions from C's type system are unions, arrays, function pointers, and typedefs. Out of C's vast array of numeric types, we just support int (32-bit) and float (actually double precision), as implementing the rest of them isn't spectacularly interesting. We will, however, have normal pointers and structs, which are plenty powerful on their own.
data Type = Pointer Type
| TyInt
| TyBool
| TyFloat
| TyChar
| TyVoid
| TyStruct Text
deriving (Show, Eq)
Finally, with struct and function definitions, we can represent whole (single file) Micro C programs.
data Bind = Bind { bindType :: Type, bindName :: Text }
deriving (Show, Eq)
data Struct = Struct { structName :: Text, structFields :: [Bind] }
deriving (Show, Eq)
data Function = Function
{ typ :: Type
, name :: Text
, formals :: [Bind]
, locals :: [Bind]
, body :: [Statement]
}
deriving (Show, Eq)
data Program = Program [Struct] [Bind] [Function] deriving (Eq, Show)
(Full source of the AST here)
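For a concrete feel (my example, assuming OverloadedStrings for the Text literals), the one-function program int main() { return 0; } is represented as
Program [] [] [Function { typ = TyInt, name = "main", formals = [], locals = [], body = [Return (Literal 0)] }]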
Using the excellent prettyprinter library, we can define Pretty instances for the AST types which will allow us to serialize our AST back to valid C source. They are elided in this post, as they're very mechanical to write, but they are at the end of the linked source, for anyone interested.2
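To give a flavor of what they look like, here is a minimal sketch of the instance for Op (my approximation, assuming OverloadedStrings; the real instances live in the linked source):
instance Pretty Op where
  pretty op = case op of
    Add     -> "+"
    Sub     -> "-"
    Mult    -> "*"
    Div     -> "/"
    Power   -> "**"
    Equal   -> "=="
    Neq     -> "!="
    Less    -> "<"
    Leq     -> "<="
    Greater -> ">"
    Geq     -> ">="
    And     -> "&&"
    Or      -> "||"
    BitAnd  -> "&"
    BitOr   -> "|"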
Parsing the language
Parser theory is a vast field, and one of the oldest in computer science. After all, we needed to be able to parse computer languages in order to stop having to write assembly and move on to higher level problems. Writing parsers is also a favorite pastime of the haskell community; there are 210 libraries in the "Parsing" category on hackage alone, ranging from parsec-style combinator libraries, to lex/yacc style parser generators, to implementations of Earley's algorithm, and more. I cannot claim to give even a remotely thorough treatment of various parsing strategies here, but I can at least demonstrate two different methods of parsing and compare them.
Alex/Happy
First, we'll discuss using haskell's lex/yacc style parser generator libraries. Alex is a lexer generator, similar to Lex, that, given a set of regular-expression rules, transforms the text of a source file into a series of tokens. The top of the Alex file declares the generated module name and necessary imports.
{
module Microc.Scanner.Generator where
import Microc.Ast
}
Then, we specify the "wrapper" type and some regex variables.
%wrapper "basic"
$alpha = [a-zA-Z]
$digit = 0-9
$newline = [\r\n]
The meat of the lexer is the rule for how to transform characters into tokens. Code inside braces denotes a haskell function of type String -> Lexeme which is called on whatever matches the regex rule on the left. A semicolon tells Alex to ignore the matching text, which is how comments are implemented.
tokens :-
$white+ ;
"/*" ( $newline | [^\*] | \*+ ($newline | [^\/]) )* "*/" ;
"//" [^$newline]* $newline ;
\( { const LPAREN }
\) { const RPAREN }
\{ { const LBRACE }
\} { const RBRACE }
\; { const LSemi }
\, { const LComma }
\+ { const LAdd }
\- { const LSub }
\* { const LMul }
\/ { const LDiv }
\= { const LAssign }
\=\= { const LEqual }
\!\= { const LNeq }
\< { const LLess }
\<\= { const LLeq }
\> { const LGreater }
\>\= { const LGeq }
\&\& { const LAnd }
\|\| { const LOr }
\! { const LNot }
\& { const LBitAnd }
\| { const LBitOr }
\*\* { const LPow }
\. { const LDot }
\-\> { const LArrow }
"if" { const LIf }
"else" { const LElse }
"for" { const LFor }
"while" { const LWhile }
"return" { const LRet }
"int" { const $ LType TyInt }
"float" { const $ LType TyFloat }
"bool" { const $ LType TyBool }
"char" { const $ LType TyChar }
"void" { const $ LType TyVoid }
"struct" { const LStruct }
"true" { const $ LBool True }
"false" { const $ LBool False }
"NULL" { const LNull }
"sizeof" { const LSizeof }
Variables and numeric and string literals are the most interesting pieces of the scanner, as we need to handle floating point conventions and string parsing. Fortunately, we can abuse haskell's read function to do most of the heavy lifting for us.
$digit+ { LInt . read }
$digit+ \. $digit* ( [eE] [\+\-]? $digit+ )? { LFloat . read }
$alpha [$alpha $digit \_]* { LId }
\" [^\"]* \" { LStrLit . read -- this doesn't handle quote escaping }
\' [^\'\\] \' { LCharLit . ord . head . init . tail }
\'\\$digit+\' { LCharLit . read . init . drop 2 }
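For instance, in GHCi, read takes care of exponent notation and escape sequences, so the lexer rules stay tiny:
ghci> read "1.5e3" :: Double
1500.0
ghci> read "\"tab\\there\"" :: String
"tab\there"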
Finally, we have the definition of the Lexeme type.
{
data Lexeme = LInt Int
| LFloat Double
| LStrLit String
| LCharLit Int
| LId String
| LType Type
| LStruct
| LBool Bool
| LNull
| LRet
| LAssign
| LComma
| LSemi
| LPAREN
| RPAREN
| LBRACE
| RBRACE
| LBRACK
| RBRACK
| LFor
| LWhile
| LIf
| LElse
| LAdd
| LSub
| LMul
| LDiv
| LEqual
| LNeq
| LLess
| LLeq
| LGreater
| LGeq
| LAnd
| LOr
| LNot
| LBitAnd
| LBitOr
| LPow
| LDot
| LArrow
| LSizeof
}
(Full source of the alex lexer here.)
After scanning the source and generating the token list, we pass it to happy, which generates an LR bottom-up parser for our language. As with alex, the beginning of the file is the haskell module and import specification.
{
module Microc.Parser.Generator where
import Microc.Scanner.Generator
import Microc.Ast
import Data.Text (pack)
import Prelude hiding (fst, snd)
}
Then, we declare the name of the generated parsing function, the token type it will act on, and what to do in case of errors.
%name parse
%tokentype { Lexeme }
%error { parseError }
Happy also requires us to re-declare all of the token types from our lexer generator. The reason for this is unclear to me, as similar tools in other languages don't have this step. However, at least we can use symbols to denote operators instead of writing LBRACE everywhere. The $$'s represent the arguments of the Lexeme constructors.
%token
int { LInt $$ }
float { LFloat $$ }
id { LId $$ }
ptype { LType $$ }
char { LCharLit $$ }
string { LStrLit $$ }
bool { LBool $$ }
null { LNull }
return { LRet }
struct { LStruct }
sizeof { LSizeof }
'=' { LAssign }
',' { LComma }
';' { LSemi }
'(' { LPAREN }
')' { RPAREN }
'{' { LBRACE }
'}' { RBRACE }
for { LFor }
while { LWhile }
if { LIf }
else { LElse }
'+' { LAdd }
'-' { LSub }
'*' { LMul }
'/' { LDiv }
'==' { LEqual }
'!=' { LNeq }
'<' { LLess }
'<=' { LLeq }
'>' { LGreater }
'>=' { LGeq }
'&&' { LAnd }
'||' { LOr }
'!' { LNot }
'&' { LBitAnd }
'|' { LBitOr }
'**' { LPow }
'.' { LDot }
'->' { LArrow }
Next, we have the precedence rules for our operators going from lowest to highest. We include a dummy NOELSE token to solve ambiguities arising from a possible dangling else.
%nonassoc NOELSE
%nonassoc else
%right '='
%left '|'
%left '&'
%left '||'
%left '&&'
%left '==' '!='
%left '<' '>' '<=' '>='
%left '+' '-'
%left '*' '/'
%right '**'
%right '!' NEG
%left '.' '->'
Now, the important part of the parser: the recursive parsing rules. Care needs to be taken to make sure that, as much as possible, recursive calls to rules happen at the left of a production, as bottom-up parsers can deal with these in constant space, whereas right recursion can blow up with sufficiently nested files3. This means that when parsing lists of items, we'll need to reverse at some point. As with alex, code in braces is valid haskell, with $1, $2, etc. representing productions from the 1st, 2nd, etc. rules on the left. The %% is required by happy to indicate the beginning of the production rules.
Proceeding top-down, a program consists of a list of struct declarations, global variables, and functions.
%%
program:
decls { Program (reverse $ fst $1) (reverse $ snd $1) (reverse $ thd $1) }
decls:
{- empty -} { ([], [], []) }
| decls sdecl { (($2 : fst $1), (snd $1), (thd $1)) }
| decls vdecl { ((fst $1), ($2 : snd $1), (thd $1)) }
| decls fdecl { ((fst $1), (snd $1), ($2 : thd $1)) }
A function declaration is a return type followed by the function's name, a left paren, 0 or more arguments, a right paren, and finally a list of local variables and statements all enclosed in curly braces.
fdecl:
typ id '(' formals_opt ')' '{' vdecl_list stmt_list '}'
{ Function { typ = $1,
name = pack $2,
formals = $4,
locals = reverse $7,
body = reverse $8 } }
formals_opt:
{- empty -} { [] }
| formal_list { reverse $1 }
formal_list:
typ id { [Bind $1 (pack $2)] }
| formal_list ',' typ id { Bind $3 (pack $4) : $1 }
Types are either 0 or more levels of indirection to a raw type or struct. I added pointers to the language after the exponentiation operator, so I didn't anticipate the problems that ** would cause with pointer dereferencing, but we can use a clever trick to "untokenize" the exponentiation operator back into two dereferencing operators when appropriate.
typ:
ptype stars { foldr (const Pointer) $1 $2 }
| struct id stars { foldr (const Pointer) (TyStruct (pack $2)) $3 }
stars:
{ [] }
| stars '*' { $2 : $1 }
-- A hack to get around the power operator clashing with
-- the dereferencing operator
| stars '**' { $2 : $2 : $1 }
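To see the trick in action: for a declaration like int** p, alex's maximal munch produces the single lexeme LPow for the **, and the stars '**' production pushes two entries onto the list, so typ folds it into Pointer (Pointer TyInt), just as if we'd written int * * p.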
Struct declarations are the keyword struct followed by the struct's name and a list of variable declarations enclosed in curly braces, terminated with a semicolon.
sdecl:
struct id '{' vdecl_list '}' ';' { Struct (pack $2) (reverse $4) }
vdecl_list:
{- empty -} { [] }
| vdecl_list vdecl { $2 : $1 }
vdecl:
typ id ';' { Bind $1 (pack $2) }
Now we have the rules for parsing statements and expressions. The statement code is fairly straightforward except that we have to decorate an if statement with no else block with %prec NOELSE in order to avoid ambiguity in the grammar. We also have a good demonstration of the usefulness of Noexpr as a placeholder for empty slots in for loops or naked returns.
stmt_list:
{- empty -} { [] }
| stmt_list stmt { $2 : $1 }
stmt:
expr ';' { Expr $1 }
| return ';' { Return Noexpr }
| return expr ';' { Return $2 }
| '{' stmt_list '}' { Block (reverse $2) }
| if '(' expr ')' stmt %prec NOELSE { If $3 $5 (Block []) }
| if '(' expr ')' stmt else stmt { If $3 $5 $7 }
| for '(' expr_opt ';' expr ';' expr_opt ')' stmt { For $3 $5 $7 $9 }
| while '(' expr ')' stmt { While $3 $5 }
expr_opt:
{- empty -} { Noexpr }
| expr { $1 }
The expression rule is also fairly straightforward, except for the interaction of the power operator with pointer dereferencing. We also use the %prec NEG directive to force prefix operators to bind as tightly as possible.4
expr:
int { Literal $1 }
| float { Fliteral $1 }
| char { CharLit $1 }
| string { StrLit (pack $1) }
| bool { BoolLit $1 }
| null { Null }
| id { Id (pack $1) }
| expr '+' expr { Binop Add $1 $3 }
| expr '-' expr { Binop Sub $1 $3 }
| expr '*' expr { Binop Mult $1 $3 }
| expr '/' expr { Binop Div $1 $3 }
| expr '==' expr { Binop Equal $1 $3 }
| expr '!=' expr { Binop Neq $1 $3 }
| expr '<' expr { Binop Less $1 $3 }
| expr '<=' expr { Binop Leq $1 $3 }
| expr '>' expr { Binop Greater $1 $3 }
| expr '>=' expr { Binop Geq $1 $3 }
| expr '&' expr { Binop BitAnd $1 $3 }
| expr '|' expr { Binop BitOr $1 $3 }
| expr '&&' expr { Binop And $1 $3 }
| expr '||' expr { Binop Or $1 $3 }
| expr '**' expr { Binop Power $1 $3 }
| '-' expr %prec NEG { Unop Neg $2 }
| '*' expr %prec NEG { Deref $2 }
-- A hack to get around having the power operator
| '**' expr %prec NEG { Deref (Deref $2) }
| '&' expr %prec NEG { Addr $2 }
| '!' expr { Unop Not $2 }
| expr '=' expr { Assign $1 $3 }
| id '(' actuals_opt ')' { Call (pack $1) $3 }
| '(' typ ')' expr %prec NEG { Cast $2 $4 }
| expr '.' expr { Access $1 $3 }
| expr '->' expr { Access (Deref $1) $3}
| sizeof '(' typ ')' { Sizeof $3 }
| '(' expr ')' { $2 }
actuals_opt:
{- empty -} { [] }
| actuals_list { reverse $1 }
actuals_list:
expr { [$1] }
| actuals_list ',' expr { $3 : $1 }
Finally, the footer contains the parseError function as well as some utilities.
{
parseError _ = error "Unable to parse tokens"
fst (a, _, _) = a
snd (_, b, _) = b
thd (_, _, c) = c
}
Cabal automatically knows how to call alex and happy to generate the relevant haskell modules, which will provide the functions alexScanTokens :: String -> [Lexeme]5 and parse :: [Lexeme] -> Program.
(Full source of the happy parser here.)
Megaparsec
Now, we'll write the same scanner and parser using megaparsec. We could actually reuse the tokens generated by alex with megaparsec and skip scanning, but it's instructive to see how megaparsec handles the whole process (I also had the megaparsec version working long before I even considered using a parser generator). For further reading, I recommend the official megaparsec tutorial.
For the scanner, we have our standard import boilerplate as well as the definition of our parser type as the vanilla Parsec monad, with no custom error component, consuming Text.
module Microc.Scanner.Combinator where
import Data.Void
import Data.Char
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L
import Data.Text ( Text )
import qualified Data.Text as T
import Control.Monad ( void )
import Data.String.Conversions
type Parser = Parsec Void Text
We then define our "space consumer," which we use to create higher order functions to deal with lexemes and symbols so that they intelligently handle whitespace.
sc :: Parser ()
sc = L.space space1 lineCmnt blockCmnt
where
lineCmnt = L.skipLineComment "//"
blockCmnt = L.skipBlockComment "/*" "*/"
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc
symbol :: Text -> Parser Text
symbol = L.symbol sc
Then, we create some convenience functions to handle paired delimiters and common symbols.
parens :: Parser a -> Parser a
parens = between (symbol "(") (symbol ")")
braces :: Parser a -> Parser a
braces = between (symbol "{") (symbol "}")
dquotes :: Parser a -> Parser a
dquotes = between (single '"') (single '"')
squotes :: Parser a -> Parser a
squotes = between (single '\'') (single '\'')
semi :: Parser ()
semi = void $ symbol ";"
comma :: Parser ()
comma = void $ symbol ","
star :: Parser ()
star = void $ symbol "*"
To handle reserved words, we attempt to parse a given string, making sure it isn't followed by any alphanumeric characters. If we fail, we backtrack the parser so that it doesn't get stuck in the middle of a word.
rword :: Text -> Parser ()
rword w = (lexeme . try) (string w *> notFollowedBy alphaNumChar)
rws :: [Text] -- list of reserved words
rws =
[ "if"
, "then"
, "else"
, "while"
, "true"
, "false"
, "for"
, "int"
, "bool"
, "char"
, "float"
, "void"
, "return"
, "struct"
, "NULL"
, "sizeof"
]
Like with alex, we take advantage of haskell's read function to parse strings and character literals.
strlit :: Parser Text
strlit = do
content <- dquotes $ takeWhileP Nothing (/= '"')
pure $ T.pack (read ('"' : cs content ++ "\""))
charlit :: Parser Int
charlit =
squotes $ (ord <$> satisfy (`notElem` ['\\', '\'']))
<|> (single '\\' >> int)
Identifiers are a little more cumbersome in megaparsec than the $alpha [$alpha $digit \_]* rule in alex, because we also have to check that they aren't the same as any reserved words.
identifier :: Parser Text
identifier = (lexeme . try) (p >>= check)
where
p = fmap T.pack $ (:) <$> letterChar
<*> many (alphaNumChar <|> single '_')
check x = if x `elem` rws
then fail $ "keyword " <> show x <> " cannot be an identifier"
else return x
Lexing int's and float's, however, is already implemented for us by the library, so we just wrap the definitions in our lexeme combinator.
int :: Parser Int
int = lexeme L.decimal
float :: Parser Double
float = lexeme L.float
(Full source for the megaparsec scanner here.)
The parser begins similar to the scanner.
module Microc.Parser.Combinator
( programP
, runParser
, errorBundlePretty
)
where
import Microc.Ast
import Microc.Scanner.Combinator
import Text.Megaparsec
import Control.Monad.Combinators.Expr
import Control.Applicative (liftA2, liftA3)
import Data.Either
To handle operator precedence, we construct a value of type [[Operator Parser Expr]] which we will later pass to makeExprParser. Ordering is from highest to lowest precedence, with operators in the same list having equal priority. Each constructor of the Operator type takes a parser of type Parser (Expr -> Expr -> Expr) or Parser (Expr -> Expr), depending on whether the operator is binary or unary, and dispatches said parser appropriately for each operator.
Some contortions are necessary to handle chained prefix operators, like multiple pointer dereferences and double negation, and operators that are prefixes of other operators, like | for bitwise or and || for logical or. Had we reused the tokens from alex, this would not be a problem.
opTable :: [[Operator Parser Expr]]
opTable =
[ [ InfixL $ Access <$ symbol "."
, InfixL $ (\lhs rhs -> Access (Deref lhs) rhs) <$ symbol "->"
]
, [ unary (Unop Neg) "-"
, unary (Unop Not) "!"
, unary Deref "*"
, unary Addr "&"
]
, [infixR Power "**"]
, [infixL Mult "*", infixL Div "/"]
, [infixL Add "+", infixL Sub "-"]
, [infixL Leq "<=", infixL Geq ">=", infixL Less "<", infixL Greater ">"]
, [infixL' Equal "==", infixL Neq "!="]
, [infixL' BitAnd "&"]
, [infixL' BitOr "|"]
, [infixL' And "&&"]
, [infixL' Or "||"]
, [InfixR $ Assign <$ symbol "="]
]
where
-- Megaparsec doesn't support multiple prefix operators by default,
-- but we need this in order to parse things like double negatives,
-- nots, and dereferences
unary op sym = Prefix $ foldr1 (.) <$> some (op <$ symbol sym)
infixL op sym = InfixL $ Binop op <$ symbol sym
-- Primed infixL' is useful for operators which are prefixes of other operators
infixL' op sym = InfixL $ Binop op <$ operator sym
infixR op sym = InfixR $ Binop op <$ symbol sym
operator sym = lexeme $ try (symbol sym <* notFollowedBy opChar)
opChar = oneOf ("!#$%&*+./<=>?@\\^|-~" :: [Char])
We can now write the rest of our expression parser.
termP :: Parser Expr
termP = try (Cast <$> parens typeP <*> exprP)
<|> parens exprP
<|> Null <$ rword "NULL"
<|> try (Fliteral <$> float)
<|> Literal <$> int
<|> BoolLit <$> (True <$ rword "true" <|> False <$ rword "false")
<|> Sizeof <$> (rword "sizeof" *> parens typeP)
<|> try (Call <$> identifier <*> parens (exprP `sepBy` comma))
<|> CharLit <$> charlit
<|> StrLit <$> strlit
<|> Id <$> identifier
exprP :: Parser Expr
exprP = makeExprParser termP opTable
exprMaybe :: Parser Expr
exprMaybe = option Noexpr exprP
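We can sanity-check both the precedence table and the chained-prefix handling in GHCi:
ghci> parseTest exprP "1 + 2 * 3"
Binop Add (Literal 1) (Binop Mult (Literal 2) (Literal 3))
ghci> parseTest exprP "!!true"
Unop Not (Unop Not (BoolLit True))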
Unlike in our parser generator, we need to be very careful about the ordering of our alternatives and where to include the try combinator, so that our parser doesn't end up in an invalid state or match on characters too eagerly.
Struct, type, and variable declarations are all straightforward.
structP :: Parser Struct
structP = Struct <$> (rword "struct" *> identifier) <*> braces (many vdeclP) <* semi
typeP :: Parser Type
typeP = do
baseType <- TyInt <$ rword "int"
<|> TyBool <$ rword "bool"
<|> TyFloat <$ rword "float"
<|> TyChar <$ rword "char"
<|> TyVoid <$ rword "void"
<|> TyStruct <$> (rword "struct" *> identifier)
foldr (const Pointer) baseType <$> many star
vdeclP :: Parser Bind
vdeclP = Bind <$> typeP <*> identifier <* semi
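For example:
ghci> parseTest typeP "struct point **"
Pointer (Pointer (TyStruct "point"))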
Statements and function definitions are very similar to what we wrote in happy. For dangling elses, we don't have to do anything particularly special, as megaparsec isn't in the class of parsers that check for grammar ambiguities; the first alternative that succeeds wins.
statementP :: Parser Statement
statementP = Expr <$> exprP <* semi
<|> Return <$> (rword "return" *> exprMaybe <* semi)
<|> Block <$> braces (many statementP)
<|> ifP
<|> forP
<|> whileP
ifP :: Parser Statement
ifP = liftA3 If (rword "if" *> parens exprP) statementP maybeElse
where maybeElse = option (Block []) (rword "else" *> statementP)
forP :: Parser Statement
forP = do
rword "for"
(e1, e2, e3) <- parens
$ liftA3 (,,) (exprMaybe <* semi) (exprP <* semi) exprMaybe
For e1 e2 e3 <$> statementP
whileP :: Parser Statement
whileP = liftA2 While (rword "while" *> parens exprP) statementP
fdeclP :: Parser Function
fdeclP = Function <$> typeP <*> identifier <*> formalsP
<*> (symbol "{" *> many vdeclP)
<*> (many statementP <* symbol "}")
formalsP :: Parser [Bind]
formalsP = parens $ formalP `sepBy` comma
where formalP = liftA2 Bind typeP identifier
Finally, a full program is potentially some whitespace followed by a list of structs and global variables followed by a list of functions and the end of the file.
programP :: Parser Program
programP = between sc eof $ do
structsOrGlobals <- many $ try (Left <$> structP) <|> (Right <$> try vdeclP)
let structs = lefts structsOrGlobals
globals = rights structsOrGlobals
Program structs globals <$> many fdeclP
(Full source of the megaparsec parser here.)
Wiring everything together
Now that we've written our parsers, presumably we'll want to run them on some actual source files. First, we'll create a Microc.hs file in src/ that exports all of the Microc modules that we've written so far.
module Microc
( module X
)
where
import Microc.Ast as X
import Microc.Scanner.Combinator as X
import Microc.Parser.Combinator as X
import Microc.Scanner.Generator as X
import Microc.Parser.Generator as X
In app/Main.hs, we'll set up an options parser using optparse-applicative so we can tell our executable to pretty print the parsed AST, the semantically checked AST, or the LLVM output, or to compile the executable. The run option is just a convenience that will compile the executable, run it, and then clean up intermediate files.
module Main where
-- Microc parser conflicts with Options.Applicative parser
import Microc hiding ( Parser )
import Options.Applicative
import LLVM.Pretty
import Data.String.Conversions
import qualified Data.Text as T
import qualified Data.Text.IO as T
import Text.Pretty.Simple
import Data.Text.Prettyprint.Doc
import Data.Text.Prettyprint.Doc.Render.Text
data Action = Ast
-- TODO
| Sast | LLVM | Compile FilePath | Run
data ParserType = Combinator -- ^ megaparsec
| Generator -- ^ alex/happy
data Options = Options { action :: Action
, infile :: FilePath
, parser :: ParserType }
We want the behavior of our executable to be:
mcc --ast <file> # prints the ast of the file
mcc --ast --generator <file> # parses using alex/happy and prints the ast
mcc --sast <file> # prints the semantically checked ast of the file
mcc --llvm <file> # prints the generated llvm bytecode
mcc --compile <file> -o a.out # produces an executable a.out
mcc <file> # compiles the file and prints the output from running it
# etc.
This can be mostly accomplished using the flag' combinator, which allows the use of --options as toggles.
actionP :: Parser Action
actionP =
flag' Ast (long "ast" <> short 'a' <> help "Pretty print the ast")
<|> flag' Sast (long "sast" <> short 's' <> help "Pretty print the sast")
<|> flag'
LLVM
(long "llvm" <> short 'l' <> help "Pretty print the generated llvm")
<|> flag' Compile
(long "compile" <> short 'c' <> help "Compile to an executable")
<*> strOption (short 'o' <> value "a.out" <> metavar "FILE")
-- running the file to see the expected output is default
<|> pure Run
parserP :: Parser ParserType
parserP =
flag'
Combinator
(long "combinator" <> help "Use the megaparsec parser implementation (default).")
<|> flag'
Generator
(long "generator" <> short 'g' <> help "Use alex and happy to parse.")
<|> pure Combinator -- default to megaparsec
optionsP :: Parser Options
optionsP =
Options
<$> actionP
<*> strArgument (help "Source file" <> metavar "FILE")
<*> parserP
The main logic lives in runOpts :: Options -> IO (), which is called by main in conjunction with the standard invocation of execParser from optparse-applicative.
main :: IO ()
main = runOpts =<< execParser (optionsP `withInfo` infoString)
where
withInfo opts desc = info (helper <*> opts) $ progDesc desc
infoString
= "Run the mcc compiler on the given file. \
\Passing no flags will compile the file, execute it, and print the output."
runOpts :: Options -> IO ()
runOpts (Options action infile ptype) = do
program <- T.readFile infile
let parseTree = case ptype of
Combinator -> runParser programP infile program
Generator -> Right $ parse . alexScanTokens $ T.unpack program
case parseTree of
Left err -> putStrLn $ errorBundlePretty err
Right ast -> case action of
Ast -> putDoc $ pretty ast <> "\n"
_ -> error "Not yet implemented"
(Full source for app/Main.hs here. Note that it is for the complete compiler, including the passes that we haven't yet discussed.)
Testing the compiler
For our compiler, we will primarily test with golden tests provided by tasty-golden. Golden tests are essentially files containing the expected outputs of running various commands. In our case, we will have two directories in tests: tests/pass for programs that are expected to run successfully and produce output, and tests/fail for programs with semantic errors that we hope to emit moderately helpful error messages for. Each directory contains pairs of files cool_program.mc and cool_program.golden with the source and expected output of running mcc cool_program.mc, respectively.
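For example, a hypothetical pair might look like the following (I'm assuming a built-in print function here; built-ins aren't covered until the codegen post, so treat this as purely illustrative):
// tests/pass/hello.mc
int main()
{
  print(42);
  return 0;
}
with tests/pass/hello.golden containing the expected output:
42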
However, since we have yet to write the semantic analyzer or code generation passes, we can't test them yet. All we can test at this stage is that our combinator and generator parsers produce identical ASTs when successful, and that neither parser succeeds where the other one fails. We won't attempt to test whether they return the same error messages on failure, because rewriting the parser generator to emit real error messages is very time consuming and would require an overhaul of its entire structure. Instead, we resort to lifting the parser generator functions into IO, where we can catch exceptions (yes, haskell does allow this, although it's rarely done).
We construct a test tree by going through the pass and fail directories and running both parsers on each file.
module Main where
import Test.Tasty ( defaultMain
, TestTree
, testGroup
)
import Test.Tasty.Golden
import Test.Tasty.HUnit
import System.FilePath ( takeBaseName
, replaceExtension
)
import Microc
import Data.String.Conversions
import qualified Data.Text.IO as T
import Data.Text ( Text )
import Data.Text.Prettyprint.Doc
import Data.Text.Prettyprint.Doc.Render.Text
import Control.Monad
import Control.Exception
parsing :: IO TestTree
parsing = do
files <- concat <$> mapM (findByExtension [".mc"])
["tests/pass", "tests/fail"]
fmap (testGroup "parsing") $ forM files $ \file -> do
input <- T.readFile file
combinator <- pure $ runParser programP file input
generator <-
try . evaluate . parse . alexScanTokens $ cs input :: IO
(Either IOError Program)
pure . testCase file $ case (combinator, generator) of
(Right ast, Right ast') -> assertEqual file ast ast'
(Left _ , Left _ ) -> pure ()
_ -> assertFailure file
main :: IO ()
main = defaultMain =<< parsing
Once we've implemented more of the compiler, we'll extend the test runner with more test groups. Using Tasty's testGroup functionality, we can restrict a run to a subset of the tests; for example, nix-shell --pure --run "cabal new-run testall -- --pattern 'parsing'" runs only the 'parsing' tests.
(Full source for tests/Testall.hs here. Same as with Main, it contains material to be covered in future posts.)
Addendum: My informal comparison of parser combinators vs. generators
Combinators | Generators
---|---
Easier to write - vanilla Haskell | Custom, weird syntax that no one ever remembers offhand
Decent default error messages | Good parse errors are possible, but they take significant work
Lots of tutorials and blog posts | Scarce documentation aside from the official manuals
Works on String, strict/lazy ByteString, and Text | Only String and lazy ByteString
Grammar sensitive to ordering of alternatives - need to be careful about exploding runtime with try | Robust to ordering of alternatives - automatically uses longest match
No way to tell if grammar is ambiguous | You'll get a lovely shift-reduce conflict if your grammar has any ambiguities
Used by hnix, Idris | Used by GHC
Performance-wise, it's unclear to me which one is faster, as I haven't benchmarked anything. My guess is that megaparsec can run very quickly with careful optimization and minimal use of try, whereas alex/happy is pretty fast by default. In the current state of the codebase, it would be very unfair to compare the two, as megaparsec reads Text whereas alex reads String. A fair comparison would have to use ByteString.
Anyway, thanks for reading until the end! Stay tuned for part 2, semantic analysis.
1. Even though this is a toy compiler, we'll use Text instead of String because you should never use String. Since we won't explicitly support unicode in the source, and real C certainly doesn't either, we could use ByteString, but that's not really the type we want to represent human-readable text.
2. In a future post, I might try defining an Arbitrary instance for the AST, so we can check that pretty_print . parse == id with property testing.
3. I'd be really surprised to see this happen in practice, as no one is writing gigantic microc files (I hope), but it's good form to mention this pitfall anyway.
4. Note that it's possible to write happy parsers without dictating precedence rules at all, instead breaking up different precedence operators into separate rules, with the rules for higher precedence operators coming later. However, this involves a lot of duplication, and some things like prefix negation are tricky to get right that way, so it's generally easier to use the precedence tables when they work. A common case where they lead to ambiguity, though, is parsing function application by juxtaposition, like in haskell. For that, you either need to tokenize whitespace and use it as a high precedence operator, or construct special fallthrough rules.
5. I know I said we shouldn't use String, but alex doesn't support Text, only String and lazy ByteString, so I picked the semantically correct option over the performant one.