Build a Compiler — Parser

Marc Auberer
4 min readApr 2, 2024

--

Next, we want our compiler to gain understanding of the context, in which the individual tokens are aligned, so that we can derive what the program looks like syntactically. For this we can use a so-called parser.

What is a parser?

A parser transforms a given token stream to an abstract syntax tree (AST for short), which represents the source code in a tree structure and can be enriched with information for/by later compile phases. The parser needs a grammar to work with. For compilers, this is a context-free (Chomsky type 2) set of rules, which depend on one another.

For our example language, this would be the grammar, noted in extended Backus-Naur form (EBNF). # is the end of file here:

entry ::= stmtLst #
stmtLst ::= stmt*
stmt ::= (declStmt | printCall) TOK_SEMICOLON
declStmt ::= dataType TOK_IDENTIFIER TOK_ASSIGN additiveExpr
additiveExpr ::= multiplicativeExpr ((PLUS | MINUS) multiplicativeExpr)?
multiplicativeExpr ::= atomicExpr ((MUL | DIV) atomicExpr)?
atomicExpr ::= constant | TOK_IDENTIFIER | TOK_LPAREN additiveExpr TOK_RPAREN
constant ::= TOK_INT_LIT | TOK_DOUBLE_LIT
printCall ::= TOK_PRINT TOK_LPAREN additiveExpr TOK_RPAREN
dataType ::= TOK_TYPE_INT | TOK_TYPE_DOUBLE

Here, all symbols on the left side of the ::= , like stmt or atomicExpr are called non-terminal symbols. All symbols in caps, like ASSIGN or DOUBLE_LIT are called terminal symbols. Terminal symbols are essentially tokens, so they are part of the input. Non-terminals are rules to match any number of terminal symbols. In the above grammar, the start symbol is the non-terminal entry.

In our example project, we will use a recursive descent parser, which is a top-down LL(k) parser. k can be obtained by analyzing the grammar for rule ambiguities. If each and every rule begins with a different terminal symbol, there are no ambiguities at all and we have a LL(1) grammar. In a case where we have two or more rules starting with the same terminal symbol, we need a lookahead of 2 to undoubtedly recognize a rule by means of the input tokens, so we have a LL(2) grammar.

To know if we have any rule ambiguities, we can leverage the so-called first, follow and selection sets.

First Set

Each non-terminal has it’s own first set. Here is the definition:

The first set contains all terminal symbols that can start a grammar rule. If the rule can derive to epsilon (ε), epsilon is also included in the first set. Terminal symbols can also have first sets. The first set of a terminal is always the terminal itself.

Here you can see the first set of a few rules from above:

First(stmtLst) = {TYPE_INT, TYPE_DOUBLE, PRINT, ε}
First(stmt) = {TYPE_INT, TYPE_DOUBLE, PRINT}
First(atomicExpr) = {INT_LIT, DOUBLE_LIT, IDENTIFIER, LPAREN}
First(LPAREN) = {LPAREN}

As an exercise, determine the first set for the rest of the rules. See here for the solution.

Follow Set

In extension to the first sets, we can also determine which terminal symbols follow each parse rule, by building the follow set of the corresponding non-terminal symbol.

The follow set of non-terminal X contains all terminal symbols, which are included in a first set of a symbol that follows X. If X can be the last symbol of the input, the follow-set also includes the EOF (#) terminal.

Follow(entry) = {#}
Follow(multiplicativeExpr) = {PLUS, MINUS, SEMICOLON, RPAREN}
Follow(printCall) = {SEMICOLON}

Again, as an exercise, determine the follow set for the missing rules. See here for the solution.

Selection Set

The selection set is a combination between the first and follow set and can be used by a parser to decide upon a rule to match a given input with.

If the first set of X does not contain epsilon, the selection set of X is equivalent to the first set of X. If this is not the case, the selection set of X is the first set of X excluding epsilon, combined with the follow set of X.

Select(stmtLst) = First(stmtLst)\ε U Follow(stmtLst) = {TYPE_INT, TYPE_DOUBLE, PRINT, #}
Select(declStmt) = First(declStmt) = {TYPE_INT, TYPE_DOUBLE}
Select(additiveExpr) = First(additiveExpr) = {INT_LIT, DOUBLE_LIT, IDENTIFIER, LPAREN}

Once again, determine the selection set for all remaining rules. See here for the solution.

Exercise

Now, let’s apply the learned theory in practice and extend our toy language project with a parser component.

AST nodes

Before implementing the parser, we first have to provide the AST nodes, that the parser should instantiate. For that, we create ASTNodes.h in the subdirectory src/ast/. Create a class for each non-terminal in our grammar, e.g. ASTStmtNode or ASTMultiplicativeExprNode. All node classes should inherit from a common base class ASTNode. This base class has the following members:

  • ASTNode* parent
  • std::vector<ASTNode*> children
  • CodeLoc codeLoc
  • SymbolType symbolType

The base class also offers at least the following methods:

  • void addChild(T*)
  • T* getChild(size_t)
  • std::vector<T*> getChildren()

Parser

To implement the parser for our example language, create the two files Parser.h and Parser.cpp in the subdirectory src/parser/.

In Parser.h, create the class Parser. Create a method for each non-terminal symbol in our grammar, like e.g. parseDeclStmt() or parsePrintCall(). Implement these methods by consuming the tokens from the Lexer. For each parsed rule, the respective AST node should be created, so that the entry node’s parse method returns the pointer to the complete AST when parsing is finished. For creating and concluding AST nodes, you might find it useful to create two methods createNode() and concludeNode(T* node) and call those from the individual parse methods.

Feel free to look at the finished parser solution if you are done or experience any problems.

Continue reading the next article

--

--

Marc Auberer

C++ Compiler Developer at SAP. Passionate software developer and OpenSource enthusiast.