Build a Compiler — Lexer

Marc Auberer
4 min read · Apr 2, 2024


Before starting to think about our lexer, we should first define what our demo language is supposed to do. Let’s say we want a simple language that allows declaring variables and doing calculations using the four basic arithmetic operations +, -, * and /. Variables can be of type int or type double. The language should also let us print a value via a builtin function called print, which takes either an int or a double variable/immediate as argument.
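A program in our demo language might look like this (the concrete syntax is our own choice; this sample just exercises the features listed above):

```
int a = 6;
double b = 3.5;
int c = a * 2;
print(c);
print(b + 0.5);
```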

What is a lexer?

Let’s say we have the following code snippet as test input to our compiler:

int d = a + b * c;

The first compile step, the lexing phase, focuses on the lexical structure of the code. It takes the source text as input and produces a stream or list of so-called “tokens”. A token is the representation of a single, atomic syntactic unit of a particular language. In natural languages, this corresponds to the concept of a word within a text. All keywords, operators and data types are tokens, and literals and identifiers are considered tokens as well.
Furthermore, the lexer may remove unwanted noise like whitespace or comments from the input, so only the relevant information remains.

Here’s the tokenized version of the above example with | as separator:

int|d|=|a|+|b|*|c|;

Each token consists of the token type, the token text and, ideally, the code location where the token lives in the input source file. The location allows the later compile phases to print nicer error messages.
Since the set of token types is finite, an enumeration can be used as the data type for the token type; the token text can be stored as a string, and the code location may be encoded in some way, e.g. L3C9 for line 3, column 9.

Here is the list of tokens that will be produced for the given example:

(TOK_INT, "int", CodeLoc(1, 1))
(TOK_IDENTIFIER, "d", CodeLoc(1, 5))
(TOK_ASSIGN, "=", CodeLoc(1, 7))
(TOK_IDENTIFIER, "a", CodeLoc(1, 9))
(TOK_PLUS, "+", CodeLoc(1, 11))
(TOK_IDENTIFIER, "b", CodeLoc(1, 13))
(TOK_MUL, "*", CodeLoc(1, 15))
(TOK_IDENTIFIER, "c", CodeLoc(1, 17))
(TOK_SEMICOLON, ";", CodeLoc(1, 18))

Exercise

Now that we know what the lexer should do, let’s set up our compiler project to implement the lexer in C++.

You can start off by creating a GitHub repository from the provided template. Simply click on the button Use this template -> Create a new repository to create a GitHub repository under your name, based on this template.

If you are stuck or want to see how the compiler will look at the end of the series, you can take a look at the main repo.

Overview

Since the lexer should only be responsible for tokenizing the input stream, we need another abstraction layer for it to work with: the reader. The reader is responsible for opening and holding the file handle, reading the contents of the source file char by char, and providing an interface for the lexer. The lexer uses this interface to request chars, and whenever a char sequence matches a known token, that token gets written to the token output stream.

Reader

Let’s start implementing the reader component for our toy compiler. Create the four files Reader.h and Reader.cpp as well as CodeLoc.h and CodeLoc.cpp in the subdirectory src/reader/.

Create the struct CodeLoc in CodeLoc.h that stores a line and a column number and offers a print() method returning both as a string.
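A minimal sketch of what CodeLoc could look like — the member names and the exact output format are my own choices, following the L3C9 encoding mentioned earlier:

```cpp
// CodeLoc.h (sketch)
#include <cstdint>
#include <string>

struct CodeLoc {
  uint32_t line;
  uint32_t col;

  // Returns e.g. "L3C9" for line 3, column 9
  std::string print() const {
    return "L" + std::to_string(line) + "C" + std::to_string(col);
  }
};
```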

For the reader, create the class Reader in Reader.h that offers the following methods for interacting with the lexer:

  • char getChar()
  • CodeLoc getCodeLoc()
  • void advance()
  • void expect(char)
  • bool isEOF()

Implement the logic yourself. If you have problems, you can take a look at the main repo as mentioned above.
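In case a starting point helps, here is one possible shape for the Reader. To keep the sketch self-contained it reads from an in-memory string instead of holding a std::ifstream, and it re-declares a bare CodeLoc locally; the interface matches the method list above:

```cpp
// Reader sketch. The real Reader would open a file and hold the file handle;
// this version reads from a string so the example runs standalone.
#include <stdexcept>
#include <string>

struct CodeLoc { unsigned line; unsigned col; };  // stand-in for CodeLoc.h

class Reader {
public:
  explicit Reader(std::string src) : source(std::move(src)) {}

  // Peek at the current character without consuming it
  char getChar() const { return isEOF() ? '\0' : source[pos]; }

  CodeLoc getCodeLoc() const { return {line, col}; }

  // Consume the current character, tracking line/column numbers
  void advance() {
    if (isEOF()) return;
    if (source[pos] == '\n') { line++; col = 1; } else { col++; }
    pos++;
  }

  // Consume the current character, checking it matches the expected one
  void expect(char c) {
    if (getChar() != c) throw std::runtime_error("unexpected character");
    advance();
  }

  bool isEOF() const { return pos >= source.size(); }

private:
  std::string source;
  size_t pos = 0;
  unsigned line = 1;
  unsigned col = 1;
};
```

Tracking line and column inside advance() is what lets getCodeLoc() stay a trivial getter.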

Lexer

The lexer now uses the reader to read the source file char by char and construct tokens from the char input stream. This works in a lazy way: the reader does not push new chars to the lexer; instead, the lexer requests new chars from the reader as needed.

Create the three files Token.h, Lexer.h and Lexer.cpp in the src/lexer/ subdirectory.

Token.h should simply contain an enumeration TokenType of all known token types and a Token struct that stores the token type, the token text and a CodeLoc object holding information on exactly where the token can be found in the source file.
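A sketch of what Token.h could contain — the enumerator names follow the token list from the first example, and the additional literal/paren enumerators are my own guesses for what the language from the introduction will need:

```cpp
// Token.h (sketch)
#include <string>

enum class TokenType {
  TOK_INT, TOK_DOUBLE, TOK_PRINT,                       // keywords
  TOK_IDENTIFIER, TOK_INT_LIT, TOK_DOUBLE_LIT,          // identifiers/literals
  TOK_ASSIGN, TOK_PLUS, TOK_MINUS, TOK_MUL, TOK_DIV,    // operators
  TOK_LPAREN, TOK_RPAREN, TOK_SEMICOLON,                // punctuation
  TOK_EOF, TOK_INVALID
};

struct CodeLoc { unsigned line; unsigned col; };  // stand-in for CodeLoc.h

struct Token {
  TokenType type = TokenType::TOK_INVALID;
  std::string text;
  CodeLoc loc{1, 1};
};
```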

Lexer.h contains the class Lexer, which offers the following methods for the parser to use later:

  • const Token& getToken()
  • void advance()
  • void expect(TokenType)
  • void expectOneOf(const std::initializer_list<TokenType>&)
  • bool isEOF()
  • CodeLoc getCodeLoc()

Furthermore, it should hold an instance of the Reader and keep track of the current token in a member variable.
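To show how the pieces fit together, here is one possible shape for the Lexer, handling only the tokens from the int d = a + b * c; example. The reader is inlined as a string scan so the sketch runs standalone; in the project, those few lines would be calls into the Reader:

```cpp
// Lexer sketch for a subset of the token types (identifiers, "int",
// =, +, *, ;). Error handling is reduced to exceptions for brevity.
#include <cctype>
#include <initializer_list>
#include <stdexcept>
#include <string>

enum class TokenType { TOK_INT, TOK_IDENTIFIER, TOK_ASSIGN, TOK_PLUS,
                       TOK_MUL, TOK_SEMICOLON, TOK_EOF };
struct CodeLoc { unsigned line = 1; unsigned col = 1; };
struct Token { TokenType type; std::string text; CodeLoc loc; };

class Lexer {
public:
  explicit Lexer(std::string src) : source(std::move(src)) { advance(); }

  const Token& getToken() const { return curTok; }
  CodeLoc getCodeLoc() const { return curTok.loc; }
  bool isEOF() const { return curTok.type == TokenType::TOK_EOF; }

  // Construct the next token from the char stream
  void advance() {
    while (pos < source.size() && std::isspace((unsigned char)source[pos]))
      consume();                                   // skip noise
    CodeLoc start = loc;
    if (pos >= source.size()) { curTok = {TokenType::TOK_EOF, "", start}; return; }
    char c = source[pos];
    if (std::isalpha((unsigned char)c)) {          // keyword or identifier
      std::string text;
      while (pos < source.size() && std::isalnum((unsigned char)source[pos])) {
        text += source[pos];
        consume();
      }
      TokenType type = (text == "int") ? TokenType::TOK_INT
                                       : TokenType::TOK_IDENTIFIER;
      curTok = {type, text, start};
      return;
    }
    consume();
    switch (c) {                                   // single-char tokens
      case '=': curTok = {TokenType::TOK_ASSIGN, "=", start}; return;
      case '+': curTok = {TokenType::TOK_PLUS, "+", start}; return;
      case '*': curTok = {TokenType::TOK_MUL, "*", start}; return;
      case ';': curTok = {TokenType::TOK_SEMICOLON, ";", start}; return;
      default: throw std::runtime_error("unexpected character");
    }
  }

  void expect(TokenType type) {
    if (curTok.type != type) throw std::runtime_error("unexpected token");
    advance();
  }

  void expectOneOf(const std::initializer_list<TokenType>& types) {
    for (TokenType t : types)
      if (curTok.type == t) { advance(); return; }
    throw std::runtime_error("unexpected token");
  }

private:
  void consume() {
    if (source[pos] == '\n') { loc.line++; loc.col = 1; } else { loc.col++; }
    pos++;
  }

  std::string source;
  size_t pos = 0;
  CodeLoc loc;
  Token curTok;
};
```

Note that the constructor immediately calls advance(), so getToken() always returns a valid current token — that invariant is what makes the parser-facing interface simple.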

With this done, our Lexer and Reader should work. You can try them out by running the reader and lexer together on a demo input file and printing the lexed tokens to the console.

Continue reading the next article


Marc Auberer

C++ Compiler Developer at SAP. Passionate software developer and OpenSource enthusiast.