Build a Compiler — Symbol Table

Marc Auberer
4 min readApr 2, 2024

--

Now that we know that our input program is syntactically correct, we can enter semantic analysis territory. One of the most important rules that most programming languages enforce, is the correct use of identifiers. For example, the compiler needs to make sure, that variable a can only be read from if it was declared earlier. Or it needs to prohibit declaring two symbols with the same name in the same scope.

For that kind of checks, we need to keep track of which symbol names exist in which lexical or semantic scope. In practice, compilers generate so-called symbol tables for that. It depends on the implementation whether a single big symbol table is used or if there are several smaller symbol tables for each scope.

Here is a piece of sample code written in C:

int main() {
int a = 123;
int b = 321;
if (a % 2 == 0) {
int c = a + b;
b = c * a;
}
int c = b - a + 1;
return 0;
}

A compiler could produce a symbol table similar to this one:

scope     | symbol name | type | declNode | used
-------------------------------------------------
root | main | ? | <ptr> | true
main.l1c0 | a | ? | <ptr> | true
main.l1c0 | b | ? | <ptr> | true
if.l4c2 | c | ? | <ptr> | true
main.l1c0 | c | ? | <ptr> | false

While filling the symbol table with entries, we can check if the scope, where we are about to add a new entry, already contains an entry with the same name. If this is the case, we trigger a compile error.

The data types of the symbol entries are unknown in this stage and we don’t need to know them at this point, because all type-related checks will be performed by the type checker. The symbol table is also a good place to track the usage of the individual variables. For instance, a warning could be triggered when the compiler sees some variable left unused.

Some compilers, like the Rust compiler uses the symbol table to keep track of the lifetimes of symbols. This can be useful for further checks down the pipeline, namely borrow checking, which we don’t cover in this article series.

Many languages support “shadowing” of variables. Let me illustrate this with an example:

int main() {
int a = 123;
int b = 0;
if (a % 2 == 0) {
int a = 456;
b = a; // <- Which one of the a's should be used?
}
}

In C, the innermost symbol with the desired name, visible to the consumer, is always preferred. This means that the compiler will start searching for candidate symbols in the scope where the assignment b = a; happens and will find symbol “a” there. If there was no symbol “a” in this scope, the compiler would continue searching for the symbol in the parent scope and so on.

For our little toy programming language this is irrelevant because we do not support language constructs that provide additional scopes. We only have one single scope, the root scope, to work on.

Exercise

Let’s extend our small compiler with a symbol table implementation. For that, first create the eight source files SymbolType.h, SymbolType.cpp, SymbolTableEntry.h, SymbolTableEntry.cpp, SymbolTable.h, SymbolTable.cpp, SymbolTableBuilder.h and SymbolTableBuilder.cpp in the subdirectory /src/symboltable/. In SymbolType.h create a class SymbolType, which will act as a container for a type of an arbitrary expression. As mentioned before, for the symbol super type, we can use an enumeration. Add the enum SymbolSuperType, with the possible super types TY_INVALID, TY_INT and TY_DOUBLE to the top of the header file.

Add the following three methods to the class:

  • bool is(SymbolSuperType) const
  • bool isOneOf(const std::initializer_list<SymbolSuperType>&) const
  • std::string getName() const

Optionally, you can add the operator overloading functions to compare two symbol type classes:

  • friend bool operator==(const SymbolType&, const SymbolType&)
  • friend bool operator!=(const SymbolType&, const SymbolType&)

And finally, add a member of type SymbolSuperType, called superType to the class. Implement the definitions of the declared functions in the cpp file.

Also add a plain old data class (POD), called SymbolTableEntry, that contains the members:

  • const std::string name
  • SymbolType type (with default value TY_INVALID)
  • ASTNode* declNode

Now we need a table that can be assigned to each scope and contains a mapping from symbol name to SymbolTableEntry. For that, create the class SymbolTable in SymbolTable.h. The class contains two members:

  • SourceFile* sourceFile
  • std::unordered_map<std::string, SymbolTableEntry> symbols

Also add these two methods. They will serve as an interface for the compiler to insert and lookup symbols:

  • SymbolTableEntry* insert(const std::string&, ASTNode* declNode)
  • SymbolTableEntry* lookup(const std::string&)

Now that we have the boilerplate code in place, we can write the actual compiler pass, the SymbolTableBuilder, that does the AST visit and fills the symbol table with symbols. Add the class SymbolTableBuilder in SymbolTableBuilder.h and let it inherit from CompilerPass and ASTVisitor. Implement the necessary visit methods. Here you can find an example solution to compare with.

Continue reading the next article

--

--

Marc Auberer

C++ Compiler Developer at SAP. Passionate software developer and OpenSource enthusiast.