Build a Compiler — IR Generator

Marc Auberer
4 min readApr 2, 2024

--

In order to effectively optimize the program it can be beneficial to transform the program to a so called “Intermediate Representation” or IR for short. IR serves as an abstraction level between the compiler frontend with its high level language constructs and the low level constructs like machine instructions. IR is platform agnostic and is especially designed for facilitating optimizations. This helps the optimizer to do its work, independent from the source language and also independent from the requirements of the actual target machine. IR can have different forms. LLVM uses the Static Single Assignment form (SSA), which has a simple concept: Never assign a variable (virtual register) twice. This makes it particularly easy for a compiler to detect unused values and perform other optimizations. Here is an example to see LLVM IR and SSA in action:

int fibo(int n) {
if (n <= 1) { return n; }
return fibo(n - 1) + fibo(n - 2);
}

Clang 18.1.0 produces the following IR for this [CE]:

define dso_local noundef i32 @fibo(int)(i32 noundef %n) {
entry:
%retval = alloca i32, align 4
%n.addr = alloca i32, align 4
store i32 %n, ptr %n.addr, align 4
%0 = load i32, ptr %n.addr, align 4
%cmp = icmp sle i32 %0, 1
br i1 %cmp, label %if.then, label %if.end

if.then: ; preds = %entry
%1 = load i32, ptr %n.addr, align 4
store i32 %1, ptr %retval, align 4
br label %return

if.end: ; preds = %entry
%2 = load i32, ptr %n.addr, align 4
%sub = sub nsw i32 %2, 1
%call = call noundef i32 @fibo(int)(i32 noundef %sub)
%3 = load i32, ptr %n.addr, align 4
%sub1 = sub nsw i32 %3, 2
%call2 = call noundef i32 @fibo(int)(i32 noundef %sub1)
%add = add nsw i32 %call, %call2
store i32 %add, ptr %retval, align 4
br label %return

return: ; preds = %if.end, %if.then
%4 = load i32, ptr %retval, align 4
ret i32 %4
}

declare void @llvm.dbg.declare(metadata, metadata, metadata) #1

This is a simple function that returns the nth fibonacci number with n as input. As you can see, the IR is structured in hierarchical order. The outermost element is the IR module, which is usually a source file / compile unit. Modules can contain one or several functions, which take arguments and produce a result value, similar to higher level languages. Functions consist of one or more so called basic blocks which in turn contain one or multiple IR instructions. Basic blocks are executed linearly and have to end with a terminator instruction, i.e. a branch instruction (br) aka jump, a return instruction or similar. Terminator instructions in the middle of a basic block are not allowed and would make the IR ill-formed.

IR2Vec: A Flow Analysis based Scalable Infrastructure for Program Encodings — Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/Building-blocks-of-LLVM-IR_fig1_335833400 [accessed 31 Mar, 2024]

Import instructions in LLVM IR are:

  • alloca:
    Allocates a variable in the stack frame of the function. The number of bytes to allocate corresponds to the size of the given type. The resulting virtual register is of type ptr, which means it is an opaque pointer to the variable on the stack.
  • store:
    Writes a value to a virtual register, which identifies a location in memory. The value to store can be either an immediate or another virtual register.
  • load:
    Inverse operation to store. Reads a value from a memory location and puts it into a virtual register.
  • br:
    Jumps to another basic block unconditionally or branch to one of two basic blocks, depending on the value of a given condition.
  • call:
    Calls another function by optionally passing arguments. The return value of the callee will end up in another virtual register.

A full list of instructions can be found in the longish LLVM LangRef.

Exercise

LLVM offers a simple to use C++ API to generate LLVM IR. To generate IR for a compile unit, you need three global resources:

  • llvm::LLVMContext context
  • llvm::IRBuilder<> builder = llvm::IRBuilder<>(context)
  • llvm::Module* llvmModule

Create the two files IRGenerator.h and IRGenerator.cpp in the subdirectory src/irgenerator/. As usual, create a class that inherits from CompilerPass and ASTVisitor called IRGenerator and initialize the global LLVM resources in the constructor. Create three helper functions:

  • llvm::Value* insertAlloca(llvm::Type*, const std::string&)
  • llvm::Function* getPrintfFct()
  • llvm::Function* getFunction(const char*, llvm::Type*, llvm::ArrayRef<llvm::Type*>, bool)

The API is pretty intuitive. As we have a linear control flow in our toy language and do not support functions, we only need the main function, that will be called by the operating system in order to run your program. For that, you need some function setup code in the visitEntry method. For the rest of the visit methods, you can use the builder to create IR instructions, which should be pretty self-explanatory. For the visitPrintCall method you need the printf function from the C standard library. If you declare the function using the LLVM API, the linker will lookup the function for you later on and it should work out of the box. Here is the finished implementation.

Continue reading the next article

--

--

Marc Auberer

C++ Compiler Developer at SAP. Passionate software developer and OpenSource enthusiast.