Notes

LLVM IR

LLVM IR can be represented in three forms:

  • a compact binary form, called bitcode (.bc)
  • a human-readable textual representation (.ll)
  • an in-memory representation built from C++ objects

The IR can be compiled to any supported architecture, and any analysis that LLVM supports can be run on it.

Identifiers

Two types of identifiers:

  • @ for global identifiers (function names, global variables)
  • % for local identifiers (register names, types)

Identifiers can be either named or unnamed (represented by an unsigned numeric value).
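
A minimal sketch of both identifier kinds, assuming a recent LLVM with opaque pointers (ptr), as in the example later in these notes. The names @counter, @next and %step are made up for illustration:

```llvm
@counter = global i32 0                  ; named global variable

define i32 @next(i32 %step) {            ; @next: named global, %step: named local
  ; The unlabeled entry block implicitly takes the number %0,
  ; so the first unnamed register is %1.
  %1 = load i32, ptr @counter, align 4   ; unnamed local
  %sum = add i32 %1, %step               ; named local
  store i32 %sum, ptr @counter, align 4
  ret i32 %sum
}
```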

note

Constants use a different syntax depending on their data type.
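
A few constants of different types, written as global constants so that the snippet is a complete module. All global names here are hypothetical, and ptr assumes opaque pointers:

```llvm
@answer   = constant i32 42                          ; integer
@ratio    = constant double 1.25                     ; floating point
@enabled  = constant i1 true                         ; boolean (i1)
@nothing  = constant ptr null                        ; null pointer
@greeting = constant [6 x i8] c"hello\00"            ; character array
@origin   = constant { i32, i32 } zeroinitializer    ; all-zero aggregate
@point    = constant { i32, i32 } { i32 1, i32 2 }   ; struct constant
@primes   = constant [3 x i32] [i32 2, i32 3, i32 5] ; array constant
```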

Function

A function definition consists of a header and a body. The header specifies the function's name, its return type, the number and types of its parameters, and its attributes. Attributes communicate additional information about a function (for example, to the optimizer); they are not meant to change what the function computes.
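
In the textual form, attributes often appear as a reference such as #0 to an attribute group defined elsewhere in the module. A small sketch, where the function @get_answer and the choice of nounwind are only illustrative:

```llvm
; Hypothetical function; #0 refers to the attribute group defined below.
define i32 @get_answer() #0 {
  ret i32 42
}

; nounwind: the function does not unwind (throw) through its callers.
attributes #0 = { nounwind }
```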

The body of a function is made up of one or more basic blocks, which form a control flow graph.

A basic block starts with a label (its name), which may be implicit, as for the entry block in the example below. The label can be used to reference the block (e.g. in a phi instruction or a branch instruction). The body of a basic block is a contiguous sequence of instructions without branching.

The last instruction in a basic block must be a terminator: for example, an instruction that explicitly transfers control flow to a different basic block (a branch instruction) or one that exits the function (a return instruction). The instructions inside each basic block operate on values in virtual registers, or move values between registers and memory.

Each instruction produces at most one result, which is always assigned to a new virtual register that is local to the function.

Example

Given a simple C function:

int max(int a, int b) {
  if (a > b)
    return a;
  else
    return b;
}

The corresponding LLVM IR (clang -S -emit-llvm):

define dso_local i32 @max(i32 noundef %0, i32 noundef %1) #0 {
  %3 = alloca i32, align 4
  %4 = alloca i32, align 4
  %5 = alloca i32, align 4
  store i32 %0, ptr %4, align 4
  store i32 %1, ptr %5, align 4
  %6 = load i32, ptr %4, align 4
  %7 = load i32, ptr %5, align 4
  %8 = icmp sgt i32 %6, %7
  br i1 %8, label %9, label %11

9:                                                ; preds = %2
  %10 = load i32, ptr %4, align 4
  store i32 %10, ptr %3, align 4
  br label %13

11:                                               ; preds = %2
  %12 = load i32, ptr %5, align 4
  store i32 %12, ptr %3, align 4
  br label %13

13:                                               ; preds = %11, %9
  %14 = load i32, ptr %3, align 4
  ret i32 %14
}

SSA

Static single assignment (SSA) form is a type of IR where each variable is assigned exactly once.

note

Constants are a special case: they are used directly as operands and are never assigned to a register.

Virtual registers are in SSA form, as they are assigned (defined) only once. Within a module, only virtual registers are in SSA form, so the module as a whole is in partial SSA form. Because a register cannot be reassigned, registers alone are a suboptimal representation of mutable local variables. To solve this problem, LLVM uses the alloca instruction. alloca allocates separate stack space to store address-taken variables and returns a pointer to it. Both the pointer and the value inside the allocated memory can be operated on, the latter through load and store instructions. Values in memory are not in SSA form.

SSA requires phi instructions. They are placed at the beginning of a basic block and merge values coming from the different basic blocks that are predecessors of the current one. A phi node creates a new register whose value depends on which of the predecessors transferred control to this block.

; %15 is 10 when control arrives from %first_bb, or %20 when it arrives from %second_bb
%15 = phi i64 [ 10, %first_bb ], [ %20, %second_bb ]
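
As an illustration, the max function from the earlier example could be written without alloca, load or store by merging the two candidate values with a phi. This is a hand-written sketch with made-up block and register names, not the exact output of any particular optimization pass:

```llvm
define i32 @max_phi(i32 %a, i32 %b) {
entry:
  %cmp = icmp sgt i32 %a, %b
  br i1 %cmp, label %then, label %else

then:                                   ; taken when a > b
  br label %merge

else:                                   ; taken when a <= b
  br label %merge

merge:
  ; %result is %a if control came from %then, %b if it came from %else
  %result = phi i32 [ %a, %then ], [ %b, %else ]
  ret i32 %result
}
```

Every register here is defined exactly once, so the whole function is in SSA form.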

Types

LLVM IR is typed. Besides basic types (integers such as i32, floating-point types such as float, pointers), LLVM also supports more complex types such as arrays (for example [10 x i32]) and structures (used for C structs or C++ classes).

; Struct type
%type = type { i32, float, %another_type }

Elements of structures held in registers are accessed with the insertvalue and extractvalue instructions, which take at least one constant integer index as an argument. The index selects the member field of the structure (0 is the first member field, 1 the second, etc.). The following example retrieves the second member of a structure held in register %12:

%second = extractvalue { i32, i32 } %12, 1
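
A slightly larger sketch that uses both instructions to swap the two fields of a pair held in a register. The function name @swap and the register names are hypothetical; undef is used as the initial, not yet filled aggregate:

```llvm
define { i32, i32 } @swap({ i32, i32 } %pair) {
entry:
  %first   = extractvalue { i32, i32 } %pair, 0
  %second  = extractvalue { i32, i32 } %pair, 1
  ; insertvalue does not modify its operand; it yields a new aggregate value
  %tmp     = insertvalue { i32, i32 } undef, i32 %second, 0
  %swapped = insertvalue { i32, i32 } %tmp, i32 %first, 1
  ret { i32, i32 } %swapped
}
```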

Similarly, to access an element of a structure in memory, getelementptr (GEP) is used to compute the address of the field, combined with load and store instructions to read or write it.
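
A minimal sketch of reading the second field of a structure in memory, assuming opaque pointers (ptr). The type %pair and the function @second_of are made up for illustration:

```llvm
%pair = type { i32, i32 }

define i32 @second_of(ptr %p) {
entry:
  ; The first index (0) steps through the pointer itself,
  ; the second index (1) selects the second field of the struct.
  %field_ptr = getelementptr inbounds %pair, ptr %p, i32 0, i32 1
  %value = load i32, ptr %field_ptr, align 4
  ret i32 %value
}
```

Note that getelementptr only computes an address; the memory is accessed by the load that follows.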

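In older LLVM versions with typed pointers, a pointer could be cast to a different pointer type using the bitcast instruction. With opaque pointers (ptr, as in the example above), all pointers in the same address space share one type, so such casts are not needed. The inttoptr and ptrtoint instructions convert between pointers and integer types. Casting an integer type to an integer type of a different size is done by zext (zero extend), sext (sign extend) and trunc (truncate), depending on the sizes of the source and target types and on whether the value should be treated as signed.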
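
The conversions above can be sketched as follows, assuming opaque pointers and a 64-bit target for the pointer-sized integer. The function @casts and its register names are hypothetical:

```llvm
define i64 @casts(i8 %x, ptr %p) {
entry:
  %z = zext i8 %x to i32          ; pads with zero bits: i8 0xFF becomes 255
  %s = sext i8 %x to i32          ; copies the sign bit: i8 0xFF (-1) stays -1
  %t = trunc i32 %s to i16        ; keeps only the low 16 bits
  %addr = ptrtoint ptr %p to i64  ; pointer to integer
  %q = inttoptr i64 %addr to ptr  ; integer back to pointer
  ret i64 %addr
}
```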