LLVM IR
LLVM IR can be represented in three forms:
- a compact binary form called bitcode (.bc)
- a human-readable textual representation (.ll)
- in memory, as C++ objects
The IR can be compiled to any architecture that LLVM supports, and any analysis that LLVM provides can be run on it.
Identifiers
There are two types of identifiers:
- @ for global identifiers (function names, global variables)
- % for local identifiers (register names, types)
Identifiers can be either named or unnamed (represented by an unsigned numeric value).
note
Constant values use a different syntax depending on the data type of the constant.
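The identifier kinds and constant syntaxes described above can be seen together in a small hand-written module. This is a minimal sketch: the names @counter, @greeting and @scale are made up for illustration and do not come from any real program.

```llvm
@counter = global i32 0                  ; named global variable, integer constant 0
@greeting = constant [3 x i8] c"hi\00"   ; array constant written as a string

define float @scale(float %x) {          ; @scale is global, %x is a named local
  %1 = fmul float %x, 2.0                ; %1 is an unnamed local, 2.0 is a float constant
  ret float %1
}
```

The first unnamed value is %1 rather than %0 because the unnamed entry block itself takes number 0.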
Function
A function definition consists of a header and a body. The header specifies the function's name, its return type, the number and types of its parameters, and its attributes. Attributes communicate additional information about a function; they do not affect its semantics.
The body of a function is made up of one or more basic blocks, which together form a control flow graph.
A basic block starts with a label (its name). These labels can be used to
reference the block (e.g. in a phi instruction or in branching instructions). The
body of a basic block is a contiguous sequence of instructions without
branching.
The last instruction in a basic block must be a terminator. A terminator is, for example, an instruction that explicitly transfers control flow to a different basic block (a branching instruction) or one that exits the function (a return instruction). The instructions inside each basic block operate on values in virtual registers, or they move values between registers and memory.
Each instruction produces at most one result value, which is always assigned to a new virtual register that exists for the duration of the function's execution.
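To make the parts of a function concrete, here is a small hand-written sketch (not compiler output). The function name @abs_value, the block labels and the attribute group contents are assumptions chosen for illustration.

```llvm
; Header: return type i32, name @abs_value, one i32 parameter, attribute group #0.
define i32 @abs_value(i32 %x) #0 {
entry:                                   ; label of the first basic block
  %is_negative = icmp slt i32 %x, 0
  br i1 %is_negative, label %negate, label %done   ; terminator: conditional branch

negate:
  %negated = sub i32 0, %x
  ret i32 %negated                       ; terminator: return

done:
  ret i32 %x                             ; terminator: return
}

; Attribute group referenced by #0 in the header.
attributes #0 = { nounwind }
```

Each block ends with exactly one terminator, and every instruction that produces a value defines a fresh register.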
Example
Given a simple C function:
int max(int a, int b) {
    if (a > b)
        return a;
    else
        return b;
}
The corresponding LLVM IR (clang -S -emit-llvm):
define dso_local i32 @max(i32 noundef %0, i32 noundef %1) #0 {
  %3 = alloca i32, align 4
  %4 = alloca i32, align 4
  %5 = alloca i32, align 4
  store i32 %0, ptr %4, align 4
  store i32 %1, ptr %5, align 4
  %6 = load i32, ptr %4, align 4
  %7 = load i32, ptr %5, align 4
  %8 = icmp sgt i32 %6, %7
  br i1 %8, label %9, label %11

9:                                                ; preds = %2
  %10 = load i32, ptr %4, align 4
  store i32 %10, ptr %3, align 4
  br label %13

11:                                               ; preds = %2
  %12 = load i32, ptr %5, align 4
  store i32 %12, ptr %3, align 4
  br label %13

13:                                               ; preds = %11, %9
  %14 = load i32, ptr %3, align 4
  ret i32 %14
}
SSA
Static single assignment (SSA) form is a type of IR where each variable is assigned exactly once.
note
Constants are a special case.
Virtual registers are in SSA form, as they are assigned (defined) only
once. Within a module, only virtual registers are in SSA form; memory is not,
which means the module as a whole is in partial SSA form.
Because a register cannot be reassigned, registers alone are a suboptimal
representation of local variables, which may be modified or have their address
taken. To solve this problem, LLVM uses the alloca instruction. alloca reserves
separate space on the stack to store such variables and returns a pointer to it.
It is possible to operate on this pointer as well as on the value inside
the allocated memory itself (via the load and store instructions).
These values in memory are not in SSA form.
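As an illustration, a local variable that is modified twice can be kept in an alloca slot. This is a hand-written sketch; the function name @increment_twice and the register names are hypothetical.

```llvm
define i32 @increment_twice(i32 %start) {
entry:
  %x = alloca i32, align 4               ; stack slot for the mutable variable
  store i32 %start, ptr %x, align 4      ; x = start
  %v0 = load i32, ptr %x, align 4
  %v1 = add i32 %v0, 1
  store i32 %v1, ptr %x, align 4         ; x = x + 1
  %v2 = load i32, ptr %x, align 4
  %v3 = add i32 %v2, 1
  store i32 %v3, ptr %x, align 4         ; x = x + 1 again
  %result = load i32, ptr %x, align 4
  ret i32 %result
}
```

Every register (%v0, %v1, and so on) is still defined exactly once, while the memory behind %x is written three times.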
SSA requires phi instructions. These instructions are placed at the beginning
of a basic block to merge values coming from the basic blocks that are
predecessors of the current one. A phi node creates a new register whose
value depends on which of the predecessors transferred control to this block.
; %15 = 10 || %15 = %20
%15 = phi i64 [ 10, %first_bb ], [ %20, %second_bb ]
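For comparison with the alloca-based example earlier, the max function could be written by hand without any memory, using a phi instruction to pick the result. This is a sketch for illustration only, with made-up label and register names; it is not the output of any particular optimization pass.

```llvm
define i32 @max(i32 %a, i32 %b) {
entry:
  %cmp = icmp sgt i32 %a, %b
  br i1 %cmp, label %then, label %else

then:
  br label %end

else:
  br label %end

end:
  ; %result is %a when control arrives from %then, %b when it arrives from %else
  %result = phi i32 [ %a, %then ], [ %b, %else ]
  ret i32 %result
}
```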
Types
LLVM IR is typed. Besides basic types, LLVM also supports more complex
types such as arrays ([10 x i32]), structures (C structs or C++ classes), etc.
; Struct type
%type = type { i32, float, %another_type }
Elements of structures held in registers are
accessed using the insertvalue and extractvalue instructions, which take
at least one integer index as an argument. The index
specifies the position of the member field in the structure (0 is the
first member field, 1 the second, etc.).
The following example retrieves the second member of a structure held in
register %12:
%second = extractvalue { i32, i32 } %12, 1
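The counterpart, insertvalue, returns a new aggregate with one field replaced. A minimal hand-written sketch follows, assuming a hypothetical function @make_pair and starting from an undef aggregate:

```llvm
define { i32, i32 } @make_pair(i32 %a, i32 %b) {
  %p0 = insertvalue { i32, i32 } undef, i32 %a, 0   ; set field 0
  %p1 = insertvalue { i32, i32 } %p0, i32 %b, 1     ; set field 1
  ret { i32, i32 } %p1
}
```

Since registers are in SSA form, each insertvalue defines a new register instead of modifying the old aggregate.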
Similarly, to access an element of a structure in memory, getelementptr (GEP) is used to compute the element's address, combined with load and store instructions to read or write it.
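A short hand-written sketch of the in-memory case follows. The type name %pair and the function @load_second are assumptions made for this example.

```llvm
%pair = type { i32, i32 }

define i32 @load_second(ptr %p) {
  ; First index 0 selects the struct %p points to, second index 1 selects its second field.
  %field = getelementptr inbounds %pair, ptr %p, i32 0, i32 1
  %value = load i32, ptr %field, align 4
  ret i32 %value
}
```

getelementptr only computes an address; the memory access itself happens in the load.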
A pointer type can be cast to a different pointer type using
the bitcast instruction. With opaque pointers (the ptr type used in the
example above), pointers no longer carry a pointee type, so such casts are
only needed in older IR with typed pointers. The inttoptr and ptrtoint
instructions convert between pointers and integer types. Casting between
integer types of different sizes is done with zext (zero extend),
sext (sign extend) and trunc, depending on the size and signedness
of the source and target types.
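The conversions can be sketched in one hand-written function. The name @casts is hypothetical, and the i64 width for the pointer round trip assumes a target with 64-bit pointers.

```llvm
define i64 @casts(i8 %b, ptr %p) {
  %z = zext i8 %b to i32          ; zero extend: i8 -1 (0xFF) becomes 255
  %s = sext i8 %b to i32          ; sign extend: i8 -1 stays -1
  %t = trunc i32 %s to i16        ; truncate: keep the low 16 bits
  %addr = ptrtoint ptr %p to i64  ; pointer to integer
  %q = inttoptr i64 %addr to ptr  ; integer back to pointer
  ret i64 %addr
}
```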