18 Sep

Adding type identifiers (MPC -> LLVM for the Neil Language #3)

In my previous post cleaning up the parser, I had rationalised the parser for my custom language I’m calling Neil – Not Exactly an Intermediate Language.

The next step in making my language more useful and expression is to add the ability to arbitrarily create variables within a function, store an initial value to the variable, and then be able to update the variable.

Previously, our language could handle:

i32 foo(i32 x) {
  return x * 5 + 42;
}
i32 main() {
  return foo(13);
}

Whereas now we want it to be able to handle:

i32 foo(i32 x) {
  i32 y = x * 5;
  y = y + 42;
  return y;
}
i32 main() {
  return foo(13);
}

In GitHub Pull Request #1 I’ve submitted the patch of changes required to get this to work, but lets cover the steps that were required to do this!

MPC Grammar Changes

We firstly need to modify our MPC grammar to accept type declarations within the body of a function. Our functions (procedure in the grammar) are made up of zero or more statements. We modify the stmt definition in the grammar to be:

 stmt : \"return\" <lexp>? ';' 
      | <ident> '(' <ident>? (',' <ident>)* ')' ';'
      | <typeident> ('=' <lexp>)? ';' 
      | <ident> '=' <lexp> ';' ;

As shown in red, we’ve added two new statement variants:

  • a type identifier followed by an optional initializer
  • assigning into an existing identifier

With these two changes, we can represent all the fun variable shenanigans we could imagine.

AST Lowering

We need to handle two extra cases in our lower_statement method now – lowering type definitions within functions, and assignments into already defined identifiers.

To detect the type identifier case, we simply:

  • add some logic to check if the first child of a statement AST node is a typeident
  • the first child of a type identifier is the type, which we lookup in the type table
  • the second child of a type identifier is the name of the variable, which we’ll use to assign an entry in the symbol table
  • we then use LLVM’s stack allocation instruction alloca to create memory for our variable to reside within
  • and lastly store into the symbol table, using the variable name we already parsed, that the name maps to the alloca’ed variable

We then need to check whether the variable had an initializer or not:

  • if the statement began with a type identifier AST node, and had exactly two children, we know that there was no initializer (so can skip the following logic)
  • otherwise, the statement must have exactly four children (a type identifier, an ‘=’ character, an expression, and lastly a ‘;’ character)
  • we use the lower_ast_node method to lower the expression, passing in the type we deduced from the type identifier
  • and lastly use LLVM’s store instruction to assign the value of the expression into the memory backing the variable

To detect when we want to assign into an existing variable:

  • we know a variable declaration is an identifier, followed by an ‘=’ character, followed by an expression, and lastly a ‘;’ character
  • we then lookup the variable in the symbol table
  • we use the lower_ast_node method to lower the expression on the right-hand-side of the ‘=’ character, passing in the type of the type identifier (we use LLVMTypeOf on the value to deduce the type)
  • and lastly we use LLVM’s store instruction to assign the value of the expression into the memory backing the variable

Easy!

Modifying Expression Lowering

We need to slightly modify expression lowering to. Previously all values in the symbol table were directly correlated to the type used to declare the value in the source language. For example, a function parameter with type i32 would have the LLVM type i32 to match it. Now that we have function local variables that have a stack allocation backing them, we need to deduce when we find an identifier, if it was an identifier that came from a function local variable, or an identifier that came from a function parameter. In lower_ident, we now have an extra check to see if the value in the symbol table is an alloca instruction, EG. it requires to be loaded before we can use it in a further expression. If we find that it is an alloca, we simply use LLVM’s load instruction on it, and return this new LLVM value instead of the original value in the symbol table.

Remembering the Symbols to Murder

When we close a function, we would loop through the named parameters of the function and remove them from the symbol table. These symbols were only usable within the body of the function so we had to ensure we didn’t accidently extend their lifetime beyond the function close. This also now applies to variables declared within a function too! Now, to add a little complexity, since variables can be declared in any statement of a function, we need to keep a list of symbols that need to be destroyed when the function closes. We use a new NeilVector (at present just a wrapper around std::vector) to keep track of which symbols we need to destroy when we close the function. When calling lower_statement and having detected a new type identifier definition, we push back the name of the symbol that we need to cleanup onto our symbol cleanup vector, and then when the function closes we just loop through this vector, and erase the corresponding symbol in the symbol table.

What is Next?

So now we’ve added function local variables to the language, it is about time that we tackled a big lacking feature – branching! In the next post we’ll cover how we handle adding if statements to the language. Stay tuned!