01 Jun

So long Codeplay, thanks for all the fish!

After 11 years since my first internship, and 9 years full time, I’m leaving Codeplay to join AMD.

A Brief History of Codeplay and Me

I was a nineteen year old ned when I started my first internship at Codeplay, and was given the code-monkey job of hand-optimizing shaders for an OpenGL ES 2.0 shader compiler, so that Codeplay’s compiler engineers could prove the capability of the compiler. Slated to take 10 weeks, I finished the task in 2 weeks, and to stave off boredom in the remaining 8 weeks I ended up proving 100% code coverage of the compiler and writing a test generator to break it.

From these humble beginnings I came back for a second internship where I rewrote the Testplay system so that it’d work on Windows & Linux.

When I joined full time, it was with the promise that I’d be doing PlayStation 3 development. I shipped my first title (NASCAR The Game: 2011) and got to make some SPUs very happy in the process. I felt at home doing games technology, but Codeplay’s market drifted from primarily games to driver development.

For 5 years I shipped ~7 OpenCL implementations in the mobile & embedded space, becoming an expert in integrating and optimizing LLVM in the process.

Then Vulkan came around, and I knew here was the thing I had been waiting for. Having worked for so long on OpenCL, the deep-seated flaws in the API were very clear to me, and I wanted to be sure that Vulkan did not make the same mistakes. After joining Khronos, I proceeded to get Vulkan 1.0 out the door, then focused primarily on getting the Vulkan 1.1 subgroup functionality shipped and into developers’ hands.

AMD’s Enticing Offer

With my history in Vulkan & SPIR-V, and my LLVM & driver experience – AMD enticed me with an awesome job. I’d get to spend 100% of my time doing Vulkan & SPIR-V, I’d get to work on the Vulkan driver, I’d get to use my skills in LLVM, and I’d still be able to work within Khronos to get extensions and functionality that I know developers will love.

The entire brief was summed up by Rys as ‘Make things fast anywhere you want!’.

The icing on the cake was the opportunity to work for Rys Sommefeldt and closely with Timothy Lottes – two people that are legends in the industry to me.

What more was there to say other than ‘Sign me the fuck up!’

TL;DR

So today, the 1st of June 2018, is my last day at Codeplay. I leave behind many friends and co-workers that I’ll naturally miss – I wish you all the best in your continuing careers!

And on the 18th June 2018 I’ll start the next chapter in my life at AMD. Super excited doesn’t begin to cover it.

11 May

1.22x Performance Improvements in json.h!

I’ve been meaning to try out Intel’s VTune now that it has a permanent trial mode (you can renew the 30-day trial license every 30 days without paying!), as I’ve long heard people rave about it as a profiling aid. I decided to pick one of my open source C libraries and see what performance problems I could find in the code, and I chose my JSON parser json.h.

I’ve long known that the recursive approach I used in the parser will limit the performance I can get out of the code as compared to some of my competitors, but I support a metric ton of extra functionality (simplified JSON and JSON5 to name two major extensions that my users love). That being said, while I wasn’t in the mood to replace my recursive strategy with a faster looping variant, I did want to see if there were any easy wins to be had in the code.

I decided to use the large citylots.json test file as my benchmark file, and parsed the file twice just to ensure enough work was done to get some good sampling results.

The initial run took 3.106s, averaging 116.5 MB/s throughput. With that baseline, I dived into VTune.

Initial VTune Results

So I compiled my JSON library using CMake’s RelWithDebInfo (release with debug info) so that PDBs were generated from Visual Studio and I could get meaningful profiling information.

Then I started VTune, added an Analysis Target pointing at my executable, and not knowing a great deal about VTune decided to stick with an Analysis Type of Basic Hotspots and see where that took me.

So VTune says that my CPU time was 3.061s (which closely matches what I recorded outwith VTune), and has noted the most costly functions in terms of execution time. So I went to the source of the most expensive function there, looked at the assembly view that VTune gives you, and found something bad, something very bad!

static void json_parse_number(
    struct json_parse_state_s *state,
    struct json_number_s *number) {
  size_t size = 0;
  size_t end = 0;

  while (state->offset < state->size && end == 0) {
    switch (state->src[state->offset]) {
    case '0':
    case '1':
    case '2':
    case '3':
    case '4':
    case '5':
    case '6':
    case '7':
    case '8':
    case '9':
    case '.':
    case 'e':
    case 'E':
    case '+':
    case '-':
      state->data[size++] = state->src[state->offset++];
      break;
    default:
      end = 1;
      break;
    }
  }

  ...
}

Can you spot the heinous crime I have committed? Looking at the assembly that VTune captured for the switch statement above made it obvious.

It’s not apparent from the source, but every execution of that switch is actually performing two loads, every iteration of the loop. Why? Because the compiler cannot prove that the write through state->data doesn’t alias state->offset, so it cannot simply cache the value of state->offset in a register – it has to reload it from memory every time!

This is the kind of dumb optimization that is so easy to miss, but so painful for performance when you do. I reworked the code above as follows:

static void json_parse_number(
    struct json_parse_state_s *state,
    struct json_number_s *number) {
  size_t offset = state->offset;
  const size_t initialSize = state->size;
  size_t size = 0;
  size_t end = 0;

  while (offset < initialSize && end == 0) {
    switch (state->src[offset]) {
    case '0':
    case '1':
    case '2':
    case '3':
    case '4':
    case '5':
    case '6':
    case '7':
    case '8':
    case '9':
    case '.':
    case 'e':
    case 'E':
    case '+':
    case '-':
      state->data[size++] = state->src[offset++];
      break;
    default:
      end = 1;
      break;
    }
  }

  ...

  // Update state->offset at end!
  state->offset = offset;
}

And this change reduced the time in json_parse_number from 0.670s to 0.422s – 1.59x faster!

Using The Same Trick Elsewhere

Given the big performance win that helping the compiler realise state can live in registers got me, I decided to propagate the change into some other JSON functions that had tight inner loops that would most benefit from the same improvement (see the sketch after this list):

  • json_get_number_size from 0.355s to 0.184s – 1.93x faster
  • json_get_string_size from 0.175s to 0.115s – 1.52x faster
  • json_parse_string from 0.191s to 0.137s – 1.39x faster
  • json_skip_whitespace from 0.473s to 0.354s – 1.34x faster
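
To illustrate, here’s roughly what the trick looks like when applied to a whitespace skipper. This is an illustrative sketch rather than the actual json.h code, but the shape is the same:

static int json_skip_whitespace(struct json_parse_state_s *state) {
  size_t offset = state->offset;
  const size_t size = state->size;
  const char *const src = state->src;
  const size_t initial_offset = offset;

  // offset, size and src now live in registers for the whole loop
  while (offset < size) {
    switch (src[offset]) {
    default:
      goto finished;
    case ' ':
    case '\t':
    case '\r':
    case '\n':
      offset++;
      break;
    }
  }

finished:
  // write the cached offset back to the state exactly once
  state->offset = offset;

  // report whether any whitespace was consumed
  return initial_offset != offset;
}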

Nice little win with a single optimization trick right there.

Two Loops Are Better Than One

The next function that caused me concern was json_skip_all_skippables. This jankily named function will skip whitespace and C-style comments if the user has enabled the C-style comment optional extension.

static int json_skip_all_skippables(
    struct json_parse_state_s *state) {
  int did_consume = 0;
  do {
    if (state->offset == state->size) {
      state->error = json_parse_error_premature_end_of_buffer;
      return 1;
    }

    did_consume = json_skip_whitespace(state);

    if (json_parse_flags_allow_c_style_comments &
        state->flags_bitset) {
      if (state->offset == state->size) {
        state->error = json_parse_error_premature_end_of_buffer;
        return 1;
      }

      did_consume |= json_skip_c_style_comments(state);
    }
  } while (0 != did_consume);

  if (state->offset == state->size) {
    state->error = json_parse_error_premature_end_of_buffer;
    return 1;
  }

  return 0;
}

When I checked in VTune I noticed that the if branch within the do/while loop was taking more time than it should, and I suddenly realised – the compiler was reloading state->flags_bitset on every iteration of the loop, because it does not know that this value will never change during parsing! Big doh moment here. To fix this, I moved the if statement outside the loop, and instead had two separate loops – one that handles C-style comments and one that does not (sketched below). This simple change meant that json_skip_all_skippables went from 0.289s to 0.257s – a 1.12x speedup!
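
A sketch of the reworked shape (again illustrative rather than the exact json.h code – the real version also benefits from the register-caching trick above):

static int json_skip_all_skippables(
    struct json_parse_state_s *state) {
  int did_consume = 0;

  if (json_parse_flags_allow_c_style_comments &
      state->flags_bitset) {
    // C-style comments enabled: skip whitespace and comments
    do {
      if (state->offset == state->size) {
        state->error = json_parse_error_premature_end_of_buffer;
        return 1;
      }

      did_consume = json_skip_whitespace(state);

      if (state->offset == state->size) {
        state->error = json_parse_error_premature_end_of_buffer;
        return 1;
      }

      did_consume |= json_skip_c_style_comments(state);
    } while (0 != did_consume);
  } else {
    // comments disabled: the flags check is gone from the loop
    do {
      if (state->offset == state->size) {
        state->error = json_parse_error_premature_end_of_buffer;
        return 1;
      }

      did_consume = json_skip_whitespace(state);
    } while (0 != did_consume);
  }

  if (state->offset == state->size) {
    state->error = json_parse_error_premature_end_of_buffer;
    return 1;
  }

  return 0;
}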

Help a Branch Out

Sometimes branches can confuse a compiler, and json_parse_number had a doozy of a branch.

static void json_parse_number(
    struct json_parse_state_s *state,
    struct json_number_s *number) {
  size_t offset = state->offset;
  const size_t initialSize = state->size;
  size_t size = 0;
  size_t end = 0;

  while (offset < initialSize && end == 0) {
    switch (state->src[offset]) {
      ...
    default:
      end = 1;
      break;
    }
  }

  ...
}

The default case should break out of both the switch and the while loop when it is hit, so to do this I set end to 1, and then checked end == 0 in the loop condition. The compiler wasn’t smart enough to skip the end == 0 check on the first iteration for starters, and then it really managed to confuse itself on subsequent iterations. To fix this, I moved the end variable into the loop, and after the switch statement I check whether we should end and break out of the main loop.

static void json_parse_number(
    struct json_parse_state_s *state,
    struct json_number_s *number) {
  size_t offset = state->offset;
  const size_t initialSize = state->size;
  size_t size = 0;

  while (offset < initialSize) {
    int end = 0;
    switch (state->src[offset]) {
      ...
    default:
      end = 1;
      break;
    }

    if (0 != end) {
      break;
    }
  }

  ...
}

This subtle change meant much better generated code, and shaved a couple of milliseconds off of json_parse_number!

Remove a Useless Ternary

In json_parse_string I support quoting strings with either single (') or double (") quotes – depending on an extension flag. The code was:

size_t size = 0;
int is_single_quote = '\'' == state->src[state->offset];

string->string = state->data;

// skip leading '"' or '\''
state->offset++;

while (state->offset < state->size && (is_single_quote ? '\'' : '"') != state->src[state->offset]) {
  ...
}

The optimizer got very confused by the is_single_quote ternary being in the while loop condition. Given that the quote character won’t be changing from iteration to iteration of the loop, I made the simple change of storing the quote to compare against before the loop instead:

const char quote_to_use = '\'' == state->src[offset] ?
    '\'' : '"';

string->string = state->data;

// skip leading '"' or '\''
offset++;

do {
  ...
} while (offset < size &&
    (quote_to_use != state->src[offset]));

I also noticed that once a string begins, the loop body will always execute at least once (otherwise, when we earlier called json_get_string_size, it would have exited with an error for the user). This meant that I could change the while loop to a do/while loop, which aids the codegen too. These changes shaved another few milliseconds off of the execution time.

Cache More Things!

I took stock of where I had got to, and decided to do a second sweep of each of the most costly functions to see if there was something I had missed – and of course there was. Previously I had cached just the offset and size from the state object I pass between functions, but I soon realised that I should be caching the src and data pointers too. This helped explain to the compiler that the pointers wouldn’t be changing during the body of the functions, and that their values could remain in registers for the entire function! I also cached flags_bitset when it was used more than once in the body of a function – again to stop it running away to memory and back for a constant value that would never change.

static void json_parse_number(struct json_parse_state_s *state,
                              struct json_number_s *number) {
  const size_t flags_bitset = state->flags_bitset;
  size_t offset = state->offset;
  const size_t size = state->size;
  size_t bytes_written = 0;
  const char *const src = state->src;
  char *data = state->data;

  ...
}

This meant there was a lot more constant state at the start of my functions, but fewer unnecessary memory accesses during the functions themselves.

One Evening Optimizing Got Us Where?

So after all the changes above, I reduced the total time taken for the citylots.json file from 3.106s to 2.548s – a nice 1.22x faster than before! Throughput went from 116.5 MB/s to 142.1 MB/s as a result.

I still don’t understand a ton of what VTune offers so I’ll have to keep tinkering, but I’m happy that it helped me identify some easy wins in my JSON parser anyway. The code for the upgrades is in two commits c4b0b90 and e38fd53 if you want to take a look.

05 Apr

Optimizing LLVM IR from the C API

After my previous post on how to read & write LLVM bitcode, I thought I’d follow it up with a post on actually modifying LLVM bitcode files after you’ve read them. LLVM comes with extensive built-in optimization passes, but there is also plenty of scope to do your own optimizations.

First off, remember that when we parse an LLVM bitcode file, we get an LLVM module. So what exactly is an LLVM module?

An LLVM module is basically a collection of global variables and functions. Functions contain basic-blocks – straight-line runs of instructions. The simplest way to think about a basic-block is as a scope in a C/C++ file:

int foo(int a, int b, bool c) {
  // basic-block 0 is here
  if (c) {
    // basic-block 1 is here
    return a;
  } else {
    // basic-block 2 is here
    return b;
  }
}

In the above example, we have three basic-blocks. There is always an entry basic-block, one that is tied to the function itself. Then that basic-block can branch to one or more other basic-blocks. Each basic-block contains one or more instructions.

So now we know the basics of how LLVM is put together, let’s look at actually doing an optimization. To keep with something trivial, let’s do constant folding. Constant folding is when you have an instruction that takes only constant arguments, and so you can replace the instruction with a single constant value instead:

int foo() {
  return 13 + 42;
}

In the above example you really don’t want the compiler to actually compute 13 + 42 at runtime – what you want is for it to use the constant 55 instead. LLVM already has this sort of optimization, but let’s do it ourselves to work through the process.

We’ll integrate our new code into the parser I used in my last blog post. Looking back at how we got our LLVM module:

LLVMModuleRef module;
if (0 != LLVMParseBitcode2(memoryBuffer, &module)) {
  fprintf(stderr, "Invalid bitcode detected!\n");
  LLVMDisposeMemoryBuffer(memoryBuffer);
  return 1;
}

So we have our LLVM module – now let’s start to look for instructions that use constant arguments. Given an LLVM module, we first have to walk the functions within that module:

for (LLVMValueRef function = LLVMGetFirstFunction(module);
     function;
     function = LLVMGetNextFunction(function)) {
  // ...
}

Then, we walk the basic-blocks of each function:

for (LLVMBasicBlockRef basicBlock = LLVMGetFirstBasicBlock(function);
     basicBlock;
     basicBlock = LLVMGetNextBasicBlock(basicBlock)) {
  // ...
}

And finally we walk the instructions in each basic-block:

for (LLVMValueRef instruction = LLVMGetFirstInstruction(basicBlock);
     instruction;
     instruction = LLVMGetNextInstruction(instruction)) {
  // ...
}

So now we are walking every instruction, of every basic-block, of every function, in our module. Next we need to identify instructions that we want to investigate. To keep things simple, we’ll only try to fold binary operators – things like +, -, * and /.

if (LLVMIsABinaryOperator(instruction)) {

Once we know we’ve got a binary operator, we know we’ve got exactly two operands:

LLVMValueRef x = LLVMGetOperand(instruction, 0);
LLVMValueRef y = LLVMGetOperand(instruction, 1);

And to check if we have constant operands or not we simply do:

const int allConstant = LLVMIsAConstant(x) && LLVMIsAConstant(y);

So if allConstant is true, we know we can fold the binary operator. To fold the operation, we look at the opcode of the binary operator – this identifies which binary operator it actually is. Then, we turn the binary operator into a constant expression which does the same thing. LLVM will do all the hard work for us when we create a constant expression with known constant values – it’ll do the constant folding we want for us.

LLVMValueRef replacementValue = 0;

if (allConstant) {
  switch (LLVMGetInstructionOpcode(instruction)) {
  default:
    break;
  case LLVMAdd:
    replacementValue = LLVMConstAdd(x, y);
    break;
  case LLVMFAdd:
    replacementValue = LLVMConstFAdd(x, y);
    break;
  case LLVMSub:
    replacementValue = LLVMConstSub(x, y);
    break;
  case LLVMFSub:
    replacementValue = LLVMConstFSub(x, y);
    break;
  case LLVMMul:
    replacementValue = LLVMConstMul(x, y);
    break;
  case LLVMFMul:
    replacementValue = LLVMConstFMul(x, y);
    break;
  case LLVMUDiv:
    replacementValue = LLVMConstUDiv(x, y);
    break;
  case LLVMSDiv:
    replacementValue = LLVMConstSDiv(x, y);
    break;
  case LLVMFDiv:
    replacementValue = LLVMConstFDiv(x, y);
    break;
  case LLVMURem:
    replacementValue = LLVMConstURem(x, y);
    break;
  case LLVMSRem:
    replacementValue = LLVMConstSRem(x, y);
    break;
  case LLVMFRem:
    replacementValue = LLVMConstFRem(x, y);
    break;
  case LLVMShl:
    replacementValue = LLVMConstShl(x, y);
    break;
  case LLVMLShr:
    replacementValue = LLVMConstLShr(x, y);
    break;
  case LLVMAShr:
    replacementValue = LLVMConstAShr(x, y);
    break;
  case LLVMAnd:
    replacementValue = LLVMConstAnd(x, y);
    break;
  case LLVMOr:
    replacementValue = LLVMConstOr(x, y);
    break;
  case LLVMXor:
    replacementValue = LLVMConstXor(x, y);
    break;
  }
}

Now we have our replacement value for the binary operator, we can replace the original instruction with the new constant:

// if we managed to find a more optimal replacement
if (replacementValue) {
  // replace all uses of the old instruction with the new one
  LLVMReplaceAllUsesWith(instruction, replacementValue);

  // erase the instruction that we've replaced
  LLVMInstructionEraseFromParent(instruction);
}

And that’s it! Let’s run this on a simple example I’ve knocked together:

define i32 @foo() {
  %a = mul i32 13, 46
  %1 = add i32 4, %a
  %2 = sub i32 %1, 400
  %3 = shl i32 %2, 2
  ret i32 %3
}

We can run our code and everything… explodes? You’ll get some horrific segfault deep in LLVM and, if you don’t know the cause, run away screaming. Basically, when hacking on an LLVM module you’ve got to be super careful about deleting instructions, basic-blocks or functions while you are still iterating through the lists. You either need to store all the values and do all the replacements after you’ve finished with the iterators, or handle the replacements very carefully.

So the way I do it when iterating using the C API is to remember the last instruction in the instruction stream, and then if I happen to replace the current instruction and delete it, I know I can look at the last instruction and pick up the remainder of the instruction stream from there.

LLVMValueRef lastInstruction = 0;

// loop through all the instructions in the basic block
for (LLVMValueRef instruction =
  LLVMGetFirstInstruction(basicBlock);
  instruction;) {
  LLVMValueRef replacementValue = 0;

  // ...

  // if we managed to find a more optimal replacement
  if (replacementValue) {
    // replace all uses of the old instruction with the new one
    LLVMReplaceAllUsesWith(instruction, replacementValue);

    // erase the instruction that we've replaced
    LLVMInstructionEraseFromParent(instruction);

    // if we don't have a previous instruction, get the first
    // one from the basic block again
    if (!lastInstruction) {
      instruction = LLVMGetFirstInstruction(basicBlock);
    } else {
      instruction = LLVMGetNextInstruction(lastInstruction);
    }
  } else {
    lastInstruction = instruction;
    instruction = LLVMGetNextInstruction(instruction);
  }
}

So now we are correctly handling the case where we remove instructions from the instruction stream. If we compile and run everything again, what do we get? For the example I showed above, after running my bitcode read -> constant fold -> bitcode write, I get the following LLVM IR:

define i32 @foo() {
  ret i32 808
}

Nice! That looks like much better code.

I will note that this is not how real LLVM passes work (they are full C++ with classes and templates), but it does allow you to easily work with LLVM IR yourself.

I’ve updated my llvm_bc_parsing_example to include this optimization code.

I hope this proves useful to any budding compiler engineers who want to start tinkering with LLVM!

29 Mar

How to read & write LLVM bitcode

I’ve read multiple posts on social media now complaining about how scary LLVM is to get to grips with. It doesn’t help that the repository is ginormous, there are frequently hundreds of commits a day, the mailing list is nearly impossible to keep track of, and the resulting executables are topping 40MB now…

Those tidbits aside – LLVM is super easy to work with once you get to grips with the beast that it is. To help people get started with LLVM, I thought I’d put together the most trivial no-op example you can do with it – parsing one of LLVM’s intermediate representation files (known as bitcode, file extension .bc) and then writing it back out.

Firstly, let’s go through some high level LLVM terms:

  • LLVM’s main abstraction for user code is the Module. It’s a class that contains all the functions, global variables, and instructions for the code you or other users write.
  • Bitcode files are effectively a serialization of an LLVM Module such that it can be reconstructed in a different program later.
  • LLVM uses MemoryBuffer objects to handle data that comes from files, stdin, or arrays.

For my example, we’ll use the LLVM C API – a more stable abstraction on top of LLVM’s core C++ headers. The C API is really useful if you’ve got code that you want to work with multiple versions of LLVM – it’s significantly more stable than the LLVM C++ headers. (As an aside, I use LLVM extensively for my job and nearly every week some LLVM C++ header change will break our code. I’ve never had the C API break my code.)

First off, I’m going to assume you’ve pulled LLVM, built it, and installed it. Some simple steps to do this:

git clone https://git.llvm.org/git/llvm.git <llvm dir>
cd <llvm dir>
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=install ..
cmake --build . --target install

After doing the above, you’ll have an LLVM install in <llvm dir>/build/install!

So for our little executable I’ve used CMake. CMake is by far the easiest way to integrate with LLVM as it is the build system LLVM also uses.

cmake_minimum_required(VERSION 3.4.3)
project(llvm_bc_parsing_example)

# option to allow a user to specify where an LLVM install is on the system
set(LLVM_INSTALL_DIR "" CACHE STRING "An LLVM install directory.")

if("${LLVM_INSTALL_DIR}" STREQUAL "")
  message(FATAL_ERROR "LLVM_INSTALL_DIR not set! Set it to the location of an LLVM install.")
endif()

# fixup paths to only use the Linux convention
string(REPLACE "\\" "/" LLVM_INSTALL_DIR ${LLVM_INSTALL_DIR})

# tell CMake where LLVM's module is
list(APPEND CMAKE_MODULE_PATH ${LLVM_INSTALL_DIR}/lib/cmake/llvm)

# include LLVM
include(LLVMConfig)

add_executable(llvm_bc_parsing_example main.c)

target_include_directories(llvm_bc_parsing_example PUBLIC ${LLVM_INCLUDE_DIRS})

target_link_libraries(llvm_bc_parsing_example PUBLIC LLVMBitReader LLVMBitWriter)

So now that we’ve got our CMake setup and can use our existing LLVM install, we can get working on our actual C code!

So to use the LLVM C API there is one header you basically always need:

#include <llvm-c/Core.h>

And two extra headers we need for our executable are the bitcode reader and writer:

#include <llvm-c/BitReader.h>
#include <llvm-c/BitWriter.h>

Now we create our main function. I’m assuming here that we always take exactly two command line arguments – the first being the input file, the second being the output file. LLVM has a convention whereby if a file named ‘-’ is provided, that means read from stdin or write to stdout, so I decided to support that too:

if (3 != argc) {
  fprintf(stderr, "Invalid command line!\n");
  return 1;
}

const char *const inputFilename = argv[1];
const char *const outputFilename = argv[2];

So first we parse the input file. We’ll create an LLVM memory buffer object from either stdin, or a filename:

LLVMMemoryBufferRef memoryBuffer;

// check if we are to read our input file from stdin
if (('-' == inputFilename[0]) && ('\0' == inputFilename[1])) {
  char *message;
  if (0 != LLVMCreateMemoryBufferWithSTDIN(&memoryBuffer, &message)) {
    fprintf(stderr, "%s\n", message);
    free(message);
    return 1;
  }
} else {
  char *message;
  if (0 != LLVMCreateMemoryBufferWithContentsOfFile(
               inputFilename, &memoryBuffer, &message)) {
    fprintf(stderr, "%s\n", message);
    free(message);
    return 1;
  }
}

So after this code, memoryBuffer will be usable to read our bitcode file into an LLVM module. So let’s create the module!

// now create our module using the memory buffer
LLVMModuleRef module;
if (0 != LLVMParseBitcode2(memoryBuffer, &module)) {
  fprintf(stderr, "Invalid bitcode detected!\n");
  LLVMDisposeMemoryBuffer(memoryBuffer);
  return 1;
}

// done with the memory buffer now, so dispose of it
LLVMDisposeMemoryBuffer(memoryBuffer);

Once we’ve got our module, we no longer need the memory buffer, so we can free up the memory straight away. And that’s it! We’ve managed to take an LLVM bitcode file and deserialize it into an LLVM module, which we could (though I’m not going to in this blog post at least!) fiddle with. So let’s assume you’ve done all you wanted with the LLVM module, and want to write the sucker back out to a bitcode file again.

The approach mirrors the reading approach – we look for the special filename ‘-’ and handle it accordingly:

// check if we are to write our output file to stdout
if (('-' == outputFilename[0]) && ('\0' == outputFilename[1])) {
  if (0 != LLVMWriteBitcodeToFD(module, STDOUT_FILENO, 0, 0)) {
    fprintf(stderr, "Failed to write bitcode to stdout!\n");
    LLVMDisposeModule(module);
    return 1;
  }
} else {
  if (0 != LLVMWriteBitcodeToFile(module, outputFilename)) {
    fprintf(stderr, "Failed to write bitcode to file!\n");
    LLVMDisposeModule(module);
    return 1;
  }
}

Lastly, we should be good citizens and clean up our garbage, so we dispose of the module too:

LLVMDisposeModule(module);

And that’s it! You are now able to parse and then write an LLVM bitcode file. I’ve put the full example up here on GitHub – https://github.com/sheredom/llvm_bc_parsing_example.

Maybe I’ll do a follow-up post at some point taking you through the basics of how to do things to an LLVM module, but for now, adieu!

11 Feb

Adding Latin and Greek support to utf8.h!

I’ve had a lot of people using my unicode library utf8.h since its release – thanks to all who’ve found the library useful and provided feedback!

One of the things that I’d scrimped on previously was my support for case-insensitive comparisons for letters beyond those in ASCII. I knew little about this, but when a user requested that my library also support accented Latin characters, and later Greek symbols, I jumped at the chance to add support.

The following utf8.h functions have been modified to support the Latin-1 Supplement, Latin Extended-A, Latin Extended-B, and Greek & Coptic Unicode blocks: utf8casecmp, utf8ncasecmp, utf8casestr, utf8isupper, utf8islower, utf8lwr, and utf8upr. I’ve also added two new functions, utf8lwrcodepoint and utf8uprcodepoint, that make a single codepoint lower or upper case.
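
As a quick flavour of the two new functions, here’s a tiny usage sketch – I’m assuming here that the codepoint functions simply take and return a utf8_int32_t, so check utf8.h for the real declarations:

#include <stdio.h>

#include "utf8.h"

int main(void) {
  // 0x00e9 is 'é' - we expect its upper case variant 0x00c9 'É'
  printf("0x%04x\n", (unsigned)utf8uprcodepoint(0x00e9)); // 0x00c9
  // 0x0391 is 'Α' - we expect its lower case variant 0x03b1 'α'
  printf("0x%04x\n", (unsigned)utf8lwrcodepoint(0x0391)); // 0x03b1

  // case-insensitive comparison now handles accented latin too
  printf("%d\n", utf8casecmp("caf\xc3\xa9", "CAF\xc3\x89")); // 0

  return 0;
}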

The main logic of how you convert between the lower and upper cases is both slightly concise and utterly disgusting. Let’s take a look at the code to convert a lower case codepoint to an upper case one.

For ASCII characters, and also some Latin and Greek, the upper case codepoints are simply 32 places below the lower case ones:

if ((('a' <= cp) && ('z' >= cp)) ||
    ((0x00e0 <= cp) && (0x00f6 >= cp)) ||
    ((0x00f8 <= cp) && (0x00fe >= cp)) ||
    ((0x03b1 <= cp) && (0x03c1 >= cp)) ||
    ((0x03c3 <= cp) && (0x03cb >= cp))) {
  cp -= 32;
}

The next set of codepoints are offset by 1 between the lower and upper case variants. Depending on whether the lower case codepoint is the even or the odd one of the pair, we have two if statements that handle both cases – for example, ā (0x0101) becomes Ā (0x0100) by clearing the bottom bit, while ĺ (0x013a) becomes Ĺ (0x0139) via the subtract-and-or below:

if (((0x0100 <= cp) && (0x012f >= cp)) ||
    ((0x0132 <= cp) && (0x0137 >= cp)) ||
    ((0x014a <= cp) && (0x0177 >= cp)) ||
    ((0x0182 <= cp) && (0x0185 >= cp)) ||
    ((0x01a0 <= cp) && (0x01a5 >= cp)) ||
    ((0x01de <= cp) && (0x01ef >= cp)) ||
    ((0x01f8 <= cp) && (0x021f >= cp)) ||
    ((0x0222 <= cp) && (0x0233 >= cp)) ||
    ((0x0246 <= cp) && (0x024f >= cp)) ||
    ((0x03d8 <= cp) && (0x03ef >= cp))) {
  cp &= ~0x1;
}

if (((0x0139 <= cp) && (0x0148 >= cp)) ||
    ((0x0179 <= cp) && (0x017e >= cp)) ||
    ((0x01af <= cp) && (0x01b0 >= cp)) ||
    ((0x01b3 <= cp) && (0x01b6 >= cp)) ||
    ((0x01cd <= cp) && (0x01dc >= cp))) {
  cp -= 1;
  cp |= 0x1;
}

And lastly, for all other codepoints in the ranges that don’t have any sane approach whatsoever, we’ll fire them all into a single big switch statement:

switch (cp) {
  default: break;
  case 0x00ff: cp = 0x0178; break;
  case 0x0180: cp = 0x0243; break;
  case 0x01dd: cp = 0x018e; break;
  case 0x019a: cp = 0x023d; break;
  case 0x019e: cp = 0x0220; break;
  case 0x0292: cp = 0x01b7; break;
  case 0x01c6: cp = 0x01c4; break;
  case 0x01c9: cp = 0x01c7; break;
  case 0x01cc: cp = 0x01ca; break;
  case 0x01f3: cp = 0x01f1; break;
  case 0x01bf: cp = 0x01f7; break;
  case 0x0188: cp = 0x0187; break;
  case 0x018c: cp = 0x018b; break;
  case 0x0192: cp = 0x0191; break;
  case 0x0199: cp = 0x0198; break;
  case 0x01a8: cp = 0x01a7; break;
  case 0x01ad: cp = 0x01ac; break;
  case 0x01b0: cp = 0x01af; break;
  case 0x01b9: cp = 0x01b8; break;
  case 0x01bd: cp = 0x01bc; break;
  case 0x01f5: cp = 0x01f4; break;
  case 0x023c: cp = 0x023b; break;
  case 0x0242: cp = 0x0241; break;
  case 0x037b: cp = 0x03fd; break;
  case 0x037c: cp = 0x03fe; break;
  case 0x037d: cp = 0x03ff; break;
  case 0x03f3: cp = 0x037f; break;
  case 0x03ac: cp = 0x0386; break;
  case 0x03ad: cp = 0x0388; break;
  case 0x03ae: cp = 0x0389; break;
  case 0x03af: cp = 0x038a; break;
  case 0x03cc: cp = 0x038c; break;
  case 0x03cd: cp = 0x038e; break;
  case 0x03ce: cp = 0x038f; break;
  case 0x0371: cp = 0x0370; break;
  case 0x0373: cp = 0x0372; break;
  case 0x0377: cp = 0x0376; break;
  case 0x03d1: cp = 0x03f4; break;
  case 0x03d7: cp = 0x03cf; break;
  case 0x03f2: cp = 0x03f9; break;
  case 0x03f8: cp = 0x03f7; break;
  case 0x03fb: cp = 0x03fa; break;
}

With the above, we can handle all the lower/upper case variants for the Latin and Greek characters requested!

I hope these additions prove useful to my users, and if you’ve got any requests yourself, feel free to file them here.

24 Dec

Introducing process.h!

I’ve long been after a cross-platform way to launch processes and interact with them from C & C++. After a brief scouring of the interwebs, I didn’t find any that matched my criteria:

  • No build system dependencies.
  • Works in C and C++.
  • Works on Windows, Linux and macOS.
  • Single header (or at least single header / single source file).

So I did what I always do when confronted with this issue – I wrote my own!

process.h is a cross-platform C & C++ single header library that works on Windows, Linux and macOS. It contains six functions that let you create a child process, get at the stdin, stdout, and stderr of that process, wait on the process to complete, and destroy it.

There are some gotchas to know about:

  • Waiting on a process will close the FILE associated with the stdin of the child. I need to do this so that any child process that is waiting on and reading the entire contents of its stdin can finish.
  • You can destroy a process before it has completed – this will close the stdin, stdout, and stderr of the child process and free the handle of the child process. This allows a child process to outlive the parent process if required – you need to call process_join before process_destroy if you want to wait on and then destroy the launched process (see the sketch below).
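
To give a flavour of how the pieces fit together, here’s a rough usage sketch. The function names follow the descriptions above, but treat the exact signatures as my assumptions – check process.h itself for the real API:

#include <stdio.h>

#include "process.h"

int main(void) {
  // command line as a NULL-terminated argument array (assumed shape)
  const char *const commandLine[] = {"echo", "hello", 0};
  struct process_s process;

  if (0 != process_create(commandLine, &process)) {
    return 1; // failed to launch the child
  }

  // read what the child wrote to its stdout
  FILE *child_stdout = process_stdout(&process);
  char buffer[64];
  if (fgets(buffer, sizeof(buffer), child_stdout)) {
    printf("child said: %s", buffer);
  }

  // wait on the child to complete...
  int ret = -1;
  process_join(&process, &ret);

  // ...and then clean up after it
  process_destroy(&process);

  return ret;
}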

Hopefully this proves useful to some of my followers! I’ve recorded some todos of things I want to add to the library (which will include an interface change at a later date) – so stay tuned.

13 Nov

Cross compiling Sollya to Windows with Emscripten

One component of Codeplay’s ComputeAorta that I manage is our high precision maths library Abacus.

One major requirement of Abacus – and in fact of all math libraries – is Remez-reduced polynomial approximations of functions. In the past we’ve made use of Maple, Mathematica, lolremez, and our own fork of lolremez, and to be honest none of them have been satisfactory for our needs. We want a scriptable solution that we can use to bake the generated polynomials automatically into Abacus with minimal user intervention.

I was lucky enough to be involved in a twitter thread with Marc B. Reynolds where he pointed me at Sollya. It’s Linux only, which sucks (I’m primarily a Windows developer), but I fired up a VM and tried it out – and I’ve got to say, it’s pretty darn good! The non-Windows support is a big issue though, so how to fix that?

Enter stage left – Emscripten!

So I’ve known about Emscripten for a while, but never had a really compelling reason to use it. I suddenly thought ‘I wonder if I could use Emscripten to compile Sollya to JavaScript, then use Node.js to run it on Windows?’.

Yep, you are right, I’m mad. This can’t be a good way to take software meant for Linux and cross compile it for Windows, right? That just made me all the more curious to see if it could work.

Sollya and all its dependencies

Sollya requires a bunch of different projects to work: libxml2, GMP, MPFR, MPFI, fplll, and lastly Sollya itself. So I first downloaded all of these, built them all from source, and built Sollya using gcc on Linux – just to test that I could build it.

Then, using Emscripten’s emconfigure (which you place before the typical Linux call to ./configure, and which replaces any compiler usage with the Emscripten compiler emcc), we can try to build Sollya again – but for JavaScript!

So I started with libxml2, which worked! And then onto GMP – and explosions. Some Stack Overflowing pointed me to Compiling GMP/MPFR with Emscripten, which states that for some reason (I didn’t dig into why) Emscripten couldn’t compile GMP if the host platform was not 32-bit. I looked at the answer where it suggests you chroot, and thought ‘that seems like a lot of work to mess with my current 64-bit VM that I do other things on, I’ll fire up a new VM to mess with’. And since I was creating a new VM anyway, I decided to just create a 32-bit Ubuntu VM and use that instead (which meant less configuration work on my part).

So with my 32-bit VM, I started the whole process of compiling libxml2, gmp, mpfr, mpfi, fplll (wow I’m on a roll!) and finally I get to Sollya and… it failed.

Sollya and dlopen

Sollya makes use of dlopen, and thus the ./configure script in Sollya will check that dlopen is a function that works on the target platform. The problem is, ./configure doesn’t use the correct signature for the dlopen call – it just does:

 extern char dlopen();

and then ensures that the linker doesn’t complain when this is linked against -ldl. The signature of dlopen is:

void *dlopen(const char *, int);

and Emscripten looks for that exact signature, complaining if the function doesn’t have the correct number and type of arguments. This meant that as far as ./configure was concerned, the system didn’t have dlopen (even though Emscripten can stub-implement it), and so it failed.

Ever the hacker, I just decided to patch the ./configure to not error out:

sed -i -e "s/as_fn_error .. \"libdl unusable\"/echo \"skipped\"\#/" ./sollya/configure

tried to build again, and Sollya built!

Emscripten and .bc’s

Emscripten seems to output an LLVM bitcode (.bc) file by default – and I couldn’t work out how to tell emconfigure to output a JavaScript file instead.

So what I did was take the bitcode file that was output as ‘sollya’ and use emcc directly to turn it into a JavaScript file.

emcc complained if the input bitcode file wasn’t named <something>.bc, so I first renamed it to sollya.bc:

cp sollya sollya.bc
emcc sollya.bc -o sollya.js

and I got a whopping 27MB JavaScript file out!

Next I used node to run this JavaScript against a simple test script I wrote:

print("Single precision:");
r=[1/sqrt(2)-1; sqrt(2)-1];
f=log2(1+x)/x;
p=fpminimax(f, 11, [|single...|], r, floating, relative);
p;
print("\nDouble precision:");
p=fpminimax(f, 21, [|double...|], r, floating, relative);
p;

and ran Node.js:

nodejs sollya.js < script.sollya

and it ran!

But it kept on running, like infinite-loop running – the thing just never stopped. I was getting a ton of ‘sigaction not implemented’ messages, so I wondered if Sollya was doing something really ugly with signals to handle exiting from a script. I thought about digging into it, then realised Sollya has an explicit ‘quit;’ command, so I added that to the bottom of the script:

print("Single precision:");
r=[1/sqrt(2)-1; sqrt(2)-1];
f=log2(1+x)/x;
p=fpminimax(f, 11, [|single...|], r, floating, relative);
p;
asd;
print("\nDouble precision:");
p=fpminimax(f, 21, [|double...|], r, floating, relative);
p;
quit;

and it ran and exited as expected.

> Single precision:
> Warning: at least one of the given expressions is not a constant but requires evaluation.
Evaluation is guaranteed to ensure the inclusion property. The approximate result is at least 165 bit accurate.
> > > 1.44269502162933349609375 + x * (-0.7213475704193115234375 + x * (0.4809020459651947021484375 + x * (-0.360668718814849853515625 + x * (0.2883343398571014404296875 + x * (-0.24055089056491851806640625 + x * (0.21089743077754974365234375 + x * (-0.1813324391841888427734375 + x * (0.10872711241245269775390625 + x * (-0.10412885248661041259765625 + x * (0.35098421573638916015625 + x * (-0.383228302001953125)))))))))))
> Warning: the identifier "asd" is neither assigned to, nor bound to a library function nor external procedure, nor equal to the current free variable.
Will interpret "asd" as "x".
x
> 
Double precision:
> > 1.44269504088896338700465094007086008787155151367187 + x * (-0.72134752044448169350232547003543004393577575683594 + x * (0.4808983469630028761976348050666274502873420715332 + x * (-0.36067376022224723053355432966782245784997940063477 + x * (0.288539008174513611493239295668900012969970703125 + x * (-0.24044917347913088989663776828820118680596351623535 + x * (0.20609929188248227172053361755388323217630386352539 + x * (-0.18033688048265933412395156665297690778970718383789 + x * (0.160299431107568057797152505372650921344757080078125 + x * (-0.144269475404082331282396012284152675420045852661133 + x * (0.13115467750388201673139576541871065273880958557129 + x * (-0.120225818807988840686284959247132064774632453918457 + x * (0.110964912764316969706612781010335311293601989746094 + x * (-0.103018221150312991318820365904684877023100852966309 + x * (9.6317404417675320238423353202961152419447898864746e-2 + x * (-9.0652910508716211257507211485062725841999053955078e-2 + x * (8.4035326134831819788750806310417829081416130065918e-2 + x * (-7.5783141066360651394440139938524225726723670959473e-2 + x * (7.650699022117241065998882731946650892496109008789e-2 + x * (-9.2331285631306825312236696845502592623233795166016e-2 + x * (8.7941823766079466051515112212655367329716682434082e-2 + x * (-3.8635539215562890447142052607887308113276958465576e-2)))))))))))))))))))))

So now I have a JavaScript file that works when I run it through Node.js, but we’ve got a couple of issues:

  • The JavaScript is freaking huge!
  • We don’t want our developers to have to install Node.js either.

File size

Digging into Emscripten I found that there were a couple of options we could use:

  • -O3 – same as all compilers, we can specify that the compiler should optimize the code heavily.
  • --llvm-lto 2 – this enables all the optimizations to occur on the entire set of bitcode files once they are all linked together. This allows a ton more inlining to take place, which should help our performance.

Adding both these options, the size of the produced sollya.js was 4.1MB! A whopping 6.5x reduction in file size – and it’s actually optimized properly now too.

Creating a standalone Windows binary?

So I’ve got sollya.js – and I can run this with Node.js on Windows and get actual valid polynomials. But I really want a standalone executable that has no dependencies, is this possible? Searching around, I found out about nexe – a way to bundle a Node.js application into a single executable. It basically puts Node.js and the JavaScript file into the same executable, and calls Node.js on the JavaScript at runtime. While this isn’t amazing – would it work?

First off – you have to use nexe on the platform you want to run the end executable on – so I copied the sollya.js from my VM to my Windows host, and then after installing nexe I ran:

nexe -i sollya.js -f -o sollya.exe

And what do you know – I can run sollya.exe and it works as expected. The downside is that because the executable is shipping an entire copy of Node.js with it – sollya.exe is a whopping 29MB to ship around.

Performance

I’ve compared the natively compiled sollya executable with the JavaScript variant. I ran them 50 times, and averaged out the results.

  • sollya (native): 1.37144s
  • sollya.js (Node.js): 4.93946s
  • JS vs native performance: 3.6x slower

So as expected – given that we are running through JavaScript and Node.js, we are 3.6x slower than the natively compiled executable. I’m honestly surprised we are not slower (I’d heard horror stories of 90x slowdowns with Emscripten) so this seems not too bad to me.

Conclusion

It seems that with Emscripten, in combination with Node.js and Nexe, we can compile a program on Linux to be run entirely on Windows – which is pretty freaking cool. There are probably many other more sane ways to do exactly this, but I find it pretty amazing that this is even possible. Now I can ‘natively’ run a Windows executable which will calculate all the polynomial approximations I need on Windows too – saving our team from having to have a Linux VM when re-generating the polynomials is required.

CMake script to build Sollya with Emscripten

In case anyone is interested, here is the CMake file I use to bring in all the dependencies and build Sollya using Emscripten.

cmake_minimum_required(VERSION 3.4)
project(emsollya)

include(ExternalProject)

ExternalProject_Add(libxml2
  PREFIX ${CMAKE_BINARY_DIR}/libxml2
  URL ftp://xmlsoft.org/libxml2/libxml2-git-snapshot.tar.gz
  PATCH_COMMAND NOCONFIGURE=1 sh ${CMAKE_BINARY_DIR}/libxml2/src/libxml2/autogen.sh
  CONFIGURE_COMMAND emconfigure ${CMAKE_BINARY_DIR}/libxml2/src/libxml2/configure
    --disable-shared --without-python --prefix=${CMAKE_BINARY_DIR}/libxml2
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Add(gmp
  PREFIX ${CMAKE_BINARY_DIR}/gmp
  URL https://gmplib.org/download/gmp/gmp-6.1.2.tar.bz2
  CONFIGURE_COMMAND emconfigure ${CMAKE_BINARY_DIR}/gmp/src/gmp/configure
    --disable-assembly --enable-cxx --disable-shared
    --prefix=${CMAKE_BINARY_DIR}/gmp
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Add(mpfr
  DEPENDS gmp
  PREFIX ${CMAKE_BINARY_DIR}/mpfr
  URL http://www.mpfr.org/mpfr-current/mpfr-3.1.6.tar.bz2
  CONFIGURE_COMMAND emconfigure ${CMAKE_BINARY_DIR}/mpfr/src/mpfr/configure
    --disable-shared --with-gmp=${CMAKE_BINARY_DIR}/gmp
    --prefix=${CMAKE_BINARY_DIR}/mpfr
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Add(mpfi
  DEPENDS gmp mpfr
  PREFIX ${CMAKE_BINARY_DIR}/mpfi
  URL https://gforge.inria.fr/frs/download.php/file/30129/mpfi-1.5.1.tar.bz2
  CONFIGURE_COMMAND emconfigure ${CMAKE_BINARY_DIR}/mpfi/src/mpfi/configure
    --disable-shared --with-gmp=${CMAKE_BINARY_DIR}/gmp
    --with-mpfr=${CMAKE_BINARY_DIR}/mpfr
    --prefix=${CMAKE_BINARY_DIR}/mpfi
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Add(fplll
  DEPENDS gmp mpfr
  PREFIX ${CMAKE_BINARY_DIR}/fplll
  GIT_REPOSITORY https://github.com/fplll/fplll.git
  GIT_TAG cd47f76b017762317245de7878c7b41eff9ab5d0
  PATCH_COMMAND sh ${CMAKE_BINARY_DIR}/fplll/src/fplll/autogen.sh
  CONFIGURE_COMMAND emconfigure ${CMAKE_BINARY_DIR}/fplll/src/fplll/configure
    --disable-shared --with-gmp=${CMAKE_BINARY_DIR}/gmp
    --with-mpfr=${CMAKE_BINARY_DIR}/mpfr
    --prefix=${CMAKE_BINARY_DIR}/fplll
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Add(sollya
  DEPENDS gmp mpfr mpfi fplll libxml2
  PREFIX ${CMAKE_BINARY_DIR}/sollya
  URL http://sollya.gforge.inria.fr/sollya-weekly-11-05-2017.tar.bz2
  PATCH_COMMAND sed -i -e "s/as_fn_error .. \"libdl unusable\"/echo \"skipped\"\#/"
    ${CMAKE_BINARY_DIR}/sollya/src/sollya/configure
  CONFIGURE_COMMAND EMCONFIGURE_JS=1 emconfigure
    ${CMAKE_BINARY_DIR}/sollya/src/sollya/configure
    --disable-shared --with-gmp=${CMAKE_BINARY_DIR}/gmp
    --with-fplll=${CMAKE_BINARY_DIR}/fplll
    --with-mpfi=${CMAKE_BINARY_DIR}/mpfi
    --with-mpfr=${CMAKE_BINARY_DIR}/mpfr
    --with-xml2=${CMAKE_BINARY_DIR}/libxml2
    --prefix=${CMAKE_BINARY_DIR}/sollya
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Get_Property(sollya BINARY_DIR)

add_custom_command(OUTPUT ${CMAKE_BINARY_DIR}/sollya.js
  COMMAND cmake -E copy ${BINARY_DIR}/sollya ${CMAKE_BINARY_DIR}/sollya.bc
  COMMAND emcc --memory-init-file 0 -O3 --llvm-lto 2 ${CMAKE_BINARY_DIR}/sollya.bc -o ${CMAKE_BINARY_DIR}/sollya.js
  DEPENDS ${BINARY_DIR}/sollya
)

add_custom_target(sollya_js ALL DEPENDS ${CMAKE_BINARY_DIR}/sollya.js)

add_dependencies(sollya_js sollya)

06 Nov

LLVM & CRT – auto-magically selecting the correct CRT

LLVM comes with a really useful set of options, LLVM_USE_CRT_<config>, which allow you to specify a different C RunTime (CRT) when compiling with Visual Studio. If you want to compile LLVM as a release build, but compile some code that uses it in debug (e.g. our ComputeAorta product that allows customers to implement OpenCL/Vulkan on their hardware insanely quickly), Visual Studio will complain about mixing the differing versions of the CRT. By using the LLVM_USE_CRT_<config> options, we can specify that LLVM compiles as a release build, but using a debug CRT.

There is one annoying catch with this though – LLVM is expensive to build. We’ll average 10 minutes build time for a full build of LLVM. We don’t want to recompile LLVM, and we don’t want to be constantly building different copies of LLVM for 2-4 different versions of the CRT every time we pull in the latest commits. We want to be able to change a ComputeAorta build between debug/release without having to rebuild LLVM, and we want all this to just work™ without any manual input from a developer.

Changing LLVM

So what we need to do is detect which CRT LLVM was built against. My first thought was to have LLVM export which CRT it was built against into an LLVM install. LLVM already outputs an LLVMConfig.cmake during its install process, so why not just record what CRT was used too? I contacted the LLVM mailing list asking... and got no response. I’ve found in general that if you are not a super active contributor located in the Bay Area, this is a common occurrence. Not wanting to be that guy that nags on the mailing list about things no-one else clearly cares about, how else could I solve it?

Detecting the CRT

So I reasoned that since the Visual Studio linker can detect and give a good error message when you accidentally mix CRT versions, there must be some information recorded in the library files produced by Visual Studio that says which CRT a library was linked against. Using dumpbin.exe (which is included with Visual Studio) I first called:

$ dumpbin /? 
Microsoft (R) COFF/PE Dumper Version 14.00.24215.1 
Copyright (C) Microsoft Corporation. All rights reserved. 
 
usage: DUMPBIN [options] [files] 
 
 options: 
 
 /ALL 
 /ARCHIVEMEMBERS 
 /CLRHEADER 
 /DEPENDENTS 
 /DIRECTIVES 
 /DISASM[:{BYTES|NOBYTES}] 
 /ERRORREPORT:{NONE|PROMPT|QUEUE|SEND} 
 /EXPORTS 
 /FPO 
 /HEADERS 
 /IMPORTS[:filename] 
 /LINENUMBERS 
 /LINKERMEMBER[:{1|2}] 
 /LOADCONFIG 
 /NOLOGO 
 /OUT:filename 
 /PDATA 
 /PDBPATH[:VERBOSE] 
 /RANGE:vaMin[,vaMax] 
 /RAWDATA[:{NONE|1|2|4|8}[,#]] 
 /RELOCATIONS 
 /SECTION:name 
 /SUMMARY 
 /SYMBOLS 
 /TLS 
 /UNWINDINFO

And through a process of elimination I ran the ‘/DIRECTIVES’ command against one of the .lib files in LLVM which gave:

$ dumpbin /DIRECTIVES LLVMCore.lib
Microsoft (R) COFF/PE Dumper Version 14.00.24215.1
Copyright (C) Microsoft Corporation.  All rights reserved.


Dump of file LLVMCore.lib

File Type: LIBRARY

   Linker Directives
   -----------------
   /FAILIFMISMATCH:_MSC_VER=1900
   /FAILIFMISMATCH:_ITERATOR_DEBUG_LEVEL=2
   /FAILIFMISMATCH:RuntimeLibrary=MDd_DynamicDebug
   /DEFAULTLIB:msvcprtd
   /FAILIFMISMATCH:_CRT_STDIO_ISO_WIDE_SPECIFIERS=0
   /FAILIFMISMATCH:LLVM_ENABLE_ABI_BREAKING_CHECKS=1
   /DEFAULTLIB:MSVCRTD
   /DEFAULTLIB:OLDNAMES

...

And what do you know – ‘/FAILIFMISMATCH:RuntimeLibrary=MDd_DynamicDebug’ is telling the linker to output an error message if the CRT is not the dynamic debug variant! So now I have a method of detecting the CRT from one of LLVM’s libraries – how do I incorporate that into our build?

CMake Integration

LLVM uses CMake for its builds, and thus we also use CMake for our builds. We already include LLVM by specifying the location of an LLVM install like:

$ cmake -DCA_LLVM_INSTALL_DIR=<directory> .
-- Overriding option 'CA_LLVM_INSTALL_DIR' to '<directory>' (default was '').

And then within our CMake we do:

# Setup LLVM/Clang search paths.
list(APPEND CMAKE_MODULE_PATH
  ${CA_LLVM_INSTALL_DIR}/lib/cmake/llvm
  ${CA_LLVM_INSTALL_DIR}/lib/cmake/clang)

# Include LLVM.
include(LLVMConfig)

# Include Clang.
include(ClangTargets)

So I added a new DetectLLVMMSVCCRT.cmake to our CMake modules and included it just after the ClangTargets include. This does the following:

  • Get the directory of CMAKE_C_COMPILER (always cl.exe in our case).
  • Look for dumpbin.exe in the same directory.
  • Get the location of LLVMCore.lib.
    • My reasoning is that most libraries in LLVM could change over time, but the core library of LLVM is unlikely to be moved (I hope!).
  • Run dumpbin /DIRECTIVES LLVMCore.lib
    • Find the first usage of ‘/FAILIFMISMATCH:RuntimeLibrary=’
    • Get the string that occurs between ‘/FAILIFMISMATCH:RuntimeLibrary=’ and the next ‘_’

And then we’ve got the CRT we need to use to build with. To actually set the CRT to use, we can just call LLVM’s ChooseMSVCCRT.cmake (that ships in an LLVM install), specifying the LLVM_USE_CRT_<config> variables and voila, we’ll be using the same CRT as LLVM, and get no linker errors!

The full CMake script is:

if(NOT CMAKE_SYSTEM_NAME STREQUAL Windows)
  return()
endif()

# Get the directory of cl.exe
get_filename_component(tools_dir "${CMAKE_C_COMPILER}" DIRECTORY)

# Find the dumpbin.exe executable in the directory of cl.exe
find_program(dumpbin "dumpbin.exe" PATHS "${tools_dir}" NO_DEFAULT_PATH)

if("${dumpbin}" STREQUAL "dumpbin-NOTFOUND")
  message(WARNING "Could not detect which CRT LLVM was built against - "
                  "could not find 'dumpbin.exe'.")
  return()
endif()

# Get the location in the file-system of LLVMCore.lib
get_target_property(llvmcore LLVMCore LOCATION)

if("${llvmcore}" STREQUAL "llvmcore-NOTFOUND")
  message(WARNING "Could not detect which CRT LLVM was built against - "
                  "could not find location of 'LLVMCore.lib'.")
  return()
endif()

# Get the directives that LLVMCore.lib contains
execute_process(COMMAND "${dumpbin}" "/DIRECTIVES" "${llvmcore}"
  OUTPUT_VARIABLE output)

# Find the first directive specifying what CRT to use
string(FIND "${output}" "/FAILIFMISMATCH:RuntimeLibrary=" position)

# Strip away everything but the directive we want to examine
string(SUBSTRING "${output}" ${position} 128 output)

# Remove the directive prefix which we don't need
string(REPLACE "/FAILIFMISMATCH:RuntimeLibrary=" "" output "${output}")

# Get the position of the '_' character that breaks the CRT from all else
string(FIND "${output}" "_" position)

# Substring output to be one of the four CRT values: MDd MD MTd MT
string(SUBSTRING "${output}" 0 ${position} output)

# Set all possible CMAKE_BUILD_TYPE's to the CRT that LLVM was linked against
set(LLVM_USE_CRT_DEBUG "${output}")
set(LLVM_USE_CRT_RELWITHDEBINFO "${output}")
set(LLVM_USE_CRT_MINSIZEREL "${output}")
set(LLVM_USE_CRT_RELEASE "${output}")

# Include the LLVM cmake module to choose the correct CRT
include(ChooseMSVCCRT)

Conclusion

We’ve been able to do what we set out to do – auto-magically make our project that uses an LLVM install work reliably even with mixed Debug/Release builds. This has halved the number of LLVM compiles I do daily (yay) and also allowed me to stop tracking (and caring about) CRT conflicts and how to avoid them.

28 Oct

Slides from my Khronos Munich Chapter talk

I gave a talk on Friday 13th of October 2017 at the Khronos Munich Chapter titled ‘OpenCL to Vulkan: A Porting Guide’. I covered how to port from the OpenCL API to the Vulkan API, some common problems our customers have faced, and how to fix them. The slides are available here.

The talk covered some of the major pitfalls our customers have had in porting OpenCL applications to Vulkan, and also briefly covered the work we did in collaboration with Google and Adobe – clspv.

I hope the slide deck is useful to those of you who couldn’t attend in person.

07 Sep

I’m speaking at the Munich Khronos Chapter Meeting 13th October 2017

Previously I had begun a series of blog posts detailing how to port applications from OpenCL -> Vulkan.

  1. OpenCL -> Vulkan: A Porting Guide (#1)
  2. OpenCL -> Vulkan: A Porting Guide (#2)
  3. OpenCL -> Vulkan: A Porting Guide (#3)

Instead of continuing this blog series, I’m converting the entire contents into a slide deck, and will be presenting it at the Munich Khronos Chapter meeting on the 13th of October 2017.

So please come along and watch me and the other great speakers talk about some fun things you can do with Vulkan!

Look forward to seeing y’all there.