05 Apr

Optimizing LLVM IR from the C API

After my previous post on how to read & write LLVM bitcode, I thought I’d follow it up with a post on actually modifying LLVM bitcode files after you’ve read them. LLVM comes with extensive built-in optimization passes, but also plenty of scope to do your own optimizations too.

First off, remember that when we parse an LLVM bitcode file, we get an LLVM module. So what exactly is an LLVM module?

An LLVM module is basically a collection of global variables and functions. Functions contain basic-blocks. The simplest way to think about a basic-block is as a scope in a C/C++ file:

int foo(int a, int b, bool c) {
  // basic-block 0 is here
  if (c) {
    // basic-block 1 is here
    return a;
  } else {
    // basic-block 2 is here
    return b;
  }
}

In the above example, we have three basic-blocks. There is always an entry basic-block, one that is tied to the function itself. Then that basic-block can branch to one or more other basic-blocks. Each basic-block contains one or more instructions.
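
For instance, with the C API you can grab the entry basic-block of a function and count how many basic-blocks it contains – a quick sketch, assuming we already have an LLVMValueRef for a function (we'll see how to get one shortly):

// get the entry basic-block - the one tied to the function itself
LLVMBasicBlockRef entry = LLVMGetEntryBasicBlock(function);

// count how many basic-blocks the function contains
unsigned numBasicBlocks = LLVMCountBasicBlocks(function);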

So now we know the basics of how LLVM is put together, let's look at actually doing an optimization. Just to keep with something trivial, let's do constant folding. Constant folding is when you have an instruction that takes only constant arguments, and so you can replace the instruction with a single constant value instead:

int foo() {
  return 13 + 42;
}

In the above example you really don't want the compiler to actually compute 13 + 42 at runtime; what you want is for it to use the constant 55 instead. LLVM already has this sort of optimization, but let's do it ourselves to work through the process.

We’ll integrate our new code into the parser I used in my last blog post. Looking back at how we got our LLVM module:

LLVMModuleRef module;
if (0 != LLVMParseBitcode2(memoryBuffer, &module)) {
  fprintf(stderr, "Invalid bitcode detected!\n");
  LLVMDisposeMemoryBuffer(memoryBuffer);
  return 1;
}

So we have our LLVM module; now let's start to look for instructions that use constant arguments. Given an LLVM module, we first have to walk the functions within that module:

for (LLVMValueRef function = LLVMGetFirstFunction(module);
     function;
     function = LLVMGetNextFunction(function)) {
  // ...
}

Then, we walk the basic-blocks of each function:

for (LLVMBasicBlockRef basicBlock = LLVMGetFirstBasicBlock(function);
     basicBlock;
     basicBlock = LLVMGetNextBasicBlock(basicBlock)) {
  // ...
}

And finally we walk the instructions in each basic-block:

for (LLVMValueRef instruction = LLVMGetFirstInstruction(basicBlock);
     instruction;
     instruction = LLVMGetNextInstruction(instruction)) {
  // ...
}

So now we are walking every instruction, of every basic-block, of every function, in our module. Next we need to identify the instructions that we want to investigate. Just to keep things simple, we'll only try to fold binary operators – things like +, -, *, and /.

if (LLVMIsABinaryOperator(instruction)) {

Once we know we’ve got a binary operator, we know we’ve got exactly two operands:

LLVMValueRef x = LLVMGetOperand(instruction, 0);
LLVMValueRef y = LLVMGetOperand(instruction, 1);

And to check if we have constant operands or not we simply do:

const int allConstant = LLVMIsAConstant(x) && LLVMIsAConstant(y);

So if allConstant is true, we know we can fold the binary operator. To fold the operation, we look at the opcode of the binary operator – this identifies which binary operator it actually is. Then, we turn the binary operator into a constant expression which does the same thing. LLVM will do all the hard work for us when we create a constant expression with known constant values – it’ll do the constant folding we want for us.
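
As a quick sketch of LLVM doing that hard work for us – building an add of two constant i32's through the constant expression API hands back the already-folded constant:

LLVMTypeRef int32Type = LLVMInt32Type();
LLVMValueRef thirteen = LLVMConstInt(int32Type, 13, 0);
LLVMValueRef fortyTwo = LLVMConstInt(int32Type, 42, 0);

// rather than an add instruction, this is already the constant i32 55
LLVMValueRef fiftyFive = LLVMConstAdd(thirteen, fortyTwo);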

LLVMValueRef replacementValue = 0;

if (allConstant) {
  switch (LLVMGetInstructionOpcode(instruction)) {
  default:
    break;
  case LLVMAdd:
    replacementValue = LLVMConstAdd(x, y);
    break;
  case LLVMFAdd:
    replacementValue = LLVMConstFAdd(x, y);
    break;
  case LLVMSub:
    replacementValue = LLVMConstSub(x, y);
    break;
  case LLVMFSub:
    replacementValue = LLVMConstFSub(x, y);
    break;
  case LLVMMul:
    replacementValue = LLVMConstMul(x, y);
    break;
  case LLVMFMul:
    replacementValue = LLVMConstFMul(x, y);
    break;
  case LLVMUDiv:
    replacementValue = LLVMConstUDiv(x, y);
    break;
  case LLVMSDiv:
    replacementValue = LLVMConstSDiv(x, y);
    break;
  case LLVMFDiv:
    replacementValue = LLVMConstFDiv(x, y);
    break;
  case LLVMURem:
    replacementValue = LLVMConstURem(x, y);
    break;
  case LLVMSRem:
    replacementValue = LLVMConstSRem(x, y);
    break;
  case LLVMFRem:
    replacementValue = LLVMConstFRem(x, y);
    break;
  case LLVMShl:
    replacementValue = LLVMConstShl(x, y);
    break;
  case LLVMLShr:
    replacementValue = LLVMConstLShr(x, y);
    break;
  case LLVMAShr:
    replacementValue = LLVMConstAShr(x, y);
    break;
  case LLVMAnd:
    replacementValue = LLVMConstAnd(x, y);
    break;
  case LLVMOr:
    replacementValue = LLVMConstOr(x, y);
    break;
  case LLVMXor:
    replacementValue = LLVMConstXor(x, y);
    break;
  }
}

Now that we have our replacement value for the binary operator, we can replace the original instruction with the new constant:

// if we managed to find a more optimal replacement
if (replacementValue) {
  // replace all uses of the old instruction with the new one
  LLVMReplaceAllUsesWith(instruction, replacementValue);

  // erase the instruction that we've replaced
  LLVMInstructionEraseFromParent(instruction);
}

And that's it! Let's run this on a simple example I've knocked together:

define i32 @foo() {
  %a = mul i32 13, 46
  %1 = add i32 4, %a
  %2 = sub i32 %1, 400
  %3 = shl i32 %2, 2
  ret i32 %3
}

We can run our code and everything… explodes? You'll get some horrific segfault deep in LLVM and, if you didn't know what the cause was, you'd run away screaming. Basically, when hacking on an LLVM module, you've got to be super careful about deleting instructions, basic-blocks, or functions while you are still iterating through the lists. You either need to store all the values and do all the replacements after you've finished with the iterators, or handle the replacements very carefully.

So the way I do it when iterating using the C API is to remember the last instruction in the instruction stream, and then if I happen to replace the current instruction and delete it, I know I can look at the last instruction and pick up the remainder of the instruction stream from there.

LLVMValueRef lastInstruction = 0;

// loop through all the instructions in the basic block
for (LLVMValueRef instruction = LLVMGetFirstInstruction(basicBlock);
     instruction;) {
  LLVMValueRef replacementValue = 0;

  // ...

  // if we managed to find a more optimal replacement
  if (replacementValue) {
    // replace all uses of the old instruction with the new one
    LLVMReplaceAllUsesWith(instruction, replacementValue);

    // erase the instruction that we've replaced
    LLVMInstructionEraseFromParent(instruction);

    // if we don't have a previous instruction, get the first
    // one from the basic block again
    if (!lastInstruction) {
      instruction = LLVMGetFirstInstruction(basicBlock);
    } else {
      instruction = LLVMGetNextInstruction(lastInstruction);
    }
  } else {
    lastInstruction = instruction;
    instruction = LLVMGetNextInstruction(instruction);
  }
}

So now we are correctly handling the case where we remove instructions from within the instruction stream. If we compile and run everything again, what do we get? For the example I showed above, after I've run my bitcode read -> constant fold -> bitcode write, I get the following LLVM IR:

define i32 @foo() {
  ret i32 808
}

Nice! That looks like much more optimal code.

I will note that this is not how real LLVM passes work (they are full C++ with classes and templates), but it does allow you to easily work with LLVM IR yourself.

I’ve updated my llvm_bc_parse_example to include this optimization code.

I hope this proves useful to any budding compiler engineers who want to start tinkering with LLVM!

29 Mar

How to read & write LLVM bitcode

I've read multiple posts on social media now complaining about how scary LLVM is to get to grips with. It doesn't help that the repository is ginormous, there are frequently hundreds of commits a day, the mailing list is nearly impossible to keep track of, and the resulting executables are topping 40MB now…

Those tidbits aside – LLVM is super easy to work with once you get to grips with the beast that it is. To help aid people in using LLVM, I thought I’d put together the most trivial no-op example you can do with LLVM – parsing one of LLVM’s intermediate representation files (known as bitcode, file extension .bc) and then writing it back out.

Firstly, let's go through some high-level LLVM terms:

  • LLVM’s main abstraction for user code is the Module. It’s a class that contains all the functions, global variables, and instructions for the code you or other users write.
  • Bitcode files are effectively a serialization of an LLVM Module such that it can be reconstructed in a different program later.
  • LLVM uses MemoryBuffer objects to handle data that comes from files, stdin, or arrays.

For my example, we'll use the LLVM C API – a more stable abstraction on top of LLVM's core C++ headers. The C API is really useful if you've got code that you want to work with multiple versions of LLVM; it's significantly more stable than the LLVM C++ headers. (An aside: I use LLVM extensively for my job, and nearly every week some LLVM C++ header change will break our code. I've never had the C API break my code.)

First off, I’m going to assume you’ve pulled LLVM, built and installed it. Some simple steps to do this:

git clone https://git.llvm.org/git/llvm.git <llvm dir>
cd <llvm dir>
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=install ..
cmake --build . --target install

After doing the above, you’ll have an LLVM install in <llvm dir>/build/install!

So for our little executable I’ve used CMake. CMake is by far the easiest way to integrate with LLVM as it is the build system LLVM also uses.

cmake_minimum_required(VERSION 3.4.3)
project(llvm_bc_parsing_example)

# option to allow a user to specify where an LLVM install is on the system
set(LLVM_INSTALL_DIR "" CACHE STRING "An LLVM install directory.")

if("${LLVM_INSTALL_DIR}" STREQUAL "")
  message(FATAL_ERROR "LLVM_INSTALL_DIR not set! Set it to the location of an LLVM install.")
endif()

# fixup paths to only use the Linux convention
string(REPLACE "\\" "/" LLVM_INSTALL_DIR ${LLVM_INSTALL_DIR})

# tell CMake where LLVM's module is
list(APPEND CMAKE_MODULE_PATH ${LLVM_INSTALL_DIR}/lib/cmake/llvm)

# include LLVM
include(LLVMConfig)

add_executable(llvm_bc_parsing_example main.c)

target_include_directories(llvm_bc_parsing_example PUBLIC ${LLVM_INCLUDE_DIRS})

target_link_libraries(llvm_bc_parsing_example PUBLIC LLVMBitReader LLVMBitWriter)

So now we’ve got our CMake setup, and we can use our existing LLVM install, we can now get working on our actual C code!

So to use the LLVM C API there is one header you basically always need:

#include <llvm-c/Core.h>

And two extra headers we need for our executable are the bitcode reader and writer:

#include <llvm-c/BitReader.h>
#include <llvm-c/BitWriter.h>

Now we create our main function. I’m assuming here that we always take exactly 2 command line arguments, the first being the input file, the second being the output file. LLVM has a system whereby if a file named ‘-‘ is provided, that means read from stdin or write to stdout, so I decided to support that too:

if (3 != argc) {
  fprintf(stderr, "Invalid command line!\n");
  return 1;
}

const char *const inputFilename = argv[1];
const char *const outputFilename = argv[2];

So first we parse the input file. We’ll create an LLVM memory buffer object from either stdin, or a filename:

LLVMMemoryBufferRef memoryBuffer;

// check if we are to read our input file from stdin
if (('-' == inputFilename[0]) && ('\0' == inputFilename[1])) {
  char *message;
  if (0 != LLVMCreateMemoryBufferWithSTDIN(&memoryBuffer, &message)) {
    fprintf(stderr, "%s\n", message);
    free(message);
    return 1;
  }
} else {
  char *message;
  if (0 != LLVMCreateMemoryBufferWithContentsOfFile(
               inputFilename, &memoryBuffer, &message)) {
    fprintf(stderr, "%s\n", message);
    free(message);
    return 1;
  }
}

So after this code, memoryBuffer will be usable to read our bitcode file into an LLVM module. So let's create the module!

// now create our module using the memory buffer
LLVMModuleRef module;
if (0 != LLVMParseBitcode2(memoryBuffer, &module)) {
  fprintf(stderr, "Invalid bitcode detected!\n");
  LLVMDisposeMemoryBuffer(memoryBuffer);
  return 1;
}

// done with the memory buffer now, so dispose of it
LLVMDisposeMemoryBuffer(memoryBuffer);

Once we've got our module, we no longer need the memory buffer, so we can free up the memory straight away. And that's it! We've managed to take an LLVM bitcode file and deserialize it into an LLVM module, which we could (I'm not going to in this blog post at least!) fiddle with. So let's assume you've done all you wanted with the LLVM module, and want to write the sucker back out to a bitcode file again.
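
Just to give a taste of fiddling though, here's a minimal sketch that walks the module and prints the name of each function it contains, using LLVMGetValueName from the C API:

// print the name of every function in the module
for (LLVMValueRef function = LLVMGetFirstFunction(module);
     function;
     function = LLVMGetNextFunction(function)) {
  printf("found function '%s'\n", LLVMGetValueName(function));
}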

The writing approach mirrors the reading approach: we look for the special filename '-' and handle it accordingly:

// check if we are to write our output file to stdout
if (('-' == outputFilename[0]) && ('\0' == outputFilename[1])) {
  if (0 != LLVMWriteBitcodeToFD(module, STDOUT_FILENO, 0, 0)) {
    fprintf(stderr, "Failed to write bitcode to stdout!\n");
    LLVMDisposeModule(module);
    return 1;
  }
} else {
  if (0 != LLVMWriteBitcodeToFile(module, outputFilename)) {
    fprintf(stderr, "Failed to write bitcode to file!\n");
    LLVMDisposeModule(module);
    return 1;
  }
}

Lastly, we should be good citizens and clean up our garbage, so we also dispose of the module:

LLVMDisposeModule(module);

And that’s it! You are now able to parse and then write an LLVM bitcode file. I’ve put the full example up here on GitHub – https://github.com/sheredom/llvm_bc_parsing_example.

Maybe I’ll do a follow-up post at some point taking you through the basics of how to do things to an LLVM module, but for now, adieu!

11 Feb

Adding Latin and Greek support to utf8.h!

I’ve had a lot of people using my unicode library utf8.h since its release – thanks to all who’ve found the library useful and provided feedback!

One of the things that I'd scrimped on previously was my support for case-insensitive comparisons for letters beyond those in ASCII. I knew little about this, but when a user requested that my library also support accented Latin characters, and later Greek symbols, I jumped at the occasion to add support.

The following utf8.h functions – utf8casecmp, utf8ncasecmp, utf8casestr, utf8isupper, utf8islower, utf8lwr, and utf8upr – have been modified to support the Latin-1 Supplement, Latin Extended-A, Latin Extended-B, and Greek & Coptic unicode sections. I've also added two new functions, utf8lwrcodepoint and utf8uprcodepoint, that'll make a single codepoint upper or lower case.
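
As a quick example of the new codepoint functions – a sketch assuming they take and return a single codepoint value (check utf8.h for the exact types) – uppercasing the codepoint for the accented 'à' yields 'À':

// 0x00e0 is 'a' with a grave accent; uppercasing it should yield 0x00c0
// (assumed signature - utf8uprcodepoint takes and returns one codepoint)
long cp = utf8uprcodepoint(0x00e0);
// cp is now 0x00c0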

The main logic of how you convert between the lower and upper cases is both slightly concise and utterly disgusting. Let's take a look at the code to convert a lower case codepoint to an upper case one.

For ASCII characters, and also some Latin and Greek, the upper case codepoints are simply 32 places below the lower case ones:

if ((('a' <= cp) && ('z' >= cp)) ||
    ((0x00e0 <= cp) && (0x00f6 >= cp)) ||
    ((0x00f8 <= cp) && (0x00fe >= cp)) ||
    ((0x03b1 <= cp) && (0x03c1 >= cp)) ||
    ((0x03c3 <= cp) && (0x03cb >= cp))) {
  cp -= 32;
}

The next set of codepoints are offset by 1 between the lower and upper cased variants. Depending on whether the lower case codepoint was odd or even, we have two if statements that handle both cases:

if (((0x0100 <= cp) && (0x012f >= cp)) ||
    ((0x0132 <= cp) && (0x0137 >= cp)) ||
    ((0x014a <= cp) && (0x0177 >= cp)) ||
    ((0x0182 <= cp) && (0x0185 >= cp)) ||
    ((0x01a0 <= cp) && (0x01a5 >= cp)) ||
    ((0x01de <= cp) && (0x01ef >= cp)) ||
    ((0x01f8 <= cp) && (0x021f >= cp)) ||
    ((0x0222 <= cp) && (0x0233 >= cp)) ||
    ((0x0246 <= cp) && (0x024f >= cp)) ||
    ((0x03d8 <= cp) && (0x03ef >= cp))) {
  cp &= ~0x1;
}

if (((0x0139 <= cp) && (0x0148 >= cp)) ||
    ((0x0179 <= cp) && (0x017e >= cp)) ||
    ((0x01af <= cp) && (0x01b0 >= cp)) ||
    ((0x01b3 <= cp) && (0x01b6 >= cp)) ||
    ((0x01cd <= cp) && (0x01dc >= cp))) {
  cp -= 1;
  cp |= 0x1;
}

And lastly, for all other codepoints in the ranges that don’t have any sane approach whatsoever, we’ll fire them all into a single big switch statement:

switch (cp) {
  default: break;
  case 0x00ff: cp = 0x0178; break;
  case 0x0180: cp = 0x0243; break;
  case 0x01dd: cp = 0x018e; break;
  case 0x019a: cp = 0x023d; break;
  case 0x019e: cp = 0x0220; break;
  case 0x0292: cp = 0x01b7; break;
  case 0x01c6: cp = 0x01c4; break;
  case 0x01c9: cp = 0x01c7; break;
  case 0x01cc: cp = 0x01ca; break;
  case 0x01f3: cp = 0x01f1; break;
  case 0x01bf: cp = 0x01f7; break;
  case 0x0188: cp = 0x0187; break;
  case 0x018c: cp = 0x018b; break;
  case 0x0192: cp = 0x0191; break;
  case 0x0199: cp = 0x0198; break;
  case 0x01a8: cp = 0x01a7; break;
  case 0x01ad: cp = 0x01ac; break;
  case 0x01b0: cp = 0x01af; break;
  case 0x01b9: cp = 0x01b8; break;
  case 0x01bd: cp = 0x01bc; break;
  case 0x01f5: cp = 0x01f4; break;
  case 0x023c: cp = 0x023b; break;
  case 0x0242: cp = 0x0241; break;
  case 0x037b: cp = 0x03fd; break;
  case 0x037c: cp = 0x03fe; break;
  case 0x037d: cp = 0x03ff; break;
  case 0x03f3: cp = 0x037f; break;
  case 0x03ac: cp = 0x0386; break;
  case 0x03ad: cp = 0x0388; break;
  case 0x03ae: cp = 0x0389; break;
  case 0x03af: cp = 0x038a; break;
  case 0x03cc: cp = 0x038c; break;
  case 0x03cd: cp = 0x038e; break;
  case 0x03ce: cp = 0x038f; break;
  case 0x0371: cp = 0x0370; break;
  case 0x0373: cp = 0x0372; break;
  case 0x0377: cp = 0x0376; break;
  case 0x03d1: cp = 0x03f4; break;
  case 0x03d7: cp = 0x03cf; break;
  case 0x03f2: cp = 0x03f9; break;
  case 0x03f8: cp = 0x03f7; break;
  case 0x03fb: cp = 0x03fa; break;
};

With the above, we can handle all the lower/upper case variants for the Latin and Greek characters requested!

I hope these additions are found to be useful to my users, and if you’ve got any requests yourself feel free to file them here.

24 Dec

Introducing process.h!

I’ve long been after a cross-platform way to launch processes and interact with them for use from C & C++. After a brief scouring of the interwebs, I didn’t find any that matched my criteria:

  • No build system dependencies.
  • Works in C and C++.
  • Works on Windows, Linux and macOS.
  • Single header (or at least single header / single source file).

So I did what I always do when confronted with this issue – I wrote my own!

process.h is a cross-platform C & C++ single header library that works on Windows, Linux and macOS. It contains six functions that let you create a child process, grab the stdin, stdout, and stderr of that process, wait on it, and destroy it – there's a minimal usage sketch after the gotchas below.

There are some gotchas to know about:

  • Waiting on a process will close the FILE associated with the stdin of the child. I need to do this so that any child process that is waiting on, and reading the entire contents of, its stdin can finish.
  • You can destroy a process before it has completed – this will close the stdin, stdout, and stderr of the child process and free the handle of the child process. This will allow a child process to outlive the parent process if required – you need to call process_join before process_destroy if you want to wait and then destroy the launched process.
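
Here's that minimal sketch – launching a child, reading its stdout, waiting on it, and destroying it. The exact signatures here are assumptions on my part, so check process.h itself for the real interface:

#include "process.h"

#include <stdio.h>

int main(void) {
  // a NULL-terminated command line for the child process
  const char *const commandLine[] = {"echo", "hello!", 0};

  // assumed signature - check process.h for the exact one
  struct process_s process;
  if (0 != process_create(commandLine, &process)) {
    return 1;
  }

  // read what the child wrote to its stdout
  FILE *childStdout = process_stdout(&process);
  char buffer[128];
  if (fgets(buffer, sizeof(buffer), childStdout)) {
    printf("child said: %s", buffer);
  }

  // wait for the child to exit (this also closes the child's stdin)...
  int returnCode = 0;
  process_join(&process, &returnCode);

  // ... and then free the handle of the child process
  process_destroy(&process);

  return returnCode;
}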

Hopefully this proves useful to some of my followers! I’ve recorded some todos of things I want to add to the library (which will include an interface change at a later date) – so stay tuned.

13 Nov

Cross compiling Sollya to Windows with Emscripten

One component of Codeplay’s ComputeAorta that I manage is our high precision maths library Abacus.

One major requirement of Abacus – and in fact of all math libraries – is having Remez-reduced polynomial approximations of functions. In the past we've made use of Maple, Mathematica, lolremez, and our own fork of lolremez, and to be honest none of them has been satisfactory for our needs. We want a scriptable solution that we can use to bake the generated polynomials automatically into Abacus with minimal user intervention.

I was lucky enough to be involved in a twitter thread with Marc B. Reynolds where he pointed me at Sollya. It's Linux only which sucks (I'm primarily a Windows developer), but I fired up a VM and tried it out – and I've got to say, it's pretty darn good! The lack of Windows support is a big issue though, so how to fix that?

Enter stage left – Emscripten!

So I’ve known about Emscripten for a while, but never had a really compelling reason to use it. I suddenly thought ‘I wonder if I could use Emscripten to compile Sollya to JavaScript, then use Node.js to run it on Windows?’.

Yep, you are right, I’m mad. This can’t be a good way to take software meant for Linux and cross compile it for Windows, right? That just made me all the more curious to see if it could work.

Sollya and all its dependencies

Sollya requires a bunch of different projects to work: libxml2, GMP, MPFR, MPFI, fplll, and lastly Sollya itself. So I first downloaded all of these, built them all from source, and built Sollya using gcc on Linux – just to test that I could build it.

Then, using Emscripten's emconfigure (which you place before the typical Linux call to ./configure, and which replaces any compiler usage with the Emscripten compiler emcc), we can try to build Sollya again – but for JavaScript!

So I started with libxml2, which worked! And then onto GMP – and explosions. Some stack overflowing pointed me to Compiling GMP/MPFR with Emscripten which states that for some reason (I didn’t dig into why) Emscripten couldn’t compile GMP if the host platform was not 32 bits. I looked at the answer where it suggests you chroot and thought ‘that seems like a lot of work to mess with my current 64-bit VM that I do other things on, I’ll fire up a new VM to mess with’. But since I’m creating a new VM anyway, I decided to just create a 32-bit Ubuntu VM and use that instead (which meant less configuration work on my part).

So with my 32-bit VM, I started the whole process of compiling libxml2, gmp, mpfr, mpfi, fplll (wow I’m on a roll!) and finally I get to Sollya and… it failed.

Sollya and dlopen

Sollya makes use of dlopen, and thus the ./configure script in Sollya will check that dlopen is a command that works on the target platform. The problem is, ./configure doesn’t use the correct signature for the dlopen call – it just does:

 extern char dlopen();

and then ensures that the linker doesn’t complain when this is linked against -ldl. The signature of dlopen is:

void* dlopen(const char*, int);

and Emscripten looks for that exact signature, and complains if the function doesn’t have the correct number and type of arguments. This meant as far as ./configure was concerned, the system didn’t have dlopen (even though Emscripten can stub implement it), and it failed.
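
For illustration, a check that declared dlopen with its real signature would have kept Emscripten happy – something like this sketch (not what autoconf actually generates):

#include <dlfcn.h>

int main(void) {
  // taking the address of dlopen through its proper signature forces a
  // link against it, without the bogus 'extern char dlopen();' declaration
  void *(*maybeDlopen)(const char *, int) = dlopen;
  return 0 != maybeDlopen ? 0 : 1;
}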

Ever the hacker, I just decided to patch the ./configure to not error out:

sed -i -e "s/as_fn_error .. \"libdl unusable\"/echo \"skipped\"\#/" ./sollya/configure

tried to build again, and Sollya built!

Emscripten and .bc’s

Emscripten seems to output an LLVM bitcode (.bc) file by default – and I couldn’t work out how to tell emconfigure to output a JavaScript file instead.

So what I did was take the bitcode file that was in ‘sollya’ and used emcc directly to turn this into a JavaScript file.

emcc complained if the input bitcode file wasn’t named <something>.bc, so I first renamed it to sollya.bc:

cp sollya sollya.bc
emcc sollya.bc -o sollya.js

and I got a whopping 27MB JavaScript file out!

Next I used node to run this JavaScript against a simple test script I wrote:

print("Single precision:");
r=[1/sqrt(2)-1; sqrt(2)-1];
f=log2(1+x)/x;
p=fpminimax(f, 11, [|single...|], r, floating, relative);
p;
print("\nDouble precision:");
p=fpminimax(f, 21, [|double...|], r, floating, relative);
p;

and ran Node.js:

nodejs sollya.js < script.sollya

and it ran!

But it kept on running, like infinite-loop running – the thing just never stopped. I was getting a ton of ‘sigaction not implemented’ messages, so I wondered if Sollya was doing something really ugly with signals to handle exiting from a script. I thought about digging into it, then realised Sollya has an explicit ‘quit;’ command, so I added that to the bottom of the script:

print("Single precision:");
r=[1/sqrt(2)-1; sqrt(2)-1];
f=log2(1+x)/x;
p=fpminimax(f, 11, [|single...|], r, floating, relative);
p;
asd;
print("\nDouble precision:");
p=fpminimax(f, 21, [|double...|], r, floating, relative);
p;
quit;

and it ran and exited as expected.

> Single precision:
> Warning: at least one of the given expressions is not a constant but requires evaluation.
Evaluation is guaranteed to ensure the inclusion property. The approximate result is at least 165 bit accurate.
> > > 1.44269502162933349609375 + x * (-0.7213475704193115234375 + x * (0.4809020459651947021484375 + x * (-0.360668718814849853515625 + x * (0.2883343398571014404296875 + x * (-0.24055089056491851806640625 + x * (0.21089743077754974365234375 + x * (-0.1813324391841888427734375 + x * (0.10872711241245269775390625 + x * (-0.10412885248661041259765625 + x * (0.35098421573638916015625 + x * (-0.383228302001953125)))))))))))
> Warning: the identifier "asd" is neither assigned to, nor bound to a library function nor external procedure, nor equal to the current free variable.
Will interpret "asd" as "x".
x
> 
Double precision:
> > 1.44269504088896338700465094007086008787155151367187 + x * (-0.72134752044448169350232547003543004393577575683594 + x * (0.4808983469630028761976348050666274502873420715332 + x * (-0.36067376022224723053355432966782245784997940063477 + x * (0.288539008174513611493239295668900012969970703125 + x * (-0.24044917347913088989663776828820118680596351623535 + x * (0.20609929188248227172053361755388323217630386352539 + x * (-0.18033688048265933412395156665297690778970718383789 + x * (0.160299431107568057797152505372650921344757080078125 + x * (-0.144269475404082331282396012284152675420045852661133 + x * (0.13115467750388201673139576541871065273880958557129 + x * (-0.120225818807988840686284959247132064774632453918457 + x * (0.110964912764316969706612781010335311293601989746094 + x * (-0.103018221150312991318820365904684877023100852966309 + x * (9.6317404417675320238423353202961152419447898864746e-2 + x * (-9.0652910508716211257507211485062725841999053955078e-2 + x * (8.4035326134831819788750806310417829081416130065918e-2 + x * (-7.5783141066360651394440139938524225726723670959473e-2 + x * (7.650699022117241065998882731946650892496109008789e-2 + x * (-9.2331285631306825312236696845502592623233795166016e-2 + x * (8.7941823766079466051515112212655367329716682434082e-2 + x * (-3.8635539215562890447142052607887308113276958465576e-2)))))))))))))))))))))

So now I have a JavaScript file that works when I run it through Node.js, but we’ve got a couple of issues:

  • The JavaScript is freaking huge!
  • We don’t want to require Node.js to be installed either for our developers.

File size

Digging into Emscripten I found that there were a couple of options we could use:

  • -O3 – as with all compilers, we can specify that the compiler should optimize the code heavily.
  • --llvm-lto 2 – this enables all the optimizations to occur on the entire set of bitcode files once they are all linked together. This will allow for a ton more inlining to take place, which should help our performance.

Adding both these options, the size of the produced sollya.js was 4.1MB! A whopping 6.5x reduction in file size – and it's actually optimized properly now too.

Creating a standalone Windows binary?

So I’ve got sollya.js – and I can run this with Node.js on Windows and get actual valid polynomials. But I really want a standalone executable that has no dependencies, is this possible? Searching around, I found out about nexe – a way to bundle a Node.js application into a single executable. It basically puts Node.js and the JavaScript file into the same executable, and calls Node.js on the JavaScript at runtime. While this isn’t amazing – would it work?

First off – you have to use nexe on the platform you want to run the end executable on – so I copied the sollya.js from my VM to my Windows host, and then after installing nexe I ran:

nexe -i sollya.js -f -o sollya.exe

And what do you know – I can run sollya.exe and it works as expected. The downside is that because the executable is shipping an entire copy of Node.js with it – sollya.exe is a whopping 29MB to ship around.

Performance

I’ve compared the natively compiled sollya executable with the JavaScript variant. I ran them 50 times, and averaged out the results.

sollya      sollya.js   JS vs native performance
1.37144s    4.93946s    3.6x slower

So as expected – given that we are running through JavaScript and Node.js, we are 3.6x slower than the natively compiled executable. I’m honestly surprised we are not slower (I’d heard horror stories of 90x slowdowns with Emscripten) so this seems not too bad to me.

Conclusion

It seems that with Emscripten, in combination with Node.js and Nexe, we can compile a program on Linux to be run entirely on Windows – which is pretty freaking cool. There are probably many other more sane ways to do exactly this, but I find it pretty amazing that this is even possible. Now I can ‘natively’ run a Windows executable which will calculate all the polynomial approximations I need on Windows too – saving our team from having to have a Linux VM when re-generating the polynomials is required.

CMake script to build Sollya with Emscripten

In case anyone is interested, I use a CMake file to bring in all the dependencies and build Sollya using Emscripten.

cmake_minimum_required(VERSION 3.4)
project(emsollya)

include(ExternalProject)

ExternalProject_Add(libxml2
  PREFIX ${CMAKE_BINARY_DIR}/libxml2
  URL ftp://xmlsoft.org/libxml2/libxml2-git-snapshot.tar.gz
  PATCH_COMMAND NOCONFIGURE=1 sh ${CMAKE_BINARY_DIR}/libxml2/src/libxml2/autogen.sh
  CONFIGURE_COMMAND emconfigure ${CMAKE_BINARY_DIR}/libxml2/src/libxml2/configure
    --disable-shared --without-python --prefix=${CMAKE_BINARY_DIR}/libxml2
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Add(gmp
  PREFIX ${CMAKE_BINARY_DIR}/gmp
  URL https://gmplib.org/download/gmp/gmp-6.1.2.tar.bz2
  CONFIGURE_COMMAND emconfigure ${CMAKE_BINARY_DIR}/gmp/src/gmp/configure
    --disable-assembly --enable-cxx --disable-shared
    --prefix=${CMAKE_BINARY_DIR}/gmp
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Add(mpfr
  DEPENDS gmp
  PREFIX ${CMAKE_BINARY_DIR}/mpfr
  URL http://www.mpfr.org/mpfr-current/mpfr-3.1.6.tar.bz2
  CONFIGURE_COMMAND emconfigure ${CMAKE_BINARY_DIR}/mpfr/src/mpfr/configure
    --disable-shared --with-gmp=${CMAKE_BINARY_DIR}/gmp
    --prefix=${CMAKE_BINARY_DIR}/mpfr
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Add(mpfi
  DEPENDS gmp mpfr
  PREFIX ${CMAKE_BINARY_DIR}/mpfi
  URL https://gforge.inria.fr/frs/download.php/file/30129/mpfi-1.5.1.tar.bz2
  CONFIGURE_COMMAND emconfigure ${CMAKE_BINARY_DIR}/mpfi/src/mpfi/configure
    --disable-shared --with-gmp=${CMAKE_BINARY_DIR}/gmp
    --with-mpfr=${CMAKE_BINARY_DIR}/mpfr
    --prefix=${CMAKE_BINARY_DIR}/mpfi
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Add(fplll
  DEPENDS gmp mpfr
  PREFIX ${CMAKE_BINARY_DIR}/fplll
  GIT_REPOSITORY https://github.com/fplll/fplll.git
  GIT_TAG cd47f76b017762317245de7878c7b41eff9ab5d0
  PATCH_COMMAND sh ${CMAKE_BINARY_DIR}/fplll/src/fplll/autogen.sh
  CONFIGURE_COMMAND emconfigure ${CMAKE_BINARY_DIR}/fplll/src/fplll/configure
    --disable-shared --with-gmp=${CMAKE_BINARY_DIR}/gmp
    --with-mpfr=${CMAKE_BINARY_DIR}/mpfr
    --prefix=${CMAKE_BINARY_DIR}/fplll
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Add(sollya
  DEPENDS gmp mpfr mpfi fplll libxml2
  PREFIX ${CMAKE_BINARY_DIR}/sollya
  URL http://sollya.gforge.inria.fr/sollya-weekly-11-05-2017.tar.bz2
  PATCH_COMMAND sed -i -e "s/as_fn_error .. \"libdl unusable\"/echo \"skipped\"\#/"
    ${CMAKE_BINARY_DIR}/sollya/src/sollya/configure
  CONFIGURE_COMMAND EMCONFIGURE_JS=1 emconfigure
    ${CMAKE_BINARY_DIR}/sollya/src/sollya/configure
    --disable-shared --with-gmp=${CMAKE_BINARY_DIR}/gmp
    --with-fplll=${CMAKE_BINARY_DIR}/fplll
    --with-mpfi=${CMAKE_BINARY_DIR}/mpfi
    --with-mpfr=${CMAKE_BINARY_DIR}/mpfr
    --with-xml2=${CMAKE_BINARY_DIR}/libxml2
    --prefix=${CMAKE_BINARY_DIR}/sollya
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Get_Property(sollya BINARY_DIR)

add_custom_command(OUTPUT ${CMAKE_BINARY_DIR}/sollya.js
  COMMAND cmake -E copy ${BINARY_DIR}/sollya ${CMAKE_BINARY_DIR}/sollya.bc
  COMMAND emcc --memory-init-file 0 -O3 --llvm-lto 2 ${CMAKE_BINARY_DIR}/sollya.bc -o ${CMAKE_BINARY_DIR}/sollya.js
  DEPENDS ${BINARY_DIR}/sollya
)

add_custom_target(sollya_js ALL DEPENDS ${CMAKE_BINARY_DIR}/sollya.js)

add_dependencies(sollya_js sollya)

06 Nov

LLVM & CRT – auto-magically selecting the correct CRT

LLVM comes with a really useful set of options, LLVM_USE_CRT_<config>, which allow you to specify a different C RunTime (CRT) when compiling with Visual Studio. If you want to be able to compile LLVM as a release build, but compile some code that uses it in debug (e.g. our ComputeAorta product that allows customers to implement OpenCL/Vulkan on their hardware insanely quickly), Visual Studio will complain about mixing differing versions of the CRT. By using the LLVM_USE_CRT_<config> options, we can specify that LLVM compiles as a release build, but using a debug CRT.

There is one annoying catch with this though – LLVM is expensive to build. We'll average 10 minutes of build time for a full build of LLVM. We don't want to recompile LLVM, and we don't want to be constantly building different copies of LLVM for 2-4 different versions of the CRT every time we pull in the latest commits. We want to be able to change a ComputeAorta build from debug/release without having to rebuild LLVM, and we want all this to just work™ without any manual input from a developer.

Changing LLVM

So what we need to do is detect which CRT LLVM was built against. My first thought was to have LLVM export which CRT it was built against into an LLVM install. LLVM already outputs an LLVMConfig.cmake during its install process, so why not just record what CRT was used too? I contacted the LLVM mailing list asking... and got no response. I've found that, in general, if you are not a super active contributor located in the Bay Area this is a common occurrence. Not wanting to be that guy that nags on the mailing list about things no-one else clearly cares about, how else could I solve it?

Detecting the CRT

So I reasoned that since the Visual Studio linker could detect and give me a good error message when I was accidentally mixing CRT versions, there must be some information recorded in the library files produced by Visual Studio that says which CRT the library was linked against. Using dumpbin.exe (which is included with Visual Studio) I first called:

$ dumpbin /? 
Microsoft (R) COFF/PE Dumper Version 14.00.24215.1 
Copyright (C) Microsoft Corporation. All rights reserved. 
 
usage: DUMPBIN [options] [files] 
 
 options: 
 
 /ALL 
 /ARCHIVEMEMBERS 
 /CLRHEADER 
 /DEPENDENTS 
 /DIRECTIVES 
 /DISASM[:{BYTES|NOBYTES}] 
 /ERRORREPORT:{NONE|PROMPT|QUEUE|SEND} 
 /EXPORTS 
 /FPO 
 /HEADERS 
 /IMPORTS[:filename] 
 /LINENUMBERS 
 /LINKERMEMBER[:{1|2}] 
 /LOADCONFIG 
 /NOLOGO 
 /OUT:filename 
 /PDATA 
 /PDBPATH[:VERBOSE] 
 /RANGE:vaMin[,vaMax] 
 /RAWDATA[:{NONE|1|2|4|8}[,#]] 
 /RELOCATIONS 
 /SECTION:name 
 /SUMMARY 
 /SYMBOLS 
 /TLS 
 /UNWINDINFO

And through a process of elimination I ran the ‘/DIRECTIVES’ command against one of the .lib files in LLVM which gave:

$ dumpbin /DIRECTIVES LLVMCore.lib
Microsoft (R) COFF/PE Dumper Version 14.00.24215.1
Copyright (C) Microsoft Corporation.  All rights reserved.


Dump of file LLVMCore.lib

File Type: LIBRARY

   Linker Directives
   -----------------
   /FAILIFMISMATCH:_MSC_VER=1900
   /FAILIFMISMATCH:_ITERATOR_DEBUG_LEVEL=2
   /FAILIFMISMATCH:RuntimeLibrary=MDd_DynamicDebug
   /DEFAULTLIB:msvcprtd
   /FAILIFMISMATCH:_CRT_STDIO_ISO_WIDE_SPECIFIERS=0
   /FAILIFMISMATCH:LLVM_ENABLE_ABI_BREAKING_CHECKS=1
   /DEFAULTLIB:MSVCRTD
   /DEFAULTLIB:OLDNAMES

...

And what do you know ‘/FAILIFMISMATCH:RuntimeLibrary=MDd_DynamicDebug’ is telling the linker to output an error message if the CRT is not the dynamic debug variant! So now I have a method of detecting the CRT from one of LLVM’s libraries, how to incorporate that in our build?

CMake Integration

LLVM uses CMake for its builds, and thus we also use CMake for our builds. We already include LLVM by specifying the location of an LLVM install like:

$ cmake -DCA_LLVM_INSTALL_DIR=<directory> .
-- Overriding option 'CA_LLVM_INSTALL_DIR' to '<directory>' (default was '').

And then within our CMake we do:

# Setup LLVM/Clang search paths.
list(APPEND CMAKE_MODULE_PATH
  ${CA_LLVM_INSTALL_DIR}/lib/cmake/llvm
  ${CA_LLVM_INSTALL_DIR}/lib/cmake/clang)

# Include LLVM.
include(LLVMConfig)

# Include Clang.
include(ClangTargets)

So I added a new DetectLLVMMSVCCRT.cmake to our CMake modules and included it just after the ClangTargets include. This does the following:

  • Get the directory of CMAKE_C_COMPILER (always cl.exe in our case).
  • Look for dumpbin.exe in the same directory.
  • Get the location of LLVMCore.lib.
    • My reasoning is that most libraries in LLVM could change over time, but the core library of LLVM is unlikely to be moved (I hope!).
  • Run dumpbin /DIRECTIVES LLVMCore.lib
    • Find the first usage of ‘/FAILIFMISMATCH:RuntimeLibrary=’
    • Get the string that occurs between ‘/FAILIFMISMATCH:RuntimeLibrary=’ and the next ‘_’

And then we've got the CRT we need to build with. To actually set the CRT to use, we can just call LLVM's ChooseMSVCCRT.cmake (which ships in an LLVM install), specifying the LLVM_USE_CRT_<config> variables, and voila – we'll be using the same CRT as LLVM, and get no linker errors!

The full CMake script is:

if(NOT CMAKE_SYSTEM_NAME STREQUAL Windows)
  return()
endif()

# Get the directory of cl.exe
get_filename_component(tools_dir "${CMAKE_C_COMPILER}" DIRECTORY)

# Find the dumpbin.exe executable in the directory of cl.exe
find_program(dumpbin "dumpbin.exe" PATHS "${tools_dir}" NO_DEFAULT_PATH)

if("${dumpbin}" STREQUAL "dumpbin-NOTFOUND")
  message(WARNING "Could not detect which CRT LLVM was built against - "
                  "could not find 'dumpbin.exe'.")
  return()
endif()

# Get the location in the file-system of LLVMCore.lib
get_target_property(llvmcore LLVMCore LOCATION)

if("${llvmcore}" STREQUAL "llvmcore-NOTFOUND")
  message(WARNING "Could not detect which CRT LLVM was built against - "
                  "could not find location of 'LLVMCore.lib'.")
  return()
endif()

# Get the directives that LLVMCore.lib contains
execute_process(COMMAND "${dumpbin}" "/DIRECTIVES" "${llvmcore}"
  OUTPUT_VARIABLE output)

# Find the first directive specifying what CRT to use
string(FIND "${output}" "/FAILIFMISMATCH:RuntimeLibrary=" position)

# Strip away everything but the directive we want to examine
string(SUBSTRING "${output}" ${position} 128 output)

# Remove the directive prefix which we don't need
string(REPLACE "/FAILIFMISMATCH:RuntimeLibrary=" "" output "${output}")

# Get the position of the '_' character that breaks the CRT from all else
string(FIND "${output}" "_" position)

# Substring output to be one of the four CRT values: MDd MD MTd MT
string(SUBSTRING "${output}" 0 ${position} output)

# Set all possible CMAKE_BUILD_TYPE's to the CRT that LLVM was linked against
set(LLVM_USE_CRT_DEBUG "${output}")
set(LLVM_USE_CRT_RELWITHDEBINFO "${output}")
set(LLVM_USE_CRT_MINSIZEREL "${output}")
set(LLVM_USE_CRT_RELEASE "${output}")

# Include the LLVM cmake module to choose the correct CRT
include(ChooseMSVCCRT)

Conclusion

We’ve been able to do what we set out to do – auto-magically make our project that uses an LLVM install work reliably even with mixed Debug/Release builds. This has reduced the number of LLVM compiles I do daily by 2x (yay) and also allowed me to stop tracking (and caring) about CRT conflicts and how to avoid them.

28 Oct

Slides from my Khronos Munich Chapter talk

I gave a talk on Friday 13th of October 2017 at the Khronos Munich Chapter titled ‘OpenCL to Vulkan: A Porting Guide’. I covered how to port from the OpenCL API to the Vulkan API, some common problems our customers have faced, and how to fix them. The slides are available here.

The talk covered some of the major pitfalls our customers have had in porting OpenCL applications to Vulkan, and also briefly covered the work we did in collaboration with Google and Adobe – clspv.

I hope the slide deck is useful to those of you who couldn’t attend in person.

07 Sep

I’m speaking at the Munich Khronos Chapter Meeting 13th October 2017

Previously I had begun a series of blog posts detailing how to port applications from OpenCL -> Vulkan.

  1. OpenCL -> Vulkan: A Porting Guide (#1)
  2. OpenCL -> Vulkan: A Porting Guide (#2)
  3. OpenCL -> Vulkan: A Porting Guide (#3)

Instead of continuing this blog series, I’m converting the entire contents into a slide deck, and will be presenting it at the Munich Khronos Chapter meeting on the 13th of October 2017.

So please come along and watch me, and the other great speakers, talk about some fun things you can do with Vulkan!

Look forward to seeing y’all there.

29 Jun

OpenCL -> Vulkan: A Porting Guide (#3)

Vulkan is the newest kid on the block when it comes to cross-platform, widely supported, GPGPU compute. Vulkan’s primacy as the high performance rendering API powering the latest versions of Android, coupled with Windows and Linux desktop drivers from all major vendors means that we have a good way to run compute workloads on a wide range of devices.

OpenCL is the venerable old boy of GPGPU these days – having been around since 2009. A huge variety of software projects have made use of OpenCL as their way to run compute workloads enabling them to speed up their applications.

Given Vulkan’s rising prominence, how does one port from OpenCL to Vulkan?

This is a series of blog posts on how to port from OpenCL to Vulkan:

  1. OpenCL -> Vulkan: A Porting Guide (#1)
  2. OpenCL -> Vulkan: A Porting Guide (#2)

In this post, we’ll cover the different queue synchronization mechanisms in OpenCL and Vulkan.

clFinish vs vkWaitForFences

In the previous post I explained that an OpenCL queue (cl_command_queue) was an amalgamation of two distinct concepts:

  1. A collection of workloads to run on some hardware
  2. A thing that will run various workloads and allow interactions between them

Whereas Vulkan uses a VkCommandBuffer for 1, and a VkQueue for 2.

One common synchronization users want to do is let a queue execute a bunch of work, and wait for all that work to be done.

In OpenCL, you can wait on all previously submitted commands to a queue by using clFinish.

cl_command_queue queue; // previously created

// submit work to the queue
if (CL_SUCCESS != clFinish(queue)) {
  // ... error!
}

In Vulkan, because a queue is just a thing to run workloads on, we instead have to wait on the command buffer itself to complete. This is done via a VkFence which is specified when submitting work to a VkQueue.

VkCommandBuffer commandBuffer; // previously created
VkFence fence; // previously created

// submit work to the commandBuffer

VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,
  0,
  0,
  0,
  1,
  &commandBuffer,
  0,
  0,
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    fence)) {
  // ... error!
}

if (VK_SUCCESS != vkWaitForFences(
    device,
    1,
    &fence,
    VK_TRUE,
    UINT64_MAX)) {
  // ... error!
}

One thing to note is that you can wait on a Vulkan queue to finish all submitted workloads, but remember the difference between Vulkan queues and OpenCL queues. Vulkan queues are retrieved from a device. If multiple parts of your code (including third party libraries) retrieve the same Vulkan queue and are executing workloads on it, you will end up waiting for someone else's work to complete.

TL;DR – waiting on a queue in Vulkan is not the same as OpenCL.

Dependencies within a cl_command_queue / VkCommandBuffer

Both OpenCL and Vulkan have mechanisms to ensure a command will only begin executing once another command has completed.

Firstly, remember that an OpenCL command queue is in order by default. This means that each command you submit into the queue will only begin executing once the preceding command has completed. While this isn't ideal for performance in a number of situations, it is advantageous for users to get up and running in a safe and quick manner.

OpenCL also allows command queues to be out of order. This means that commands submitted to a queue are guaranteed to be dispatched in order, but they may run concurrently and/or complete out of order.

With an out of order OpenCL queue, to make one command wait on another before it begins executing, you use a cl_event to create a dependency between the two commands.

cl_mem bufferA, bufferB, bufferC; // previously created

cl_event event;

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    bufferA,
    bufferB,
    0,
    0,
    42,
    0,
    nullptr,
    &event)) {
  // ... error!
}

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    bufferB,
    bufferC,
    0,
    0,
    42,
    1,
    &event,
    nullptr)) {
  // ... error!
}

We can guarantee that if queue above was an out of order queue, the commands would still be executed in order because we expressed the dependency between both commands.

In Vulkan, queues are out of order. There is also no exactly matching mechanism to make two arbitrary commands depend on one another. Vulkan relies on more knowledge of what you are actually trying to do to create the right kind of synchronization between commands.

The easiest (though by no means the most performant) way to map OpenCL code with an event dependency between two commands – or OpenCL code whose queue was created in order – is to have a separate Vulkan command buffer for each command. While this might seem crude, it'll allow you to use another of Vulkan's synchronization mechanisms to solve the problem – the semaphore.

VkBuffer bufferA, bufferB, bufferC; // previously created
VkCommandBuffer commandBuffer1; // previously created
VkCommandBuffer commandBuffer2; // previously created

VkSemaphoreCreateInfo semaphoreCreateInfo = {
  VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
  nullptr,
  0
};

VkSemaphore semaphore;

if (VK_SUCCESS != vkCreateSemaphore(
    device,
    &semaphoreCreateInfo,
    nullptr,
    &semaphore)) {
  // ... error!
}

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer1,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer1, bufferA, bufferB, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer1)) {
  // ... error!
}
VkSubmitInfo submitInfo1 = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,
  0,
  0,
  0,
  1,
  &commandBuffer1,
  1,
  &semaphore,
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo1,
    nullptr)) {
  // ... error!
}

VkPipelineStageFlags pipelineStageFlags =
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer2,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer2, bufferB, bufferC, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer2)) {
  // ... error!
}

VkSubmitInfo submitInfo2 = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,
  1,
  &semaphore,
  &pipelineStageFlags,
  1,
  &commandBuffer2,
  0,
  0,
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo2,
    nullptr)) {
  // ... error!
}

A Vulkan semaphore allows you to express dependencies between command buffers. So by placing each command into a command buffer we can use a semaphore between these command buffers to emulate the OpenCL behaviour of in order queues and arbitrary command dependencies.

As with everything in Vulkan – the way to get performance is to explain to the driver exactly what you intend to do. In our example where we are copying data from buffer A -> buffer B -> buffer C above, we are basically creating a dependency on our usage of buffer B. The copy from buffer B -> buffer C cannot begin until the copy from buffer A -> buffer B has completed. So Vulkan gives us the tools to tell the driver about this dependency explicitly, and we can use them within a single command buffer.

The most analogous approach to the OpenCL example is to use a Vulkan event to encode the dependency.

VkEventCreateInfo eventCreateInfo = {
  VK_STRUCTURE_TYPE_EVENT_CREATE_INFO,
  nullptr,
  0
};

VkEvent event;

if (VK_SUCCESS != vkCreateEvent(
    device,
    &eventCreateInfo,
    nullptr,
    &event)) {
  // ... error!
}

Note that we create the event explicitly with Vulkan, unlike in OpenCL where any clEnqueue* command takes an optional output event as its last parameter.

VkBuffer bufferA, bufferB, bufferC; // previously created
VkCommandBuffer commandBuffer; // previously created

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer, bufferA, bufferB, 1, &bufferCopy);

vkCmdSetEvent(
    commandBuffer, 
    event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT);

VkMemoryBarrier memoryBarrier = {
  VK_STRUCTURE_TYPE_MEMORY_BARRIER,
  nullptr,
  VK_ACCESS_MEMORY_WRITE_BIT,
  VK_ACCESS_MEMORY_READ_BIT
};

vkCmdWaitEvents(
    commandBuffer,
    1,
    &event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    1,
    &memoryBarrier,
    0,
    nullptr,
    0,
    nullptr); // note: vkCmdWaitEvents returns void, so there is no result to check

vkCmdCopyBuffer(commandBuffer, bufferB, bufferC, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer)) {
  // ... error!
}
VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,
  0,
  0,
  0,
  1,
  &commandBuffer,
  0,
  0,
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    nullptr)) {
  // ... error!
}

So to do a similar thing to OpenCL’s event chaining semantics we:

  1. add our buffer A -> buffer B copy command
  2. set an event that will trigger when all previous commands are complete, in our case the current set of all previous commands is the one existing copy buffer command
  3. wait for the previous event to complete, specifying that all memory operations that performed a write before this wait must be resolved, and that all read operations after this event can read them
  4. add our buffer B -> buffer C copy command

Now we can be even more explicit with Vulkan and specifically use VK_ACCESS_TRANSFER_READ_BIT and VK_ACCESS_TRANSFER_WRITE_BIT – but I’m using the much more inclusive VK_ACCESS_MEMORY_READ_BIT and VK_ACCESS_MEMORY_WRITE_BIT to be clear what OpenCL will be doing implicitly for you as a user.
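
For completeness, here's what that more explicit memory barrier would look like for our copy -> copy dependency, using the transfer-specific access bits:

VkMemoryBarrier transferBarrier = {
  VK_STRUCTURE_TYPE_MEMORY_BARRIER,
  nullptr,
  VK_ACCESS_TRANSFER_WRITE_BIT, // the write performed by the first copy
  VK_ACCESS_TRANSFER_READ_BIT   // the read performed by the second copy
};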

Dependencies between multiple cl_command_queue’s / VkCommandBuffer’s

When synchronizing between multiple cl_command_queue’s in OpenCL we use the exact same mechanism as with one queue.

cl_mem bufferA, bufferB, bufferC; // previously created
cl_command_queue queue1, queue2; // previously created

cl_event event;

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue1,
    bufferA,
    bufferB,
    0,
    0,
    42,
    0,
    nullptr,
    &event)) {
  // ... error!
}

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue2,
    bufferB,
    bufferC,
    0,
    0,
    42,
    1,
    &event,
    nullptr)) {
  // ... error!
}

The command queue queue2 will not begin executing the copy buffer command until the first command queue queue1 has completed its execution. Having the same mechanism for creating dependencies within a queue and outwith a queue is a very nice thing from a user perspective – there is one true way to create a synchronization between commands in OpenCL.

In Vulkan, when we want to create a dependency between two VkCommandBuffer's, the easiest way is to use the semaphore approach I showed above. You could also use a VkEvent that is triggered at the end of one command buffer and waited on at the beginning of another. If you want to amortize the cost of doing multiple submits to the same queue, use the event approach.
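
As a sketch of that event approach (assuming commandBuffer1, commandBuffer2, event, and memoryBarrier have been created as in the earlier examples, and both command buffers go to the same queue) – the first command buffer sets the event as its last command, and the second waits on it as its first:

// the last command recorded into the first command buffer
vkCmdSetEvent(
    commandBuffer1,
    event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT);

// the first command recorded into the second command buffer
vkCmdWaitEvents(
    commandBuffer2,
    1,
    &event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    1,
    &memoryBarrier,
    0,
    nullptr,
    0,
    nullptr);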

You can also use semaphores to create dependencies between multiple Vulkan queues – note that an event can only express dependencies within a single queue, or between the host and a queue, so semaphores are the tool for cross-queue dependencies. Remember that a Vulkan queue can be thought of as an exposition of some physical concurrency in the hardware, or in other words, running things on two distinct queues concurrently can lead to a performance improvement.

I recommend using a semaphore as the mechanism to encode dependencies between command buffers for the most part, as it is simpler to get right.

The main place to use the event approach is when you have a long command buffer where, after only a few commands, you can set the event and unblock a subsequent command buffer that is waiting on it. In this case you'd be better off using an event, as that will enable the waiting work to begin executing much earlier than would otherwise be possible.

clEnqueueBarrierWithWaitList vs vkCmdPipelineBarrier

Both OpenCL and Vulkan have a barrier that acts as a memory and execution barrier. When you have a pattern whereby you have N commands that must have completed execution before another M commands begin, a barrier is normally the answer.

// N commands before here...

if (CL_SUCCESS != clEnqueueBarrierWithWaitList(
    queue,
    0,
    nullptr,
    nullptr)) {
  // ... error!
}

// M commands after here will only begin once
// the previous N commands have completed!

And the corresponding Vulkan:

VkMemoryBarrier memoryBarrier = {
  VK_STRUCTURE_TYPE_MEMORY_BARRIER,
  nullptr,
  VK_ACCESS_MEMORY_WRITE_BIT,
  VK_ACCESS_MEMORY_READ_BIT
};

// N commands before here...

vkCmdPipelineBarrier(
    commandBuffer,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    0, // no dependency flags
    1,
    &memoryBarrier,
    0,
    nullptr,
    0,
    nullptr); // note: like vkCmdWaitEvents, this returns void

// M commands after here will only begin once
// the previous N commands have completed!

What’s next?

After this monstrous dive into porting OpenCL’s synchronization mechanisms to Vulkan, in the next post we’ll look at the differences between OpenCL’s kernels and Vulkan’s pipelines – stay tuned!

16 Jun

OpenCL -> Vulkan: A Porting Guide (#2)

Vulkan is the newest kid on the block when it comes to cross-platform, widely supported, GPGPU compute. Vulkan’s primacy as the high performance rendering API powering the latest versions of Android, coupled with Windows and Linux desktop drivers from all major vendors means that we have a good way to run compute workloads on a wide range of devices.

OpenCL is the venerable old boy of GPGPU these days – having been around since 2009. A huge variety of software projects have made use of OpenCL as their way to run compute workloads enabling them to speed up their applications.

Given Vulkan’s rising prominence, how does one port from OpenCL to Vulkan?

This is a series of blog posts on how to port from OpenCL to Vulkan:

  1. OpenCL -> Vulkan: A Porting Guide (#1)

In this post, we’ll cover porting from OpenCL’s cl_command_queue to Vulkan’s VkQueue.

cl_command_queue -> VkCommandBuffer and VkQueue

OpenCL made a poor choice when cl_command_queue was designed. A cl_command_queue is an amalgamation of two very distinct things:

  1. A collection of workloads to run on some hardware
  2. A thing that will run various workloads and allow interactions between them

Vulkan broke this into the two constituent parts: for 1 we have a VkCommandBuffer, an encapsulation of one or more commands to run on a device, and for 2 we have a VkQueue, the thing that will actually run these commands and allow us to synchronize on the result.

Without diving too deeply, Vulkan’s approach allows for a selection of commands to be built once, and then run multiple times. For a huge number of compute workloads we run on datasets, we’re running the same set of commands thousands of times – and Vulkan allows us to amortise the cost of building up this collection of commands to run.

Back to OpenCL, we use clCreateCommandQueue (for pre 2.0) / clCreateCommandQueueWithProperties to create this amalgamated ‘collection of things I want you to run and a way of running them’. We’ll enable CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE as that is the behaviour of a Vulkan VkQueue (although remember that not all OpenCL devices actually support out of order queues – I’m doing this to allow the mental mapping of how Vulkan executes command buffers on queues to bake into your mind).

cl_queue_properties queueProperties[3] = {
    CL_QUEUE_PROPERTIES,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
    0
};

cl_int errorcode;

cl_command_queue queue = clCreateCommandQueueWithProperties(
    context,
    device,
    queueProperties,
    &errorcode);

if (CL_SUCCESS != errorcode) {
 // ... error!
}

The corresponding object in Vulkan is the VkQueue – which we get from the device, rather than creating as OpenCL does. This is because a queue in Vulkan is more like a physical aspect of the device, rather than some software construct – this isn't mandated in the specification, but it's a useful mental model to adopt when thinking about Vulkan's queues.

Remember that when we created our VkDevice we requested which queue families we wanted to use with the device? Now to actually get a queue that supports compute, we have to choose one of the queue family indices that supported compute, and get the corresponding VkQueue from that queue family.

VkQueue queue;

uint32_t queueFamilyIndex = UINT32_MAX;

for (uint32_t i = 0; i < queueFamilyProperties.size(); i++) {
  if (VK_QUEUE_COMPUTE_BIT & queueFamilyProperties[i].queueFlags) {
    queueFamilyIndex = i;
    break;
  }
}

if (UINT32_MAX == queueFamilyIndex) {
  // ... error!
}

vkGetDeviceQueue(device, queueFamilyIndex, 0, &queue);

clEnqueue* vs vkCmd*

To actually execute something on a device, OpenCL uses commands that begin with clEnqueue* – such a command will enqueue work onto a command queue and possibly begin executing it. Why possibly? OpenCL is utterly vague on when commands actually begin executing. The specification states that a call to clFlush, clFinish, or clWaitForEvents on an event that is being signalled by a previously enqueued command on a command queue will guarantee that the device has actually begun executing. It is entirely valid for an implementation to begin executing work when the clEnqueue* command is called, and equally valid for the implementation to delay until a bunch of clEnqueue* commands are in the queue and the corresponding clFlush/clFinish/clWaitForEvents is called.

cl_mem src, dst; // Two previously created buffers

cl_event event;
if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    src,
    dst,
    0, // src offset
    0, // dst offset
    42, // size in bytes to copy
    0,
    nullptr,
    &event)) {
  // ... error!
}

// If we were going to enqueue more stuff on the command queue,
// but wanted the above command to definitely begin execution,
// we'd call flush here.
if (CL_SUCCESS != clFlush(queue)) {
  // ... error!
}

// We could either call finish...
if (CL_SUCCESS != clFinish(queue)) {
  // ... error!
}

// ... or wait for the event we used!
if (CL_SUCCESS != clWaitForEvents(1, &event)) {
  // ... error!
}

In contrast, Vulkan requires us to submit all our commands into a VkCommandBuffer. First we need to create the command buffer.

VkCommandPoolCreateInfo commandPoolCreateInfo = {
  VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO,
  0,
  0,
  queueFamilyIndex
};

VkCommandPool commandPool;

if (VK_SUCCESS != vkCreateCommandPool(
    device,
    &commandPoolCreateInfo,
    0,
    &commandPool)) {
  // ... error!
}

VkCommandBufferAllocateInfo commandBufferAllocateInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
  0,
  commandPool,
  VK_COMMAND_BUFFER_LEVEL_PRIMARY,
  1 // We are creating one command buffer.
};

VkCommandBuffer commandBuffer;

if (VK_SUCCESS != vkAllocateCommandBuffers(
    device,
    &commandBufferAllocateInfo,
    &commandBuffer)) {
  // ... error!
}

Now we have our command buffer with which we can queue up commands to execute on a Vulkan queue.

VkBuffer src, dst; // Two previously created buffers

VkCommandBufferBeginInfo commandBufferBeginInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
  0,
  VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT,
  0
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer,
    &commandBufferBeginInfo)) {
  // ... error!
}

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

vkCmdCopyBuffer(commandBuffer, src, dst, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer)) {
  // ... error!
}

VkFenceCreateInfo fenceCreateInfo = {
  VK_STRUCTURE_TYPE_FENCE_CREATE_INFO,
  0,
  0
};

VkFence fence;

if (VK_SUCCESS != vkCreateFence(
    device,
    &fenceCreateInfo,
    0,
    &fence)) {
  // ... error!
}

VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,
  0,
  0,
  0,
  1,
  &commandBuffer,
  0,
  0,
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    fence)) {
  // ... error!
}

// We can either wait on our commands to complete by fencing...
if (VK_SUCCESS != vkWaitForFences(
    device,
    1,
    &fence,
    VK_TRUE,
    UINT64_MAX)) {
  // ... error!
}

// ... or waiting for the entire queue to have finished...
if (VK_SUCCESS != vkQueueWaitIdle(queue)) {
  // ... error!
}

// ... or even for the entire device to be idle!
if (VK_SUCCESS != vkDeviceWaitIdle(device)) {
  // ... error!
}

Vulkan gives us many more ways to synchronize on the host for when we are complete with our workload. We can specify a VkFence to our queue submission to wait on one or more command buffers in that submit, we can wait for the queue to be idle, or even wait for the entire device to be idle! Fences and command buffers can be reused by calling vkResetFences and vkResetCommandBuffer respectively – note that the command buffer can be reused for an entirely different set of commands to be executed. If you wanted to resubmit the exact same command buffer, you'd have to remove the VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT flag in the VkCommandBufferBeginInfo struct above.
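
As a sketch, resubmitting the same work (assuming the command buffer was recorded without the one time submit flag, and reusing the submitInfo and fence from above) looks like:

// make the fence unsignaled again so we can wait on it once more
if (VK_SUCCESS != vkResetFences(device, 1, &fence)) {
  // ... error!
}

// resubmit the same command buffer
if (VK_SUCCESS != vkQueueSubmit(queue, 1, &submitInfo, fence)) {
  // ... error!
}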

So a crucial thing to note here – synchronizing on a cl_command_queue is similar to a VkQueue, but the mechanisms are not identical.

We’ll cover these queue synchronization mechanisms in more detail in the next post in the series.