24 Dec

Introducing process.h!

I’ve long been after a cross-platform way to launch processes and interact with them for use from C & C++. After a brief scouring of the interwebs, I didn’t find any that matched my criteria:

  • No build system dependencies.
  • Works in C and C++.
  • Works on Windows, Linux and macOS.
  • Single header (or at least single header / single source file).

So I did what I always do when confronted with this issue – I wrote my own!

process.h is a cross-platform C & C++ single header library that works on Windows, Linux and macOS. It contains six functions that let you:

  • create a child process,
  • interact with the stdin of the child process,
  • interact with the stdout of the child process,
  • interact with the stderr of the child process,
  • wait for a child process to complete,
  • and destroy a child process.

There are some gotchas to know about:

  • Waiting on a process will close the FILE associated with the stdin of the child. I need to do this so that any child process that is waiting on and reading the entire contents of its stdin can finish.
  • You can destroy a process before it has completed – this will close the stdin, stdout, and stderr of the child process and free the handle of the child process. This allows a child process to outlive the parent process if required – you need to call process_join before process_destroy if you want to wait for and then destroy the launched process (see the sketch below).
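
To make those gotchas concrete, here’s a minimal sketch of launching a child process, reading its stdout, waiting on it, and then destroying it. The process_create and process_stdout names are my assumptions to match the process_join and process_destroy functions mentioned above – check the header for the actual API:

#include "process.h"

#include <stdio.h>

int main(void) {
  // NULL terminated command line of the child process to launch.
  const char *commandLine[] = {"echo", "hello", 0};

  struct process_s process;
  if (0 != process_create(commandLine, 0, &process)) {
    return 1; // failed to launch the child process
  }

  // Read what the child wrote to its stdout.
  FILE *childStdout = process_stdout(&process);
  char buffer[128];
  if (fgets(buffer, sizeof(buffer), childStdout)) {
    printf("child said: %s", buffer);
  }

  // Wait for the child to complete - this closes the child's stdin.
  int returnCode = 0;
  if (0 != process_join(&process, &returnCode)) {
    return 1;
  }

  // Free the handle of the child process.
  return process_destroy(&process);
}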

Hopefully this proves useful to some of my followers! I’ve recorded some todos of things I want to add to the library (which will include an interface change at a later date) – so stay tuned.

13 Nov

Cross compiling Sollya to Windows with Emscripten

One component of Codeplay’s ComputeAorta that I manage is our high precision maths library Abacus.

One major requirement of Abacus, and in fact of all math libraries, is to have Remez-reduced polynomial approximations of functions. In the past we’ve made use of Maple, Mathematica, lolremez, and our own fork of lolremez, and to be honest none of them have been satisfactory for our needs. We want a scriptable solution that we can use to bake the generated polynomials automatically into Abacus with minimal user intervention.

I was lucky enough to be involved in a twitter thread with Marc B. Reynolds where he pointed me at Sollya. It’s Linux only, which sucks (I’m primarily a Windows developer), but I fired up a VM and tried it out – and I’ve got to say, it’s pretty darn good! The lack of Windows support is a big issue though, so how to fix that?

Enter stage left – Emscripten!

So I’ve known about Emscripten for a while, but never had a really compelling reason to use it. I suddenly thought ‘I wonder if I could use Emscripten to compile Sollya to JavaScript, then use Node.js to run it on Windows?’.

Yep, you are right, I’m mad. This can’t be a good way to take software meant for Linux and cross compile it for Windows, right? That just made me all the more curious to see if it could work.

Sollya and all its dependencies

Sollya requires a bunch of different projects to work: libxml2, GMP, MPFR, MPFI, fplll, and lastly Sollya itself. So I first downloaded all of these, built them all from source, and built Sollya using gcc on Linux – just to test that I could build it.

Then, using Emscripten’s emconfigure (which you place before the typical Linux call to ./configure, and which replaces any compiler usage with the Emscripten compiler emcc), we can try to build Sollya again – but for JavaScript!
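
For an autotools-style project, the invocation looks something like this (the subsequent make then uses the emcc toolchain that configure recorded):

emconfigure ./configure
make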

So I started with libxml2, which worked! And then onto GMP – and explosions. Some Stack Overflow searching pointed me to Compiling GMP/MPFR with Emscripten, which states that for some reason (I didn’t dig into why) Emscripten couldn’t compile GMP if the host platform was not 32 bits. The answer suggests you chroot, which seemed like a lot of work to mess with my current 64-bit VM that I do other things on – I’d rather fire up a new VM to mess with. And since I was creating a new VM anyway, I decided to just make it a 32-bit Ubuntu VM (which meant less configuration work on my part).

So with my 32-bit VM, I started the whole process of compiling libxml2, gmp, mpfr, mpfi, fplll (wow I’m on a roll!) and finally I got to Sollya and… it failed.

Sollya and dlopen

Sollya makes use of dlopen, and thus the ./configure script in Sollya will check that dlopen is a function that works on the target platform. The problem is, ./configure doesn’t use the correct signature for the dlopen call – it just does:

 extern char dlopen();

and then ensures that the linker doesn’t complain when this is linked against -ldl. The signature of dlopen is:

void* dlopen(const char*, int);

and Emscripten looks for that exact signature, and complains if the function doesn’t have the correct number and type of arguments. This meant as far as ./configure was concerned, the system didn’t have dlopen (even though Emscripten can stub implement it), and it failed.
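
For reference, a configure check using the real prototype – something along these lines (a hypothetical sketch, not the actual autoconf test) – would survive Emscripten’s signature checking:

#include <dlfcn.h>

int main(void) {
  // Using the real prototype from dlfcn.h keeps emcc's signature
  // checking happy ("libfoo.so" is just a placeholder name).
  void *handle = dlopen("libfoo.so", RTLD_NOW);
  return 0 == handle;
}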

Ever the hacker, I just decided to patch the ./configure to not error out:

sed -i -e "s/as_fn_error .. \"libdl unusable\"/echo \"skipped\"\#/" ./sollya/configure

tried to build again, and Sollya built!

Emscripten and .bc’s

Emscripten seems to output an LLVM bitcode (.bc) file by default – and I couldn’t work out how to tell emconfigure to output a JavaScript file instead.

So what I did was take the bitcode file that was in ‘sollya’ and used emcc directly to turn this into a JavaScript file.

emcc complained if the input bitcode file wasn’t named <something>.bc, so I first renamed it to sollya.bc:

cp sollya sollya.bc
emcc sollya.bc -o sollya.js

and I got a whopping 27MB JavaScript file out!

Next I used node to run this JavaScript against a simple test script I wrote:

print("Single precision:");
r=[1/sqrt(2)-1; sqrt(2)-1];
f=log2(1+x)/x;
p=fpminimax(f, 11, [|single...|], r, floating, relative);
p;
print("\nDouble precision:");
p=fpminimax(f, 21, [|double...|], r, floating, relative);
p;

and ran Node.js:

nodejs sollya.js < script.sollya

and it ran!

But it kept on running, like infinite-loop running – the thing just never stopped. I was getting a ton of ‘sigaction not implemented’ messages, so I wondered if Sollya was doing something really ugly with signals to handle exiting from a script. I thought about digging into it, then realised Sollya has an explicit ‘quit;’ command, so I added that to the bottom of the script:

print("Single precision:");
r=[1/sqrt(2)-1; sqrt(2)-1];
f=log2(1+x)/x;
p=fpminimax(f, 11, [|single...|], r, floating, relative);
p;
asd;
print("\nDouble precision:");
p=fpminimax(f, 21, [|double...|], r, floating, relative);
p;
quit;

and it ran and exited as expected.

> Single precision:
> Warning: at least one of the given expressions is not a constant but requires evaluation.
Evaluation is guaranteed to ensure the inclusion property. The approximate result is at least 165 bit accurate.
> > > 1.44269502162933349609375 + x * (-0.7213475704193115234375 + x * (0.4809020459651947021484375 + x * (-0.360668718814849853515625 + x * (0.2883343398571014404296875 + x * (-0.24055089056491851806640625 + x * (0.21089743077754974365234375 + x * (-0.1813324391841888427734375 + x * (0.10872711241245269775390625 + x * (-0.10412885248661041259765625 + x * (0.35098421573638916015625 + x * (-0.383228302001953125)))))))))))
> Warning: the identifier "asd" is neither assigned to, nor bound to a library function nor external procedure, nor equal to the current free variable.
Will interpret "asd" as "x".
x
> 
Double precision:
> > 1.44269504088896338700465094007086008787155151367187 + x * (-0.72134752044448169350232547003543004393577575683594 + x * (0.4808983469630028761976348050666274502873420715332 + x * (-0.36067376022224723053355432966782245784997940063477 + x * (0.288539008174513611493239295668900012969970703125 + x * (-0.24044917347913088989663776828820118680596351623535 + x * (0.20609929188248227172053361755388323217630386352539 + x * (-0.18033688048265933412395156665297690778970718383789 + x * (0.160299431107568057797152505372650921344757080078125 + x * (-0.144269475404082331282396012284152675420045852661133 + x * (0.13115467750388201673139576541871065273880958557129 + x * (-0.120225818807988840686284959247132064774632453918457 + x * (0.110964912764316969706612781010335311293601989746094 + x * (-0.103018221150312991318820365904684877023100852966309 + x * (9.6317404417675320238423353202961152419447898864746e-2 + x * (-9.0652910508716211257507211485062725841999053955078e-2 + x * (8.4035326134831819788750806310417829081416130065918e-2 + x * (-7.5783141066360651394440139938524225726723670959473e-2 + x * (7.650699022117241065998882731946650892496109008789e-2 + x * (-9.2331285631306825312236696845502592623233795166016e-2 + x * (8.7941823766079466051515112212655367329716682434082e-2 + x * (-3.8635539215562890447142052607887308113276958465576e-2)))))))))))))))))))))

So now I have a JavaScript file that works when I run it through Node.js, but we’ve got a couple of issues:

  • The JavaScript is freaking huge!
  • We don’t want to require our developers to have Node.js installed either.

File size

Digging into Emscripten I found that there were a couple of options we could use:

  • -O3 – same as all compilers, we can specify that the compiler should optimize the code heavily.
  • --llvm-lto 2 – this enables all the optimizations to occur on the entire set of bitcode files once they are all linked together. This will allow for a ton more inlining to take place, which should help our performance (the combined invocation is shown below).
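
Combined with the earlier emcc command, the invocation becomes:

emcc -O3 --llvm-lto 2 sollya.bc -o sollya.js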

Adding both these options, the size of the produced sollya.js was 4.1MB! A whopping 6.5x reduction in file size – and it’s actually optimized properly now too.

Creating a standalone Windows binary?

So I’ve got sollya.js – and I can run this with Node.js on Windows and get actual valid polynomials. But I really want a standalone executable that has no dependencies – is this possible? Searching around, I found out about nexe – a way to bundle a Node.js application into a single executable. It basically puts Node.js and the JavaScript file into the same executable, and calls Node.js on the JavaScript at runtime. While this isn’t amazing – would it work?

First off – you have to use nexe on the platform you want to run the end executable on – so I copied the sollya.js from my VM to my Windows host, and then after installing nexe I ran:

nexe -i sollya.js -f -o sollya.exe

And what do you know – I can run sollya.exe and it works as expected. The downside is that because the executable is shipping an entire copy of Node.js with it – sollya.exe is a whopping 29MB to ship around.

Performance

I’ve compared the natively compiled sollya executable with the JavaScript variant. I ran them 50 times, and averaged out the results.

sollya (native)    sollya.js (Node.js)    JS vs Native Performance
1.37144s           4.93946s               3.6x slower

So as expected – given that we are running through JavaScript and Node.js, we are 3.6x slower than the natively compiled executable. I’m honestly surprised we are not slower (I’d heard horror stories of 90x slowdowns with Emscripten) so this seems not too bad to me.

Conclusion

It seems that with Emscripten, in combination with Node.js and Nexe, we can compile a program on Linux to be run entirely on Windows – which is pretty freaking cool. There are probably many other more sane ways to do exactly this, but I find it pretty amazing that this is even possible. Now I can ‘natively’ run a Windows executable which will calculate all the polynomial approximations I need on Windows too – saving our team from having to have a Linux VM when re-generating the polynomials is required.

CMake script to build Sollya with Emscripten

In case anyone is interested, I use a CMake file to bring in all the dependencies and build Sollya using Emscripten.

cmake_minimum_required(VERSION 3.4)
project(emsollya)

include(ExternalProject)

ExternalProject_Add(libxml2
  PREFIX ${CMAKE_BINARY_DIR}/libxml2
  URL ftp://xmlsoft.org/libxml2/libxml2-git-snapshot.tar.gz
  PATCH_COMMAND NOCONFIGURE=1 sh ${CMAKE_BINARY_DIR}/libxml2/src/libxml2/autogen.sh
  CONFIGURE_COMMAND emconfigure ${CMAKE_BINARY_DIR}/libxml2/src/libxml2/configure
    --disable-shared --without-python --prefix=${CMAKE_BINARY_DIR}/libxml2
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Add(gmp
  PREFIX ${CMAKE_BINARY_DIR}/gmp
  URL https://gmplib.org/download/gmp/gmp-6.1.2.tar.bz2
  CONFIGURE_COMMAND emconfigure ${CMAKE_BINARY_DIR}/gmp/src/gmp/configure
    --disable-assembly --enable-cxx --disable-shared
    --prefix=${CMAKE_BINARY_DIR}/gmp
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Add(mpfr
  DEPENDS gmp
  PREFIX ${CMAKE_BINARY_DIR}/mpfr
  URL http://www.mpfr.org/mpfr-current/mpfr-3.1.6.tar.bz2
  CONFIGURE_COMMAND emconfigure ${CMAKE_BINARY_DIR}/mpfr/src/mpfr/configure
    --disable-shared --with-gmp=${CMAKE_BINARY_DIR}/gmp
    --prefix=${CMAKE_BINARY_DIR}/mpfr
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Add(mpfi
  DEPENDS gmp mpfr
  PREFIX ${CMAKE_BINARY_DIR}/mpfi
  URL https://gforge.inria.fr/frs/download.php/file/30129/mpfi-1.5.1.tar.bz2
  CONFIGURE_COMMAND emconfigure ${CMAKE_BINARY_DIR}/mpfi/src/mpfi/configure
    --disable-shared --with-gmp=${CMAKE_BINARY_DIR}/gmp
    --with-mpfr=${CMAKE_BINARY_DIR}/mpfr
    --prefix=${CMAKE_BINARY_DIR}/mpfi
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Add(fplll
  DEPENDS gmp mpfr
  PREFIX ${CMAKE_BINARY_DIR}/fplll
  GIT_REPOSITORY https://github.com/fplll/fplll.git
  GIT_TAG cd47f76b017762317245de7878c7b41eff9ab5d0
  PATCH_COMMAND sh ${CMAKE_BINARY_DIR}/fplll/src/fplll/autogen.sh
  CONFIGURE_COMMAND emconfigure ${CMAKE_BINARY_DIR}/fplll/src/fplll/configure
    --disable-shared --with-gmp=${CMAKE_BINARY_DIR}/gmp
    --with-mpfr=${CMAKE_BINARY_DIR}/mpfr
    --prefix=${CMAKE_BINARY_DIR}/fplll
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Add(sollya
  DEPENDS gmp mpfr mpfi fplll libxml2
  PREFIX ${CMAKE_BINARY_DIR}/sollya
  URL http://sollya.gforge.inria.fr/sollya-weekly-11-05-2017.tar.bz2
  PATCH_COMMAND sed -i -e "s/as_fn_error .. \"libdl unusable\"/echo \"skipped\"\#/"
    ${CMAKE_BINARY_DIR}/sollya/src/sollya/configure
  CONFIGURE_COMMAND EMCONFIGURE_JS=1 emconfigure
    ${CMAKE_BINARY_DIR}/sollya/src/sollya/configure
    --disable-shared --with-gmp=${CMAKE_BINARY_DIR}/gmp
    --with-fplll=${CMAKE_BINARY_DIR}/fplll
    --with-mpfi=${CMAKE_BINARY_DIR}/mpfi
    --with-mpfr=${CMAKE_BINARY_DIR}/mpfr
    --with-xml2=${CMAKE_BINARY_DIR}/libxml2
    --prefix=${CMAKE_BINARY_DIR}/sollya
  BUILD_COMMAND make
  INSTALL_COMMAND make install
)

ExternalProject_Get_Property(sollya BINARY_DIR)

add_custom_command(OUTPUT ${CMAKE_BINARY_DIR}/sollya.js
  COMMAND cmake -E copy ${BINARY_DIR}/sollya ${CMAKE_BINARY_DIR}/sollya.bc
  COMMAND emcc --memory-init-file 0 -O3 --llvm-lto 2 ${CMAKE_BINARY_DIR}/sollya.bc -o ${CMAKE_BINARY_DIR}/sollya.js
  DEPENDS ${BINARY_DIR}/sollya
)

add_custom_target(sollya_js ALL DEPENDS ${CMAKE_BINARY_DIR}/sollya.js)

add_dependencies(sollya_js sollya)

06 Nov

LLVM & CRT – auto-magically selecting the correct CRT

LLVM comes with a really useful set of options LLVM_USE_CRT_<config> which allow you to specify a different C RunTime (CRT) when compiling with Visual Studio. If you want to compile LLVM as a release build, but compile some code that uses it in debug (e.g. our ComputeAorta product that allows customers to implement OpenCL/Vulkan on their hardware insanely quickly), Visual Studio will complain about mixing the differing versions of the CRT. By using the LLVM_USE_CRT_<config> options, we can specify that LLVM compiles as a release build, but against a debug CRT.
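
For example, configuring LLVM so that its release build links against the dynamic debug CRT looks something like this (directory elided):

$ cmake -DLLVM_USE_CRT_RELEASE=MDd <llvm-source-directory>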

There is one annoying catch with this though – LLVM is expensive to build. We average a 10 minute build time for a full build of LLVM. We don’t want to recompile LLVM, and we don’t want to be constantly building different copies of LLVM for 2-4 different versions of the CRT every time we pull in the latest commits. We want to be able to change a ComputeAorta build between debug and release without having to rebuild LLVM, and we want all this to just work™ without any manual input from a developer.

Changing LLVM

So what we need to do is detect which CRT LLVM was built against. My first thought was to have LLVM export which CRT it was built against into an LLVM install. LLVM already outputs an LLVMConfig.cmake during its install process, so why not just record what CRT was used too? I contacted the LLVM mailing list asking... and got no response. I’ve found that, in general, if you are not a super active contributor located in the Bay Area, this is a common occurrence. Not wanting to be that guy that nags on the mailing list about things no-one else clearly cares about, how else could I solve it?

Detecting the CRT

So I reasoned that since the Visual Studio linker could detect and give me a good error message when I was accidentally mixing CRT versions, there must be some information recorded in the library files produced by Visual Studio that says which CRT the library was linked against. Using dumpbin.exe (which is included with Visual Studio) I first called:

$ dumpbin /? 
Microsoft (R) COFF/PE Dumper Version 14.00.24215.1 
Copyright (C) Microsoft Corporation. All rights reserved. 
 
usage: DUMPBIN [options] [files] 
 
 options: 
 
 /ALL 
 /ARCHIVEMEMBERS 
 /CLRHEADER 
 /DEPENDENTS 
 /DIRECTIVES 
 /DISASM[:{BYTES|NOBYTES}] 
 /ERRORREPORT:{NONE|PROMPT|QUEUE|SEND} 
 /EXPORTS 
 /FPO 
 /HEADERS 
 /IMPORTS[:filename] 
 /LINENUMBERS 
 /LINKERMEMBER[:{1|2}] 
 /LOADCONFIG 
 /NOLOGO 
 /OUT:filename 
 /PDATA 
 /PDBPATH[:VERBOSE] 
 /RANGE:vaMin[,vaMax] 
 /RAWDATA[:{NONE|1|2|4|8}[,#]] 
 /RELOCATIONS 
 /SECTION:name 
 /SUMMARY 
 /SYMBOLS 
 /TLS 
 /UNWINDINFO

And through a process of elimination I ran the ‘/DIRECTIVES’ command against one of the .lib files in LLVM which gave:

$ dumpbin /DIRECTIVES LLVMCore.lib
Microsoft (R) COFF/PE Dumper Version 14.00.24215.1
Copyright (C) Microsoft Corporation.  All rights reserved.


Dump of file LLVMCore.lib

File Type: LIBRARY

   Linker Directives
   -----------------
   /FAILIFMISMATCH:_MSC_VER=1900
   /FAILIFMISMATCH:_ITERATOR_DEBUG_LEVEL=2
   /FAILIFMISMATCH:RuntimeLibrary=MDd_DynamicDebug
   /DEFAULTLIB:msvcprtd
   /FAILIFMISMATCH:_CRT_STDIO_ISO_WIDE_SPECIFIERS=0
   /FAILIFMISMATCH:LLVM_ENABLE_ABI_BREAKING_CHECKS=1
   /DEFAULTLIB:MSVCRTD
   /DEFAULTLIB:OLDNAMES

...

And what do you know ‘/FAILIFMISMATCH:RuntimeLibrary=MDd_DynamicDebug’ is telling the linker to output an error message if the CRT is not the dynamic debug variant! So now I have a method of detecting the CRT from one of LLVM’s libraries, how to incorporate that in our build?

CMake Integration

LLVM uses CMake for its builds, and thus we also use CMake for our builds. We already include LLVM by specifying the location of an LLVM install like:

$ cmake -DCA_LLVM_INSTALL_DIR=<directory> .
-- Overriding option 'CA_LLVM_INSTALL_DIR' to '<directory>' (default was '').

And then within our CMake we do:

# Setup LLVM/Clang search paths.
list(APPEND CMAKE_MODULE_PATH
  ${CA_LLVM_INSTALL_DIR}/lib/cmake/llvm
  ${CA_LLVM_INSTALL_DIR}/lib/cmake/clang)

# Include LLVM.
include(LLVMConfig)

# Include Clang.
include(ClangTargets)

So I added a new DetectLLVMMSVCCRT.cmake to our CMake modules and included it just after the ClangTargets include. This does the following:

  • Get the directory of CMAKE_C_COMPILER (always cl.exe in our case).
  • Look for dumpbin.exe in the same directory.
  • Get the location of LLVMCore.lib.
    • My reasoning is that most libraries in LLVM could change over time, but the core library of LLVM is unlikely to be moved (I hope!).
  • Run dumpbin /DIRECTIVES LLVMCore.lib
    • Find the first usage of ‘/FAILIFMISMATCH:RuntimeLibrary=’
    • Get the string that occurs between ‘/FAILIFMISMATCH:RuntimeLibrary=’ and the next ‘_’

And then we’ve got the CRT we need to use to build with. To actually set the CRT to use, we can just call LLVM’s ChooseMSVCCRT.cmake (that ships in an LLVM install), specifying the LLVM_USE_CRT_<config> variables and voila, we’ll be using the same CRT as LLVM, and get no linker errors!

The full CMake script is:

if(NOT CMAKE_SYSTEM_NAME STREQUAL Windows)
  return()
endif()

# Get the directory of cl.exe
get_filename_component(tools_dir "${CMAKE_C_COMPILER}" DIRECTORY)

# Find the dumpbin.exe executable in the directory of cl.exe
find_program(dumpbin "dumpbin.exe" PATHS "${tools_dir}" NO_DEFAULT_PATH)

if("${dumpbin}" STREQUAL "dumpbin-NOTFOUND")
  message(WARNING "Could not detect which CRT LLVM was built against - "
                  "could not find 'dumpbin.exe'.")
  return()
endif()

# Get the location in the file-system of LLVMCore.lib
get_target_property(llvmcore LLVMCore LOCATION)

if("${llvmcore}" STREQUAL "llvmcore-NOTFOUND")
  message(WARNING "Could not detect which CRT LLVM was built against - "
                  "could not find location of 'LLVMCore.lib'.")
  return()
endif()

# Get the directives that LLVMCore.lib contains
execute_process(COMMAND "${dumpbin}" "/DIRECTIVES" "${llvmcore}"
  OUTPUT_VARIABLE output)

# Find the first directive specifying what CRT to use
string(FIND "${output}" "/FAILIFMISMATCH:RuntimeLibrary=" position)

# Strip away everything but the directive we want to examine
string(SUBSTRING "${output}" ${position} 128 output)

# Remove the directive prefix which we don't need
string(REPLACE "/FAILIFMISMATCH:RuntimeLibrary=" "" output "${output}")

# Get the position of the '_' character that breaks the CRT from all else
string(FIND "${output}" "_" position)

# Substring output to be one of the four CRT values: MDd MD MTd MT
string(SUBSTRING "${output}" 0 ${position} output)

# Set all possible CMAKE_BUILD_TYPE's to the CRT that LLVM was linked against
set(LLVM_USE_CRT_DEBUG "${output}")
set(LLVM_USE_CRT_RELWITHDEBINFO "${output}")
set(LLVM_USE_CRT_MINSIZEREL "${output}")
set(LLVM_USE_CRT_RELEASE "${output}")

# Include the LLVM cmake module to choose the correct CRT
include(ChooseMSVCCRT)

Conclusion

We’ve been able to do what we set out to do – auto-magically make our project that uses an LLVM install work reliably even with mixed Debug/Release builds. This has reduced the number of LLVM compiles I do daily by 2x (yay) and also allowed me to stop tracking (and caring) about CRT conflicts and how to avoid them.

28 Oct

Slides from my Khronos Munich Chapter talk

I gave a talk on Friday 13th of October 2017 at the Khronos Munich Chapter titled ‘OpenCL to Vulkan: A Porting Guide’. I covered how to port from the OpenCL API to the Vulkan API, some common problems our customers have faced, and how to fix them. The slides are available here.

The talk covered some of the major pitfalls our customers have had in porting OpenCL applications to Vulkan, and also briefly covered the work we did in collaboration with Google and Adobe – clspv.

I hope the slide deck is useful to those of you who couldn’t attend in person.

07 Sep

I’m speaking at the Munich Khronos Chapter Meeting 13th October 2017

Previously I had begun a series of blog posts detailing how to port applications from OpenCL -> Vulkan.

  1. OpenCL -> Vulkan: A Porting Guide (#1)
  2. OpenCL -> Vulkan: A Porting Guide (#2)
  3. OpenCL -> Vulkan: A Porting Guide (#3)

Instead of continuing this blog series, I’m converting the entire contents into a slide deck, and will be presenting it at the Munich Khronos Chapter meeting on the 13th of October 2017.

So please come along and watch myself, and the other great speakers, talk about some fun things you can do with Vulkan!

Look forward to seeing y’all there.

29 Jun

OpenCL -> Vulkan: A Porting Guide (#3)

Vulkan is the newest kid on the block when it comes to cross-platform, widely supported, GPGPU compute. Vulkan’s primacy as the high performance rendering API powering the latest versions of Android, coupled with Windows and Linux desktop drivers from all major vendors means that we have a good way to run compute workloads on a wide range of devices.

OpenCL is the venerable old boy of GPGPU these days – having been around since 2009. A huge variety of software projects have made use of OpenCL as their way to run compute workloads enabling them to speed up their applications.

Given Vulkan’s rising prominence, how does one port from OpenCL to Vulkan?

This is a series of blog posts on how to port from OpenCL to Vulkan:

  1. OpenCL -> Vulkan: A Porting Guide (#1)
  2. OpenCL -> Vulkan: A Porting Guide (#2)

In this post, we’ll cover the different queue synchronization mechanisms in OpenCL and Vulkan.

clFinish vs vkWaitForFences

In the previous post I explained that an OpenCL queue (cl_command_queue) was an amalgamation of two distinct concepts:

  1. A collection of workloads to run on some hardware
  2. A thing that will run various workloads and allow interactions between them

Whereas Vulkan uses a VkCommandBuffer for 1, and a VkQueue for 2.

One common synchronization users want to do is let a queue execute a bunch of work, and wait for all that work to be done.

In OpenCL, you can wait on all previously submitted commands to a queue by using clFinish.

cl_command_queue queue; // previously created

// submit work to the queue
if (CL_SUCCESS != clFinish(queue)) {
  // ... error!
}

In Vulkan, because a queue is just a thing to run workloads on, we instead have to wait on the command buffer itself to complete. This is done via a VkFence which is specified when submitting work to a VkQueue.

VkCommandBuffer commandBuffer; // previously created
VkFence fence; // previously created

// submit work to the commandBuffer

VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,
  0,
  0,
  0,
  1,
  &commandBuffer,
  0,
  0,
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    fence)) {
  // ... error!
}

if (VK_SUCCESS != vkWaitForFences(
    device,
    1,
    &fence,
    VK_TRUE,
    UINT64_MAX)) {
  // ... error!
}

One thing to note is that you can wait on a Vulkan queue to finish all submitted workloads, but remember the difference between Vulkan queues and OpenCL queues. Vulkan queues are retrieved from a device. If multiple parts of your code (including third party libraries) retrieve the same Vulkan queue and are executing workloads on it, you will end up waiting for someone else’s work to complete.

TL;DR – waiting on a queue in Vulkan is not the same as OpenCL.

Dependencies within a cl_command_queue / VkCommandBuffer

Both OpenCL and Vulkan have mechanisms to ensure a command will only begin executing once another command has completed.

Firstly, remember that an OpenCL command queue is in order by default. What this means is that each command you submit into the queue will only begin executing once the preceding command has completed. While this isn’t ideal for performance in a number of situations, it does let users get up and running in a safe and quick manner.

OpenCL also allows command queues to be out of order. This means that commands submitted to a queue are guaranteed to be dispatched in order, but they may run concurrently and/or complete out of order.

When using an out of order OpenCL queue, to get commands to wait on other commands before they begin executing, you use a cl_event to create a dependency between the two commands.

cl_mem bufferA, bufferB, bufferC; // previously created

cl_event event;

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    bufferA,
    bufferB,
    0,
    0,
    42,
    0,
    nullptr,
    &event)) {
  // ... error!
}

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    bufferB,
    bufferC,
    0,
    0,
    42,
    1,
    &event,
    nullptr)) {
  // ... error!
}

We can guarantee that even if the queue above was an out of order queue, the commands would still be executed in order, because we expressed the dependency between the two commands.

In Vulkan, queues are out of order. There is also no exactly matching mechanism to get two arbitrary commands to depend on one another. Vulkan relies on more knowledge of what you are actually trying to do to create the right kind of synchronization between commands.

The easiest (though by no means the most performant) way to map OpenCL code that has an event dependency between two commands, or whose OpenCL queue was created in order, is to have a separate Vulkan command buffer for each command. While this might seem crude, it’ll allow you to use another of Vulkan’s synchronization mechanisms to solve the problem – the semaphore.

VkBuffer bufferA, bufferB, bufferC; // previously created
VkCommandBuffer commandBuffer1; // previously created
VkCommandBuffer commandBuffer2; // previously created

VkSemaphoreCreateInfo semaphoreCreateInfo = {
  VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
  nullptr,
  0
};

VkSemaphore semaphore;

if (VK_SUCCESS != vkCreateSemaphore(
    device,
    &semaphoreCreateInfo,
    nullptr,
    &semaphore)) {
  // ... error!
}

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer1,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer1, bufferA, bufferB, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer1)) {
  // ... error!
}
VkSubmitInfo submitInfo1 = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,
  0,
  0,
  0,
  1,
  &commandBuffer1,
  1,
  &semaphore,
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo1,
    nullptr)) {
  // ... error!
}

VkPipelineStageFlags pipelineStageFlags =
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer2,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer2, bufferB, bufferC, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer2)) {
  // ... error!
}

VkSubmitInfo submitInfo2 = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,
  1,
  &semaphore,
  &pipelineStageFlags,
  1,
  &commandBuffer2,
  0,
  0,
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo2,
    nullptr)) {
  // ... error!
}

A Vulkan semaphore allows you to express dependencies between command buffers. So by placing each command into its own command buffer we can use a semaphore between these command buffers to emulate the OpenCL behaviour of in order queues and arbitrary command dependencies.

As with everything in Vulkan – the way to get performance is to explain to the driver exactly what you intend to do. In our example where we are copying data from buffer A -> buffer B -> buffer C above, we are basically creating a dependency on our usage of buffer B. The copy from buffer B -> buffer C cannot begin until the copy from buffer A -> buffer B has completed. So Vulkan gives us the tools to tell the driver about this dependency explicitly, and we can use them within a single command buffer.

The most analogous approach to the OpenCL example is to use a Vulkan event to encode the dependency.

VkEventCreateInfo eventCreateInfo = {
  VK_STRUCTURE_TYPE_EVENT_CREATE_INFO,
  nullptr,
  0
};

VkEvent event;

if (VK_SUCCESS != vkCreateEvent(
    device,
    &eventCreateInfo,
    nullptr,
    &event)) {
  // ... error!
}

Note that we create the event explicitly with Vulkan, unlike in OpenCL where any clEnqueue* command takes an optional out_event as its last parameter.

VkBuffer bufferA, bufferB, bufferC; // previously created
VkCommandBuffer commandBuffer; // previously created

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer, bufferA, bufferB, 1, &bufferCopy);

vkCmdSetEvent(
    commandBuffer, 
    event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT);

VkMemoryBarrier memoryBarrier = {
  VK_STRUCTURE_TYPE_MEMORY_BARRIER,
  nullptr,
  VK_ACCESS_MEMORY_WRITE_BIT,
  VK_ACCESS_MEMORY_READ_BIT
};

vkCmdWaitEvents(
    commandBuffer,
    1,
    &event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    1,
    &memoryBarrier,
    0,
    nullptr,
    0,
    nullptr);

vkCmdCopyBuffer(commandBuffer, bufferB, bufferC, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer)) {
  // ... error!
}
VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,
  0,
  0,
  0,
  1,
  &commandBuffer,
  0,
  0,
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    nullptr)) {
  // ... error!
}

So to do a similar thing to OpenCL’s event chaining semantics we:

  1. add our buffer A -> buffer B copy command
  2. set an event that will trigger when all previous commands are complete, in our case the current set of all previous commands is the one existing copy buffer command
  3. wait for the previous event to complete, specifying that all memory operations that performed a write before this wait must be resolved, and that all read operations after this event can read them
  4. add our buffer B -> buffer C copy command

Now we can be even more explicit with Vulkan and specifically use VK_ACCESS_TRANSFER_READ_BIT and VK_ACCESS_TRANSFER_WRITE_BIT – but I’m using the much more inclusive VK_ACCESS_MEMORY_READ_BIT and VK_ACCESS_MEMORY_WRITE_BIT to be clear what OpenCL will be doing implicitly for you as a user.

Dependencies between multiple cl_command_queue’s / VkCommandBuffer’s

When synchronizing between multiple cl_command_queue’s in OpenCL we use the exact same mechanism as with one queue.

cl_mem bufferA, bufferB, bufferC; // previously created
cl_command_queue queue1, queue2; // previously created

cl_event event;

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue1,
    bufferA,
    bufferB,
    0,
    0,
    42,
    0,
    nullptr,
    &event)) {
  // ... error!
}

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue2,
    bufferB,
    bufferC,
    0,
    0,
    42,
    1,
    &event,
    nullptr)) {
  // ... error!
}

The command queue queue2 will not begin executing its copy buffer command until the command enqueued on queue1 has completed. Having the same mechanism for creating dependencies within a queue and outwith a queue is a very nice thing from a user perspective – there is one true way to create a synchronization between commands in OpenCL.

In Vulkan, when we want to create a dependency between two VkCommandBuffer’s, the easiest way is to use the semaphore approach I showed above. You could also use a VkEvent that is triggered at the end of one command buffer and waited on at the beginning of another. If you want to amortize the cost of doing multiple submits to the same queue, then use the event approach.

You can also use semaphores to create dependencies between multiple Vulkan queues (note that events must not be used to synchronize between commands submitted to different queues). Remember that a Vulkan queue can be thought of as an exposition of some physical concurrency in the hardware – in other words, running things on two distinct queues concurrently can lead to a performance improvement.

I recommend using a semaphore as the mechanism to encode dependencies between queues for the most part as it is simpler to get right.

The main place the event approach wins is when you have a long command buffer where, after only a few commands, you can unblock work recorded in a subsequent command buffer. In this case you’d be better off using an event, as that will enable the later work to begin executing much earlier than would otherwise be possible.
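
As a rough sketch of the event approach between two command buffers submitted to the same queue (reusing the event and memory barrier from the single command buffer example above):

// At the end of the first command buffer, signal the event...
vkCmdSetEvent(
    commandBuffer1,
    event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT);

// ... and at the start of the second command buffer, wait on it.
vkCmdWaitEvents(
    commandBuffer2,
    1,
    &event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    1,
    &memoryBarrier,
    0,
    nullptr,
    0,
    nullptr);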

clEnqueueBarrierWithWaitList vs vkCmdPipelineBarrier

Both OpenCL and Vulkan have a barrier that acts as a memory and execution barrier. When you have a pattern whereby you have N commands that must have completed execution before another M commands begin, a barrier is normally the answer.

// N commands before here...

if (CL_SUCCESS != clEnqueueBarrierWithWaitList(
    queue,
    0,
    nullptr,
    nullptr)) {
  // ... error!
}

// M commands after here will only begin once
// the previous N commands have completed!

And the corresponding Vulkan:

VkMemoryBarrier memoryBarrier = {
  VK_STRUCTURE_TYPE_MEMORY_BARRIER,
  nullptr,
  VK_ACCESS_MEMORY_WRITE_BIT,
  VK_ACCESS_MEMORY_READ_BIT
};

// N commands before here...

vkCmdPipelineBarrier(
    commandBuffer,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    0, // no dependency flags
    1,
    &memoryBarrier,
    0,
    nullptr,
    0,
    nullptr);

// M commands after here will only begin once
// the previous N commands have completed!

What’s next?

After this monstrous dive into porting OpenCL’s synchronization mechanisms to Vulkan, in the next post we’ll look at the differences between OpenCL’s kernels and Vulkan’s pipelines – stay tuned!

16 Jun

OpenCL -> Vulkan: A Porting Guide (#2)

Vulkan is the newest kid on the block when it comes to cross-platform, widely supported, GPGPU compute. Vulkan’s primacy as the high performance rendering API powering the latest versions of Android, coupled with Windows and Linux desktop drivers from all major vendors means that we have a good way to run compute workloads on a wide range of devices.

OpenCL is the venerable old boy of GPGPU these days – having been around since 2009. A huge variety of software projects have made use of OpenCL as their way to run compute workloads enabling them to speed up their applications.

Given Vulkan’s rising prominence, how does one port from OpenCL to Vulkan?

This is a series of blog posts on how to port from OpenCL to Vulkan:

  1. OpenCL -> Vulkan: A Porting Guide (#1)

In this post, we’ll cover porting from OpenCL’s cl_command_queue to Vulkan’s VkQueue.

cl_command_queue -> VkCommandBuffer and VkQueue

OpenCL made a poor choice when cl_command_queue was designed. A cl_command_queue is an amalgamation of two very distinct things:

  1. A collection of workloads to run on some hardware
  2. A thing that will run various workloads and allow interactions between them

Vulkan broke this into the two constituent parts: for 1 we have a VkCommandBuffer, an encapsulation of one or more commands to run on a device; for 2 we have a VkQueue, the thing that will actually run these commands and allow us to synchronize on the result.

Without diving too deeply, Vulkan’s approach allows for a selection of commands to be built once, and then run multiple times. For a huge number of compute workloads we run on datasets, we’re running the same set of commands thousands of times – and Vulkan allows us to amortise the cost of building up this collection of commands to run.

Back to OpenCL, we use clCreateCommandQueue (for pre 2.0) / clCreateCommandQueueWithProperties to create this amalgamated ‘collection of things I want you to run and a way of running them’. We’ll enable CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE as that is the behaviour of a Vulkan VkQueue (although remember that not all OpenCL devices actually support out of order queues – I’m doing this to help bake into your mind the mental mapping of how Vulkan executes command buffers on queues).

cl_queue_properties queueProperties[3] = {
    CL_QUEUE_PROPERTIES,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
    0
};

cl_command_queue queue = clCreateCommandQueueWithProperties(
    context,
    device,
    queueProperties,
    &errorcode);

if (CL_SUCCESS != errorcode) {
 // ... error!
}

The corresponding object in Vulkan is the VkQueue – which we get from the device, rather than creating as OpenCL does. This is because a queue in Vulkan is more like a physical aspect of the device, rather than some software construct – this isn’t mandated in the specification, but it’s a useful mental model to adopt when thinking about Vulkan’s queues.

Remember that when we created our VkDevice we requested which queue families we wanted to use with the device? Now to actually get a queue that supports compute, we have to choose one of the queue family indices that supported compute, and get the corresponding VkQueue from that queue family.

VkQueue queue;

uint32_t queueFamilyIndex = UINT32_MAX;

for (uint32_t i = 0; i < queueFamilyProperties.size(); i++) {
  if (VK_QUEUE_COMPUTE_BIT & queueFamilyProperties[i].queueFlags) {
    queueFamilyIndex = i;
    break;
  }
}

if (UINT32_MAX == queueFamilyIndex) {
  // ... error!
}

vkGetDeviceQueue(device, queueFamilyIndex, 0, &queue);

clEnqueue* vs vkCmd*

To actually execute something on a device, OpenCL uses commands that begin with clEnqueue* – these commands will enqueue work onto a command queue and possibly begin executing it. Why possibly? OpenCL is utterly vague on when commands actually begin executing. The specification states that a call to clFlush, clFinish, or clWaitForEvents on an event that is being signalled by a previously enqueued command on a command queue will guarantee that the device has actually begun executing. It is entirely valid for an implementation to begin executing work when the clEnqueue* command is called, and equally valid for the implementation to delay until a bunch of clEnqueue* commands are in the queue and the corresponding clFlush/clFinish/clWaitForEvents is called.

cl_mem src, dst; // Two previously created buffers

cl_event event;
if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    src,
    dst,
    0, // src offset
    0, // dst offset
    42, // size in bytes to copy
    0,
    nullptr,
    &event)) {
  // ... error!
}

// If we were going to enqueue more stuff on the command queue,
// but wanted the above command to definitely begin execution,
// we'd call flush here.
if (CL_SUCCESS != clFlush(queue)) {
  // ... error!
}

// We could either call finish...
if (CL_SUCCESS != clFinish(queue)) {
  // ... error!
}

// ... or wait for the event we used!
if (CL_SUCCESS != clWaitForEvents(1, &event)) {
  // ... error!
}

In contrast, Vulkan requires us to submit all our commands into a VkCommandBuffer. First we need to create the command buffer.

VkCommandPoolCreateInfo commandPoolCreateInfo = {
  VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO,
  0,
  0,
  queueFamilyIndex
};

VkCommandPool commandPool;

if (VK_SUCCESS != vkCreateCommandPool(
    device,
    &commandPoolCreateInfo,
    0,
    &commandPool)) {
  // ... error!
}

VkCommandBufferAllocateInfo commandBufferAllocateInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
  0,
  commandPool,
  VK_COMMAND_BUFFER_LEVEL_PRIMARY,
  1 // We are creating one command buffer.
};

VkCommandBuffer commandBuffer;

if (VK_SUCCESS != vkAllocateCommandBuffers(
    device,
    &commandBufferAllocateInfo,
    &commandBuffer)) {
  // ... error!
}

Now we have our command buffer with which we can queue up commands to execute on a Vulkan queue.

VkBuffer src, dst; // Two previously created buffers

VkCommandBufferBeginInfo commandBufferBeginInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
  0,
  VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT,
  0
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer,
    &commandBufferBeginInfo)) {
  // ... error!
}

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

vkCmdCopyBuffer(commandBuffer, src, dst, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer)) {
  // ... error!
}

VkFenceCreateInfo fenceCreateInfo = {
  VK_STRUCTURE_TYPE_FENCE_CREATE_INFO,
  0,
  0
};

VkFence fence;

if (VK_SUCCESS != vkCreateFence(
    device,
    &fenceCreateInfo,
    0,
    &fence)) {
  // ... error!
}

VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,
  0,
  0,
  0,
  1,
  &commandBuffer,
  0,
  0,
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    fence)) {
  // ... error!
}

// We can either wait on our commands to complete by fencing...
if (VK_SUCCESS != vkWaitForFences(
    device,
    1,
    &fence,
    VK_TRUE,
    UINT64_MAX)) {
  // ... error!
}

// ... or waiting for the entire queue to have finished...
if (VK_SUCCESS != vkQueueWaitIdle(queue)) {
  // ... error!
}

// ... or even for the entire device to be idle!
if (VK_SUCCESS != vkDeviceWaitIdle(device)) {
  // ... error!
}

Vulkan gives us many more ways to synchronize on host for when we are complete with our workload. We can specify a VkFence to our queue submission to wait on one or more command buffers in that submit, we can wait for the queue to be idle, or even wait for the entire device to be idle! Fences and command buffers can be reused by calling vkResetFences and vkResetCommandBuffer respectively – note that the command buffer can be reused for an entirely different set of commands to be executed. If you wanted to resubmit the exact same command buffer, you’d have to remove the VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT flag in the VkCommandBufferBeginInfo struct above.
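
As a rough sketch of that reuse (note that resetting an individual command buffer requires its command pool to have been created with the VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT flag):

// Reset the fence so it can be submitted with and waited on again...
if (VK_SUCCESS != vkResetFences(device, 1, &fence)) {
  // ... error!
}

// ... and reset the command buffer so new commands can be recorded into it.
if (VK_SUCCESS != vkResetCommandBuffer(commandBuffer, 0)) {
  // ... error!
}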

So a crucial thing to note here – synchronizing on a cl_command_queue is similar to a VkQueue, but the mechanisms are not identical.

We’ll cover these queue synchronization mechanisms in more detail in the next post in the series.

06 Jun

OpenCL -> Vulkan: A Porting Guide (#1)

Vulkan is the newest kid on the block when it comes to cross-platform, widely supported, GPGPU compute. Vulkan’s primacy as the high performance rendering API powering the latest versions of Android, coupled with Windows and Linux desktop drivers from all major vendors means that we have a good way to run compute workloads on a wide range of devices.

OpenCL is the venerable old boy of GPGPU these days – having been around since 2009. A huge variety of software projects have made use of OpenCL as their way to run compute workloads enabling them to speed up their applications.

Given Vulkan’s rising prominence, how does one port from OpenCL to Vulkan?

This is part 1 of my guide for how things map between the APIs!

cl_platform_id -> VkInstance

In OpenCL, the first thing you do is get the platform identifiers (using clGetPlatformIDs).

// We do not strictly need to initialize this to 0 (as it'll
// be set by clGetPlatformIDs), but given a lot of people do
// not check the error code returns, it's safer to 0
// initialize.
cl_uint numPlatforms = 0;
if (CL_SUCCESS != clGetPlatformIDs(
    0,
    nullptr,
    &numPlatforms)) {
  // ... error!
}

std::vector<cl_platform_id> platforms(numPlatforms);

if (CL_SUCCESS != clGetPlatformIDs(
    platforms.size(),
    platforms.data(),
    nullptr)) {
  // ... error!
}

Each cl_platform_id is a handle into an individual vendor’s OpenCL driver – if you had an AMD and an NVIDIA implementation of OpenCL on your system, you’d get two cl_platform_id’s returned.

Vulkan is different here – instead of getting one or more handles to individual vendors implementations, we instead create a single VkInstance (via vkCreateInstance).

const VkApplicationInfo applicationInfo = {
  VK_STRUCTURE_TYPE_APPLICATION_INFO,
  0,
  "MyAwesomeApplication",
  0,
  "",
  0,
  VK_MAKE_VERSION(1, 0, 0)
};
 
const VkInstanceCreateInfo instanceCreateInfo = {
  VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
  0,
  0,
  &applicationInfo,
  0,
  0,
  0,
  0
};
 
VkInstance instance;
if (VK_SUCCESS != vkCreateInstance(
    &instanceCreateInfo,
    0,
    &instance)) {
  // ... error!
}

This single instance allows us to access multiple vendor implementations of the Vulkan API through a single object.

cl_device_id -> VkPhysicalDevice

In OpenCL, you can query one or more cl_device_id’s from each of the cl_platform_id’s we previously queried (via clGetDeviceIDs). When querying for a device, we can specify a cl_device_type to basically ask the driver for its default device (normally a GPU) or for a specific device type. We’ll use CL_DEVICE_TYPE_ALL, meaning we are instructing the driver to return all the devices it knows about, and we can choose from them.

cl_uint numDevices = 0;

for (cl_uint i = 0; i < platforms.size(); i++) {
  // We do not strictly need to initialize this to 0 (as it'll
  // be set by clGetDeviceIDs), but given a lot of people do
  // not check the error code returns, it's safer to 0
  // initialize.
  cl_uint numDevicesForPlatform = 0;

  if (CL_SUCCESS != clGetDeviceIDs(
      platforms[i],
      CL_DEVICE_TYPE_ALL,
      0,
      nullptr,
      &numDevicesForPlatform)) {
    // ... error!
  }

  numDevices += numDevicesForPlatform;
}

std::vector<cl_device_id> devices(numDevices);

// reset numDevices as we'll use it for our insertion offset
numDevices = 0;

for (cl_uint i = 0; i < platforms.size(); i++) {
  cl_uint numDevicesForPlatform = 0;

  if (CL_SUCCESS != clGetDeviceIDs(
      platforms[i],
      CL_DEVICE_TYPE_ALL,
      0,
      nullptr,
      &numDevicesForPlatform)) {
    // ... error!
  }

  if (CL_SUCCESS != clGetDeviceIDs(
      platforms[i],
      CL_DEVICE_TYPE_ALL,
      numDevicesForPlatform,
      devices.data() + numDevices,
      nullptr)) {
    // ... error!
  }

  numDevices += numDevicesForPlatform;
}

The code above is a bit of a mouthful – but it is the easiest way to get every device that the system knows about.

In contrast, since Vulkan gave us a single VkInstance, we query that single instance for all of the VkPhysicalDevice’s it knows about (via vkEnumeratePhysicalDevices). A Vulkan physical device is a link to the actual hardware that the Vulkan code is going to execute on.

uint32_t physicalDeviceCount = 0;

if (VK_SUCCESS != vkEnumeratePhysicalDevices(
    instance,
    &physicalDeviceCount,
    0)) {
  // ... error!
}

std::vector<VkPhysicalDevice> physicalDevices(physicalDeviceCount);

if (VK_SUCCESS != vkEnumeratePhysicalDevices(
    instance,
    &physicalDeviceCount,
    physicalDevices.data())) {
  // ... error!
}

A prominent API design fork can be seen between vkEnumeratePhysicalDevices and clGetDeviceIDs – Vulkan reuses the integer return parameter to the function (the parameter that lets you query the number of physical devices present) to also pass into the driver the number of physical devices we want filled out. In contrast, OpenCL uses an extra parameter for this. These patterns are repeated throughout both APIs.

cl_context -> VkDevice

Here is where it gets trickier between the APIs. OpenCL has a notion of a context – you can think of this object as your way, as the user, to view and interact with what the system is doing. OpenCL allows multiple devices that belong to a single platform to be shared within a context. In contrast, Vulkan is fixed to having a single physical device per its ‘context’, which Vulkan calls a VkDevice.

To make the porting easier, and because in all honesty I’ve yet to see any real use-case or benefit from having multiple OpenCL devices in a single context, we’ll make our OpenCL code create its cl_context using a single cl_device_id (via clCreateContext).

// One of the devices in our std::vector
cl_device_id device = ...;

cl_int errorcode;

cl_context context = clCreateContext(
    nullptr,
    1,
    &device,
    nullptr,
    nullptr,
    &errorcode);

if (CL_SUCCESS != errorcode) {
  // ... error!
}

The above highlights the single biggest travesty in the OpenCL API – the error code has changed from being something returned from the API call, to an optional pointer parameter at the end of the signature. In API design, I’d say this is rule #1 in how not to mess up an API (If you’re interested, these are two great API talks Designing and Evaluating Reusable Components by Casey Muratori and Hourglass Interfaces for C++ APIs by Stefanus Du Toit).

For Vulkan, when creating our VkDevice object, we specifically enable the features we want to use from the device upfront. The easy way to do this is to first call vkGetPhysicalDeviceFeatures, and then pass the result of this into our create device call, enabling all features that the device supports.

When creating our VkDevice, we need to explicitly request which queues we want to use. OpenCL has no real analogous concept to this – the naive comparison is to compare VkQueue’s against cl_command_queue’s, but I’ll show in a later post that this is a wrong conflation. Suffice to say, for our purposes we’ll query for all queues that support compute functionality, as that is almost what OpenCL is doing behind the scenes in the cl_context.

// One of the physical devices in our std::vector
VkPhysicalDevice physicalDevice = ...;

VkPhysicalDeviceFeatures physicalDeviceFeatures;

vkGetPhysicalDeviceFeatures(
    physicalDevice,
    &physicalDeviceFeatures);

uint32_t queueFamilyPropertiesCount = 0;

vkGetPhysicalDeviceQueueFamilyProperties(
    physicalDevice,
    &queueFamilyPropertiesCount,
    0);

// Create a temporary std::vector to allow us to query for
// all the queue's our physical device supports.
std::vector<VkQueueFamilyProperties> queueFamilyProperties(
    queueFamilyPropertiesCount);

vkGetPhysicalDeviceQueueFamilyProperties(
    physicalDevice,
    &queueFamilyPropertiesCount,
    queueFamilyProperties.data());

uint32_t numQueueFamiliesThatSupportCompute = 0;

for (uint32_t i = 0; i < queueFamilyProperties.size(); i++) {
  if (VK_QUEUE_COMPUTE_BIT &
      queueFamilyProperties[i].queueFlags) {
    numQueueFamiliesThatSupportCompute++;
  }
}

// Create a temporary std::vector to allow us to specify all
// queues on device creation
std::vector<VkDeviceQueueCreateInfo> queueCreateInfos(
    numQueueFamiliesThatSupportCompute);

// Reset so we can re-use as an index
numQueueFamiliesThatSupportCompute = 0;

for (uint32_t i = 0; i < queueFamilyProperties.size(); i++) {
  if (VK_QUEUE_COMPUTE_BIT &
      queueFamilyProperties[i].queueFlags) {
    const float queuePriority = 1.0f;

    const VkDeviceQueueCreateInfo deviceQueueCreateInfo = {
        VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
        0,
        0,
        i,
        1,
        &queuePriority
    };

    queueCreateInfos[numQueueFamiliesThatSupportCompute] =
        deviceQueueCreateInfo;

    numQueueFamiliesThatSupportCompute++;
  }
}

const VkDeviceCreateInfo deviceCreateInfo = {
    VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
    0,
    0,
    queueCreateInfos.size(),
    queueCreateInfos.data(),
    0,
    0,
    0,
    0,
    &physicalDeviceFeatures
 };

VkDevice device;
if (VK_SUCCESS != vkCreateDevice(
    physicalDevice,
    &deviceCreateInfo,
    0,
    &device)) {
  // ... error!
}

Vulkan’s almost legendary verbosity strikes here – we’re having to write a lot more code than the equivalent in OpenCL to get an almost analogous handle. The plus here is that for the Vulkan driver, it can do a lot more upfront allocations because a much higher proportion of its state is known at creation time – that is the fundamental approach of Vulkan, we are trading upfront verbosity for a more efficient application overall.

Ok – so we’ve now got the API to the point where we can think about actually using the plethora of hardware available from these APIs! Stay tuned for the next in the series where I’ll cover porting from OpenCL’s cl_command_queue to Vulkan’s VkQueue.

11 Mar

Adding JSON 5 to json.h

I’ve added JSON 5 support to my json.h library.

For those not in the know, JSON 5 (http://json5.org/) is a modern update to the JSON standard, including some cool features like unquoted keys, single quoted keys and strings, hexadecimal numbers, Infinity and NaN numbers, and c style comments!

Sticking with the design of my lib – each of the features can be turned on individually if you don’t want the full shebang, or you can just add json_parse_flags_allow_json5 to enable the entire feature set.
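
As a minimal sketch of parsing a JSON 5 payload with the new flag (assuming json.h’s json_parse_ex entry point and the default allocator):

#include "json.h"

#include <stdlib.h>

int main(void) {
  // A payload exercising some JSON 5 features: c style comments,
  // unquoted keys, single quoted strings, and hexadecimal numbers.
  const char payload[] = "{\n"
                         "  // a comment!\n"
                         "  key: 'a single quoted string',\n"
                         "  hex: 0x2A\n"
                         "}";

  struct json_value_s *value = json_parse_ex(
      payload, sizeof(payload) - 1, json_parse_flags_allow_json5,
      0, 0, 0);

  if (!value) {
    return 1; // parse failed!
  }

  free(value); // json.h makes a single allocation by default
  return 0;
}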

The GitHub pull request brings in the functionality, and it is merged into master too!

16 Oct

Adding loops (MPC -> LLVM for the Neil Language #5)

This is part of a series, the first four parts of the series can be found at:

  1. Hooking up MPC & LLVM
  2. Cleaning up the parser
  3. Adding type identifiers
  4. Adding branching

In this post, we’ll cover how to add loops to our little toy language I’m calling Neil – Not Exactly an Intermediate Language.

To keep things simple, I’ve decided to add loops of the form:

while (<expression> <comparison operator> <expression>) {
  <statement>*
}

Grammar Changes

We need to add a new kind of statement to the grammar, one for our while loops:

stmt : \"return\" <lexp>? ';' 
     | <ident> '(' <ident>? (',' <ident>)* ')' ';' 
     | <typeident> ('=' <lexp>)? ';' 
     | <ident> '=' <lexp> ';' 
     | \"if\" '(' <bexp> ')' '{' <stmt>* '}'
     | \"while\" '(' <bexp> ')' '{' <stmt>* '}' ;

And with this one change, because we already handled boolean expressions in the additions for branching, we can handle our loops.

How to Handle Loops

Loops are basically branching – the only caveat is that we are going to branch backwards to previous, already executed, basic blocks.

[figure: control flow graph showing the ‘entry’ block branching to ‘while_body’ and ‘while_merge’, with ‘while_body’ looping back to itself]

For every while statement we create two new basic blocks. Whatever basic block we are in (in the above example one called ‘entry’) will then conditionally enter the loop by branching either to the ‘while_body’ block (that will contain any statements within the while loop), or by branching to the ‘while_merge’ basic block. Within the body of the loop, the ‘while_body’ basic block will then conditionally (based on the bexp part of the grammar change) loop back to itself, or to the ‘while_merge’. This means that all loops converge as the loop finishes – they will always execute ‘while_merge’ whether the loop is entered or not.

Handling Whiles

To handle while statements:

  • we get an LLVMValueRef for the boolean expression – using LLVMBuildICmp or LLVMBuildFCmp to do so
  • once we have our expression, we increment the scope as all symbols need to be in the new scope level
  • we create two new basic blocks, one for ‘while_body’ and one for ‘while_merge’
  • we use LLVMBuildCondBr to branch, based on the LLVMValueRef for the condition, to either ‘while_body’ or ‘while_merge’
  • we then set the LLVMBuilderRef that we are using to build in the ‘while_body’ basic block
  • then we lower the statements in the while statement (which will all be placed within the ‘while_body’ basic block)
  • and after all statements in the while statement have been processed, we re-evaluate the boolean expression for the while loop, then use LLVMBuildCondBr to conditionally branch to ‘while_merge’, or back to ‘while_body’ if the while loop had more iterations required
  • and lastly set the LLVMBuilderRef to add any new statements into the ‘while_merge’ basic block
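
In the LLVM C API, those steps look roughly like the following sketch (builder and function are assumed to be the current LLVMBuilderRef and the LLVMValueRef of the function, and emit_bexp is a hypothetical helper that lowers the boolean expression to an i1 value):

// Create the two new basic blocks.
LLVMBasicBlockRef whileBody = LLVMAppendBasicBlock(function, "while_body");
LLVMBasicBlockRef whileMerge = LLVMAppendBasicBlock(function, "while_merge");

// Conditionally enter the loop from whatever basic block we are in.
LLVMValueRef condition = emit_bexp(builder);
LLVMBuildCondBr(builder, condition, whileBody, whileMerge);

// Lower the statements of the loop into 'while_body'.
LLVMPositionBuilderAtEnd(builder, whileBody);
// ... lower each statement within the while loop here ...

// Re-evaluate the boolean expression, then loop again or exit.
condition = emit_bexp(builder);
LLVMBuildCondBr(builder, condition, whileBody, whileMerge);

// Any statements after the loop go into 'while_merge'.
LLVMPositionBuilderAtEnd(builder, whileMerge);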

And it really is that simple! All the changes we made previously to handle if statements meant that this was a really easy change to add to the language.

Result

Now our simple example looks like so:

i32 foo(i32 x) {
  i32 y = x * 5;
  while (y > 13) {
    if (y < 4) { i32 z = x; y = z; }
    y = y + 42;
  }
  return y;
}
i32 main() {
  return foo(13);
}

And turns into the following LLVM IR:

define i32 @foo(i32 %x) {
entry:
  %y = alloca i32
  %0 = mul i32 %x, 5
  store i32 %0, i32* %y
  %1 = load i32, i32* %y
  %2 = icmp sgt i32 %1, 13
  br i1 %2, label %while_body, label %while_merge

while_body:                     ; preds = %if_merge, %entry
  %3 = load i32, i32* %y
  %4 = icmp slt i32 %3, 4
  br i1 %4, label %if_true, label %if_merge

while_merge:                    ; preds = %if_merge, %entry
  %5 = load i32, i32* %y
  ret i32 %5

if_true:                        ; preds = %while_body
  %z = alloca i32
  store i32 %x, i32* %z
  %6 = load i32, i32* %z
  store i32 %6, i32* %y
  br label %if_merge

if_merge:                       ; preds = %if_true, %while_body
  %7 = load i32, i32* %y
  %8 = add i32 %7, 42
  store i32 %8, i32* %y
  %9 = load i32, i32* %y
  %10 = icmp sgt i32 %9, 13
  br i1 %10, label %while_body, label %while_merge
}

define i32 @main() {
entry:
  %0 = call i32 @foo(i32 13)
  ret i32 %0
}

You can check out the full GitHub pull request for the feature here.

In the next post, we’ll look into how we can add support for pointers to the language, stay tuned!