16 Jun

OpenCL -> Vulkan: A Porting Guide (#2)

Vulkan is the newest kid on the block when it comes to cross-platform, widely supported, GPGPU compute. Vulkan’s position as the high-performance rendering API powering the latest versions of Android, coupled with Windows and Linux desktop drivers from all major vendors, means that we have a good way to run compute workloads on a wide range of devices.

OpenCL is the venerable old boy of GPGPU these days – it has been around since 2009. A huge variety of software projects have used OpenCL to run their compute workloads and speed up their applications.

Given Vulkan’s rising prominence, how does one port from OpenCL to Vulkan?

This is a series of blog posts on how to port from OpenCL to Vulkan:

  1. OpenCL -> Vulkan: A Porting Guide (#1)

In this post, we’ll cover porting from OpenCL’s cl_command_queue to Vulkan’s VkQueue.

cl_command_queue -> VkCommandBuffer and VkQueue

OpenCL made a poor choice when cl_command_queue was designed. A cl_command_queue is an amalgamation of two very distinct things:

  1. A collection of workloads to run on some hardware
  2. A thing that will run various workloads and allow interactions between them

Vulkan breaks this into its two constituent parts: for 1. we have a VkCommandBuffer, an encapsulation of one or more commands to run on a device; for 2. we have a VkQueue, the thing that actually runs those commands and allows us to synchronize on the result.

Without diving too deeply, Vulkan’s approach allows for a selection of commands to be built once, and then run multiple times. For a huge number of compute workloads we run on datasets, we’re running the same set of commands thousands of times – and Vulkan allows us to amortise the cost of building up this collection of commands to run.
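To make the record-once, submit-many idea concrete without any GPU in the picture, here's a tiny plain-C analogy (entirely hypothetical – the `Command` type and `replay` function are stand-ins, not Vulkan API): we pay the cost of building the command list once, then replay it as many times as we like.

```c
#include <stddef.h>

// A hypothetical "command": a function pointer plus an argument –
// a stand-in for one recorded entry in a command buffer.
typedef struct {
  int (*run)(int);
  int arg;
} Command;

// An example command: "copy" some bytes (here it just returns the size).
static int copy_cmd(int bytes) { return bytes; }

// "Submit" the recorded list `times` times, returning the total bytes
// processed. The cost of building `commands` is paid only once, then
// amortised over every replay – the shape of Vulkan's model.
static int replay(const Command *commands, size_t count, int times) {
  int total = 0;
  for (int t = 0; t < times; t++) {
    for (size_t i = 0; i < count; i++) {
      total += commands[i].run(commands[i].arg);
    }
  }
  return total;
}
```

In an OpenCL-shaped API, by contrast, you would rebuild and re-enqueue the commands on every iteration of the outer loop.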

Back to OpenCL: we use clCreateCommandQueue (pre OpenCL 2.0) / clCreateCommandQueueWithProperties to create this amalgamated ‘collection of things I want you to run and a way of running them’. We’ll enable CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, since out-of-order execution is the behaviour of a Vulkan VkQueue (remember, though, that not all OpenCL devices actually support out-of-order queues – I’m doing this so the mental mapping of how Vulkan executes command buffers on queues bakes into your mind).

cl_queue_properties queueProperties[3] = {
    CL_QUEUE_PROPERTIES,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
    0
};

cl_int errorcode;

cl_command_queue queue = clCreateCommandQueueWithProperties(
    context,
    device,
    queueProperties,
    &errorcode);

if (CL_SUCCESS != errorcode) {
  // ... error!
}

The corresponding object in Vulkan is the VkQueue – which we get from the device, rather than creating it as we do in OpenCL. This is because a queue in Vulkan is more like a physical aspect of the device than some software construct – this isn’t mandated in the specification, but it’s a useful mental model to adopt when thinking about Vulkan’s queues.

Remember that when we created our VkDevice we requested which queue families we wanted to use with the device? Now to actually get a queue that supports compute, we have to choose one of the queue family indices that supported compute, and get the corresponding VkQueue from that queue family.

VkQueue queue;

uint32_t queueFamilyIndex = UINT32_MAX;

for (uint32_t i = 0; i < queueFamilyProperties.size(); i++) {
  if (VK_QUEUE_COMPUTE_BIT & queueFamilyProperties[i].queueFlags) {
    queueFamilyIndex = i;
    break;
  }
}

if (UINT32_MAX == queueFamilyIndex) {
  // ... error!
}

vkGetDeviceQueue(device, queueFamilyIndex, 0, &queue);

clEnqueue* vs vkCmd*

To actually execute something on a device, OpenCL uses commands that begin with clEnqueue* – these commands enqueue work onto a command queue and possibly begin executing it. Why possibly? OpenCL is utterly vague on when commands actually begin executing. The specification states that a call to clFlush, clFinish, or clWaitForEvents on an event signalled by a previously enqueued command on a command queue guarantees that the device has actually begun executing. It is entirely valid for an implementation to begin executing work as soon as the clEnqueue* command is called, and equally valid for it to delay until a bunch of clEnqueue* commands are in the queue and the corresponding clFlush/clFinish/clWaitForEvents is called.

cl_mem src, dst; // Two previously created buffers

cl_event event;
if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    src,
    dst,
    0, // src offset
    0, // dst offset
    42, // size in bytes to copy
    0,
    nullptr,
    &event)) {
  // ... error!
}

// If we were going to enqueue more stuff on the command queue,
// but wanted the above command to definitely begin execution,
// we'd call flush here.
if (CL_SUCCESS != clFlush(queue)) {
  // ... error!
}

// We could either call finish...
if (CL_SUCCESS != clFinish(queue)) {
  // ... error!
}

// ... or wait for the event we used!
if (CL_SUCCESS != clWaitForEvents(1, &event)) {
  // ... error!
}

In contrast, Vulkan requires us to record all our commands into a VkCommandBuffer before submitting them to a queue. First we need to create the command buffer.

VkCommandPoolCreateInfo commandPoolCreateInfo = {
  VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO,
  0, // pNext
  0, // flags
  queueFamilyIndex
};

VkCommandPool commandPool;

if (VK_SUCCESS != vkCreateCommandPool(
    device,
    &commandPoolCreateInfo,
    0,
    &commandPool)) {
  // ... error!
}

VkCommandBufferAllocateInfo commandBufferAllocateInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
  0, // pNext
  commandPool,
  VK_COMMAND_BUFFER_LEVEL_PRIMARY,
  1 // We are creating one command buffer.
};

VkCommandBuffer commandBuffer;

if (VK_SUCCESS != vkAllocateCommandBuffers(
    device,
    &commandBufferAllocateInfo,
    &commandBuffer)) {
  // ... error!
}

Now we have our command buffer, into which we can record commands to execute on a Vulkan queue.

VkBuffer src, dst; // Two previously created buffers

VkCommandBufferBeginInfo commandBufferBeginInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
  0, // pNext
  VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT,
  0  // pInheritanceInfo
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer,
    &commandBufferBeginInfo)) {
  // ... error!
}

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

vkCmdCopyBuffer(commandBuffer, src, dst, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer)) {
  // ... error!
}

VkFenceCreateInfo fenceCreateInfo = {
  VK_STRUCTURE_TYPE_FENCE_CREATE_INFO,
  0, // pNext
  0  // flags
};

VkFence fence;

if (VK_SUCCESS != vkCreateFence(
    device,
    &fenceCreateInfo,
    0,
    &fence)) {
  // ... error!
}

VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0, // pNext
  0, // waitSemaphoreCount
  0, // pWaitSemaphores
  0, // pWaitDstStageMask
  1, // commandBufferCount
  &commandBuffer,
  0, // signalSemaphoreCount
  0  // pSignalSemaphores
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    fence)) {
  // ... error!
}

// We can either wait on our commands to complete by fencing...
if (VK_SUCCESS != vkWaitForFences(
    device,
    1,
    &fence,
    VK_TRUE,
    UINT64_MAX)) {
  // ... error!
}

// ... or waiting for the entire queue to have finished...
if (VK_SUCCESS != vkQueueWaitIdle(queue)) {
  // ... error!
}

// ... or even for the entire device to be idle!
if (VK_SUCCESS != vkDeviceWaitIdle(device)) {
  // ... error!
}

Vulkan gives us many more ways to synchronize on the host for when our workload is complete. We can pass a VkFence to our queue submission to wait on one or more command buffers in that submit, we can wait for the queue to be idle, or even wait for the entire device to be idle! Fences and command buffers can be reused by calling vkResetFences and vkResetCommandBuffer respectively (note that resetting an individual command buffer requires the pool to have been created with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT) – and a reset command buffer can be re-recorded with an entirely different set of commands. If you wanted to resubmit the exact same command buffer, you’d have to remove the VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT flag from the VkCommandBufferBeginInfo struct above.
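As a sketch of that resubmission pattern – assuming the device, queue, fence, and submitInfo from above, and a command buffer recorded without the one-time-submit flag – the loop might look like this:

```c
// Sketch: resubmit the same recorded command buffer several times,
// reusing one fence between submissions. Assumes commandBuffer was
// recorded *without* VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT.
for (int i = 0; i < 3; i++) {
  if (VK_SUCCESS != vkQueueSubmit(queue, 1, &submitInfo, fence)) {
    // ... error!
  }

  if (VK_SUCCESS != vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX)) {
    // ... error!
  }

  // Fences are one-shot: reset before the next submission reuses it.
  if (VK_SUCCESS != vkResetFences(device, 1, &fence)) {
    // ... error!
  }
}
```

Notice that nothing is re-recorded inside the loop – the recording cost from earlier is amortised across every submission.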

So a crucial thing to note here – synchronizing on a cl_command_queue is similar to a VkQueue, but the mechanisms are not identical.

We’ll cover these queue synchronization mechanisms in more detail in the next post in the series.