29 Jun

OpenCL -> Vulkan: A Porting Guide (#3)

Vulkan is the newest kid on the block when it comes to cross-platform, widely supported GPGPU compute. Vulkan's primacy as the high-performance rendering API powering the latest versions of Android, coupled with Windows and Linux desktop drivers from all major vendors, means that we have a good way to run compute workloads on a wide range of devices.

OpenCL is the venerable old boy of GPGPU these days – having been around since 2009. A huge variety of software projects have made use of OpenCL to run their compute workloads, enabling them to speed up their applications.

Given Vulkan’s rising prominence, how does one port from OpenCL to Vulkan?

This is a series of blog posts on how to port from OpenCL to Vulkan:

  1. OpenCL -> Vulkan: A Porting Guide (#1)
  2. OpenCL -> Vulkan: A Porting Guide (#2)

In this post, we’ll cover the different queue synchronization mechanisms in OpenCL and Vulkan.

clFinish vs vkWaitForFences

In the previous post I explained that an OpenCL queue (cl_command_queue) was an amalgamation of two distinct concepts:

  1. A collection of workloads to run on some hardware
  2. A thing that will run various workloads and allow interactions between them

Whereas Vulkan uses a VkCommandBuffer for 1, and a VkQueue for 2.

One common synchronization pattern is to submit a batch of work to a queue and then wait for all of that work to complete.

In OpenCL, you can wait on all previously submitted commands to a queue by using clFinish.

cl_command_queue queue; // previously created

// submit work to the queue
if (CL_SUCCESS != clFinish(queue)) {
  // ... error!
}

In Vulkan, because a queue is just a thing to run workloads on, we instead have to wait on the command buffer itself to complete. This is done via a VkFence which is specified when submitting work to a VkQueue.

VkCommandBuffer commandBuffer; // previously created
VkFence fence; // previously created
VkQueue queue; // previously created
VkDevice device; // previously created

// submit work to the commandBuffer

VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  nullptr, // pNext
  0,       // waitSemaphoreCount
  nullptr, // pWaitSemaphores
  nullptr, // pWaitDstStageMask
  1,       // commandBufferCount
  &commandBuffer,
  0,       // signalSemaphoreCount
  nullptr  // pSignalSemaphores
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    fence)) {
  // ... error!
}

if (VK_SUCCESS != vkWaitForFences(
    device,
    1,
    &fence,
    VK_TRUE,
    UINT64_MAX)) {
  // ... error!
}

One thing to note is that you can wait on a Vulkan queue to finish all submitted workloads, but remember the difference between Vulkan queues and OpenCL queues. Vulkan queues are retrieved from a device. If multiple parts of your code (including third party libraries) retrieve the same Vulkan queue and are executing workloads on it, you will end up waiting for someone else's work to complete.

TL;DR – waiting on a queue in Vulkan is not the same as OpenCL.
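Vulkan does offer a way to wait on a queue directly – vkQueueWaitIdle – but, per the caveat above, it waits for everything submitted to that queue, including work submitted by code you don't control. A sketch:

```cpp
VkQueue queue; // previously retrieved from the device

// Blocks until ALL workloads submitted to this queue have
// completed - including workloads submitted by other parts
// of your code or by third party libraries.
if (VK_SUCCESS != vkQueueWaitIdle(queue)) {
  // ... error!
}
```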

Dependencies within a cl_command_queue / VkCommandBuffer

Both OpenCL and Vulkan have mechanisms to ensure a command will only begin executing once another command has completed.

Firstly, remember that an OpenCL command queue is in order by default. This means that when you submit commands into an OpenCL command queue, each command will only begin executing once the preceding command has completed. While this isn't ideal for performance in a number of situations, it does let users get up and running safely and quickly.

OpenCL also allows command queues to be out of order. This means that commands submitted to a queue are guaranteed to be dispatched in order, but they may run concurrently and/or complete out of order.
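An out-of-order queue is requested when the queue is created. A sketch using the OpenCL 2.0 clCreateCommandQueueWithProperties entry point (on OpenCL 1.x you would instead pass CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE directly to clCreateCommandQueue):

```cpp
cl_context context; // previously created
cl_device_id device; // previously created

// Request out-of-order execution for this queue.
const cl_queue_properties properties[] = {
  CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
  0 // list terminator
};

cl_int error = CL_SUCCESS;
cl_command_queue queue = clCreateCommandQueueWithProperties(
    context,
    device,
    properties,
    &error);

if (CL_SUCCESS != error) {
  // ... error!
}
```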

With an out-of-order OpenCL queue, to make one command wait for another to finish before it begins executing, you use a cl_event to create a dependency between the two commands.

cl_mem bufferA, bufferB, bufferC; // previously created
cl_command_queue queue; // previously created

cl_event event;

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    bufferA,
    bufferB,
    0,       // src offset
    0,       // dst offset
    42,      // size in bytes to copy
    0,       // num events in wait list
    nullptr, // event wait list
    &event)) {
  // ... error!
}

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    bufferB,
    bufferC,
    0,       // src offset
    0,       // dst offset
    42,      // size in bytes to copy
    1,       // num events in wait list
    &event,  // event wait list
    nullptr)) {
  // ... error!
}

We can guarantee that if queue above was an out of order queue, the commands would still be executed in order because we expressed the dependency between both commands.

In Vulkan, queues are out of order. There is also no exactly matching mechanism to make two arbitrary commands depend on one another. Vulkan instead relies on more knowledge of what you are actually trying to do to create the right kind of synchronization between commands.

The easiest (though by no means the most performant) way to map OpenCL code with an event dependency between two commands – or an OpenCL queue that was created in order – is to use a separate Vulkan command buffer for each command. While this might seem crude, it allows you to use another of Vulkan's synchronization mechanisms to solve the problem: the semaphore.

VkBuffer bufferA, bufferB, bufferC; // previously created
VkCommandBuffer commandBuffer1; // previously created
VkCommandBuffer commandBuffer2; // previously created
VkQueue queue; // previously created
VkDevice device; // previously created

VkCommandBufferBeginInfo commandBufferBeginInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
  nullptr, // pNext
  0,       // flags
  nullptr  // pInheritanceInfo
};

VkSemaphoreCreateInfo semaphoreCreateInfo = {
  VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
  nullptr,
  0
};

VkSemaphore semaphore;

if (VK_SUCCESS != vkCreateSemaphore(
    device,
    &semaphoreCreateInfo,
    nullptr,
    &semaphore)) {
  // ... error!
}

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer1,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer1, bufferA, bufferB, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer1)) {
  // ... error!
}

VkSubmitInfo submitInfo1 = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  nullptr,   // pNext
  0,         // waitSemaphoreCount
  nullptr,   // pWaitSemaphores
  nullptr,   // pWaitDstStageMask
  1,         // commandBufferCount
  &commandBuffer1,
  1,         // signalSemaphoreCount
  &semaphore // pSignalSemaphores
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo1,
    VK_NULL_HANDLE)) { // no fence needed here
  // ... error!
}

VkPipelineStageFlags pipelineStageFlags =
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer2,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer2, bufferB, bufferC, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer2)) {
  // ... error!
}

VkSubmitInfo submitInfo2 = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  nullptr,             // pNext
  1,                   // waitSemaphoreCount
  &semaphore,          // pWaitSemaphores
  &pipelineStageFlags, // pWaitDstStageMask
  1,                   // commandBufferCount
  &commandBuffer2,
  0,                   // signalSemaphoreCount
  nullptr              // pSignalSemaphores
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo2,
    VK_NULL_HANDLE)) { // no fence needed here
  // ... error!
}

A Vulkan semaphore allows you to express dependencies between command buffers. So by placing each command into a command buffer we can use a semaphore between these command buffers to emulate the OpenCL behaviour of in order queues and arbitrary command dependencies.

As with everything in Vulkan – the way to get performance is to explain to the driver exactly what you intend to do. In our example where we copy data from buffer A -> buffer B -> buffer C, we are really creating a dependency on our usage of buffer B: the copy from buffer B -> buffer C cannot begin until the copy from buffer A -> buffer B has completed. Vulkan gives us the tools to tell the driver about this dependency explicitly, and we can use them within a single command buffer.

The most analogous approach to the OpenCL example is to use a Vulkan event to encode the dependency.

VkEventCreateInfo eventCreateInfo = {
  VK_STRUCTURE_TYPE_EVENT_CREATE_INFO,
  nullptr,
  0
};

VkEvent event;

if (VK_SUCCESS != vkCreateEvent(
    device,
    &eventCreateInfo,
    nullptr,
    &event)) {
  // ... error!
}

Note that we create the event explicitly in Vulkan, unlike in OpenCL where any clEnqueue* command has an optional output event as its last parameter.

VkBuffer bufferA, bufferB, bufferC; // previously created
VkCommandBuffer commandBuffer; // previously created
VkQueue queue; // previously created

VkCommandBufferBeginInfo commandBufferBeginInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
  nullptr, // pNext
  0,       // flags
  nullptr  // pInheritanceInfo
};

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer, bufferA, bufferB, 1, &bufferCopy);

vkCmdSetEvent(
    commandBuffer, 
    event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT);

VkMemoryBarrier memoryBarrier = {
  VK_STRUCTURE_TYPE_MEMORY_BARRIER,
  nullptr,
  VK_ACCESS_MEMORY_WRITE_BIT,
  VK_ACCESS_MEMORY_READ_BIT
};

// note: vkCmdWaitEvents returns void, so there is no
// result to check here
vkCmdWaitEvents(
    commandBuffer,
    1,
    &event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT, // srcStageMask
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT, // dstStageMask
    1,
    &memoryBarrier,
    0,
    nullptr,
    0,
    nullptr);

vkCmdCopyBuffer(commandBuffer, bufferB, bufferC, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer)) {
  // ... error!
}

VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  nullptr, // pNext
  0,       // waitSemaphoreCount
  nullptr, // pWaitSemaphores
  nullptr, // pWaitDstStageMask
  1,       // commandBufferCount
  &commandBuffer,
  0,       // signalSemaphoreCount
  nullptr  // pSignalSemaphores
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    VK_NULL_HANDLE)) { // no fence needed here
  // ... error!
}

So to do a similar thing to OpenCL’s event chaining semantics we:

  1. add our buffer A -> buffer B copy command
  2. set an event that will trigger when all previous commands are complete – at this point the set of previous commands is just the one copy buffer command
  3. wait on that event, specifying that all memory writes performed before the wait must be made available, and visible to any reads performed after it
  4. add our buffer B -> buffer C copy command

Now we can be even more explicit with Vulkan and specifically use VK_ACCESS_TRANSFER_READ_BIT and VK_ACCESS_TRANSFER_WRITE_BIT – but I’m using the much more inclusive VK_ACCESS_MEMORY_READ_BIT and VK_ACCESS_MEMORY_WRITE_BIT to be clear what OpenCL will be doing implicitly for you as a user.
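For reference, a transfer-scoped version of the wait from the example above might look like the following sketch (the matching vkCmdSetEvent would then also use VK_PIPELINE_STAGE_TRANSFER_BIT, as the stage masks must agree):

```cpp
VkMemoryBarrier transferBarrier = {
  VK_STRUCTURE_TYPE_MEMORY_BARRIER,
  nullptr,
  VK_ACCESS_TRANSFER_WRITE_BIT, // writes made by the first copy
  VK_ACCESS_TRANSFER_READ_BIT   // reads made by the second copy
};

vkCmdWaitEvents(
    commandBuffer,
    1,
    &event,
    VK_PIPELINE_STAGE_TRANSFER_BIT, // srcStageMask
    VK_PIPELINE_STAGE_TRANSFER_BIT, // dstStageMask
    1,
    &transferBarrier,
    0,
    nullptr,
    0,
    nullptr);
```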

Dependencies between multiple cl_command_queues / VkCommandBuffers

When synchronizing between multiple cl_command_queues in OpenCL, we use exactly the same mechanism as with one queue.

cl_mem bufferA, bufferB, bufferC; // previously created
cl_command_queue queue1, queue2; // previously created

cl_event event;

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue1,
    bufferA,
    bufferB,
    0,
    0,
    42,
    0,
    nullptr,
    &event)) {
  // ... error!
}

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue2,
    bufferB,
    bufferC,
    0,
    0,
    42,
    1,
    &event,
    nullptr)) {
  // ... error!
}

The command queue queue2 will not begin executing its copy buffer command until queue1 has completed the first copy. Having the same mechanism for creating dependencies within a queue and across queues is a very nice thing from a user perspective – there is one true way to create a synchronization between commands in OpenCL.

In Vulkan, when we want to create a dependency between two VkCommandBuffers, the easiest way is the semaphore approach I showed above. You could also use a VkEvent that is set at the end of one command buffer and waited on at the beginning of another, provided both command buffers are submitted to the same queue. If you want to amortize the cost of doing multiple submits to the same queue, the event approach is the one to use.
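The shape of the cross-command-buffer event approach, assuming both command buffers are submitted to the same queue and reusing the memoryBarrier from the earlier example, is sketched below:

```cpp
// Recorded at the end of the first command buffer:
vkCmdSetEvent(
    commandBuffer1,
    event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT);

// Recorded at the start of the second command buffer:
vkCmdWaitEvents(
    commandBuffer2,
    1,
    &event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT, // srcStageMask
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT, // dstStageMask
    1,
    &memoryBarrier,
    0,
    nullptr,
    0,
    nullptr);
```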

You can also use a semaphore to create dependencies between multiple Vulkan queues (note that core Vulkan events must not be used to synchronize between different queues – they are a within-queue and host/device primitive). Remember that a Vulkan queue can be thought of as an exposition of some physical concurrency in the hardware; in other words, running things on two distinct queues concurrently can lead to a performance improvement.
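As a sketch, two queues from the same queue family can be retrieved like so (this assumes the family was chosen earlier and that at least two queues were requested in VkDeviceQueueCreateInfo when the device was created):

```cpp
VkDevice device; // previously created
uint32_t queueFamilyIndex; // previously chosen

VkQueue queue1 = VK_NULL_HANDLE;
VkQueue queue2 = VK_NULL_HANDLE;

// Retrieve two distinct queues from the same family.
vkGetDeviceQueue(device, queueFamilyIndex, 0, &queue1);
vkGetDeviceQueue(device, queueFamilyIndex, 1, &queue2);
```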

I recommend using a semaphore as the mechanism to encode dependencies between queues for the most part as it is simpler to get right.

The main place the event approach wins is when you have a long command buffer where, after only a few commands, dependent work could safely begin. Setting an event early in the command buffer unblocks that work much sooner than waiting for the whole command buffer to complete.

clEnqueueBarrierWithWaitList vs vkCmdPipelineBarrier

Both OpenCL and Vulkan have a barrier that acts as a memory and execution barrier. When you have a pattern whereby you have N commands that must have completed execution before another M commands begin, a barrier is normally the answer.

// N commands before here...

if (CL_SUCCESS != clEnqueueBarrierWithWaitList(
    queue,
    0,
    nullptr,
    nullptr)) {
  // ... error!
}

// M commands after here will only begin once
// the previous N commands have completed!

And the corresponding Vulkan:

VkMemoryBarrier memoryBarrier = {
  VK_STRUCTURE_TYPE_MEMORY_BARRIER,
  nullptr,
  VK_ACCESS_MEMORY_WRITE_BIT,
  VK_ACCESS_MEMORY_READ_BIT
};

// N commands before here...

// note: vkCmdPipelineBarrier returns void, so there is no
// result to check here
vkCmdPipelineBarrier(
    commandBuffer,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT, // srcStageMask
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT, // dstStageMask
    0, // dependencyFlags
    1,
    &memoryBarrier,
    0,
    nullptr,
    0,
    nullptr);

// M commands after here will only begin once
// the previous N commands have completed!

What’s next?

After this monstrous dive into porting OpenCL’s synchronization mechanisms to Vulkan, in the next post we’ll look at the differences between OpenCL’s kernels and Vulkan’s pipelines – stay tuned!