29 Jun

OpenCL -> Vulkan: A Porting Guide (#3)

Vulkan is the newest kid on the block when it comes to cross-platform, widely supported GPGPU compute. Vulkan's primacy as the high-performance rendering API powering the latest versions of Android, coupled with Windows and Linux desktop drivers from all major vendors, means that we have a good way to run compute workloads on a wide range of devices.

OpenCL is the venerable old boy of GPGPU these days – having been around since 2009. A huge variety of software projects have made use of OpenCL as their way to run compute workloads, enabling them to speed up their applications.

Given Vulkan’s rising prominence, how does one port from OpenCL to Vulkan?

This is a series of blog posts on how to port from OpenCL to Vulkan:

  1. OpenCL -> Vulkan: A Porting Guide (#1)
  2. OpenCL -> Vulkan: A Porting Guide (#2)

In this post, we’ll cover the different queue synchronization mechanisms in OpenCL and Vulkan.

clFinish vs vkWaitForFences

In the previous post I explained that an OpenCL queue (cl_command_queue) was an amalgamation of two distinct concepts:

  1. A collection of workloads to run on some hardware
  2. A thing that will run various workloads and allow interactions between them

Whereas Vulkan uses a VkCommandBuffer for 1, and a VkQueue for 2.

One common synchronization pattern is to have a queue execute a bunch of work, and then wait for all of that work to be done.

In OpenCL, you can wait on all previously submitted commands to a queue by using clFinish.

cl_command_queue queue; // previously created

// submit work to the queue
if (CL_SUCCESS != clFinish(queue)) {
  // ... error!
}

In Vulkan, because a queue is just a thing to run workloads on, we instead have to wait on the command buffer itself to complete. This is done via a VkFence which is specified when submitting work to a VkQueue.

VkCommandBuffer commandBuffer; // previously created
VkFence fence; // previously created
VkQueue queue; // previously retrieved
VkDevice device; // previously created

// record work into the commandBuffer

VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  nullptr, // pNext
  0,       // waitSemaphoreCount
  nullptr, // pWaitSemaphores
  nullptr, // pWaitDstStageMask
  1,       // commandBufferCount
  &commandBuffer,
  0,       // signalSemaphoreCount
  nullptr  // pSignalSemaphores
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    fence)) {
  // ... error!
}

if (VK_SUCCESS != vkWaitForFences(
    device,
    1,
    &fence,
    VK_TRUE,
    UINT64_MAX)) {
  // ... error!
}

One thing to note is that you can wait on a Vulkan queue to finish all submitted workloads (via vkQueueWaitIdle), but remember the difference between Vulkan queues and OpenCL queues. Vulkan queues are retrieved from a device. If multiple parts of your code (including third party libraries) retrieve the same Vulkan queue and are executing workloads on it, you will end up waiting for someone else's work to complete.
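
For completeness, this queue-level wait is done via vkQueueWaitIdle – a minimal sketch, assuming the queue we retrieved in the previous post:

// Waits for ALL work submitted to this queue to complete -
// including work submitted by other parts of your code.
if (VK_SUCCESS != vkQueueWaitIdle(queue)) {
  // ... error!
}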

TL;DR – waiting on a queue in Vulkan is not the same as in OpenCL.

Dependencies within a cl_command_queue / VkCommandBuffer

Both OpenCL and Vulkan have mechanisms to ensure a command will only begin executing once another command has completed.

Firstly, remember that an OpenCL command queue is in order by default. This means that each command you submit to the queue will only begin executing once the preceding command has completed. While this isn't ideal for performance in a number of situations, it does let users get up and running safely and quickly.

OpenCL also allows command queues to be out of order. This means that commands submitted to a queue are guaranteed to be dispatched in order, but they may run concurrently and/or complete out of order.

With an out of order OpenCL queue, to make one command wait on another before it begins executing, you use a cl_event to create a dependency between the two commands.

cl_mem bufferA, bufferB, bufferC; // previously created
cl_command_queue queue; // previously created

cl_event event;

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    bufferA,
    bufferB,
    0,
    0,
    42,
    0,
    nullptr,
    &event)) {
  // ... error!
}

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    bufferB,
    bufferC,
    0,
    0,
    42,
    1,
    &event,
    nullptr)) {
  // ... error!
}

We can guarantee that even if the queue above was an out of order queue, the commands would still execute in order, because we expressed the dependency between the two commands.

In Vulkan, queues are always out of order, and there is no exactly matching mechanism to make two arbitrary commands depend on one another. Vulkan relies on knowing more about what you are actually trying to do to create the right kind of synchronization between commands.

The easiest (though by no means the most performant) way to map OpenCL code with an event dependency between two commands – or OpenCL code whose queue was created in order – is to place each command in a separate Vulkan command buffer. While this might seem crude, it allows you to use another of Vulkan's synchronization mechanisms to solve the problem – the semaphore.

VkBuffer bufferA, bufferB, bufferC; // previously created
VkCommandBuffer commandBuffer1; // previously created
VkCommandBuffer commandBuffer2; // previously created
VkDevice device; // previously created
VkCommandBufferBeginInfo commandBufferBeginInfo; // previously initialized

VkSemaphoreCreateInfo semaphoreCreateInfo = {
  VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
  nullptr,
  0
};

VkSemaphore semaphore;

if (VK_SUCCESS != vkCreateSemaphore(
    device,
    &semaphoreCreateInfo,
    nullptr,
    &semaphore)) {
  // ... error!
}

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer1,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer1, bufferA, bufferB, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer1)) {
  // ... error!
}
VkSubmitInfo submitInfo1 = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  nullptr,   // pNext
  0,         // waitSemaphoreCount
  nullptr,   // pWaitSemaphores
  nullptr,   // pWaitDstStageMask
  1,         // commandBufferCount
  &commandBuffer1,
  1,         // signalSemaphoreCount
  &semaphore // signal the semaphore on completion
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo1,
    VK_NULL_HANDLE)) { // no fence needed here
  // ... error!
}

VkPipelineStageFlags pipelineStageFlags =
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer2,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer2, bufferB, bufferC, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer2)) {
  // ... error!
}

VkSubmitInfo submitInfo2 = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  nullptr,    // pNext
  1,          // waitSemaphoreCount
  &semaphore, // wait on the semaphore before executing
  &pipelineStageFlags,
  1,          // commandBufferCount
  &commandBuffer2,
  0,          // signalSemaphoreCount
  nullptr     // pSignalSemaphores
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo2,
    VK_NULL_HANDLE)) {
  // ... error!
}

A Vulkan semaphore allows you to express dependencies between command buffers. So by placing each command into a command buffer we can use a semaphore between these command buffers to emulate the OpenCL behaviour of in order queues and arbitrary command dependencies.

As with everything in Vulkan – the way to get performance is to explain to the driver exactly what you intend to do. In our example above, where we copy data from buffer A -> buffer B -> buffer C, we are really creating a dependency on our usage of buffer B: the copy from buffer B -> buffer C cannot begin until the copy from buffer A -> buffer B has completed. Vulkan gives us the tools to tell the driver about this dependency explicitly, and we can use them within a single command buffer.

The most analogous approach to the OpenCL example is to use a Vulkan event to encode the dependency.

VkEventCreateInfo eventCreateInfo = {
  VK_STRUCTURE_TYPE_EVENT_CREATE_INFO,
  nullptr,
  0
};

VkEvent event;

if (VK_SUCCESS != vkCreateEvent(
    device,
    &eventCreateInfo,
    nullptr,
    &event)) {
  // ... error!
}

Note that we create the event explicitly in Vulkan, unlike in OpenCL where every clEnqueue* command takes an optional output event as its last parameter.

VkBuffer bufferA, bufferB, bufferC; // previously created
VkCommandBuffer commandBuffer; // previously created
VkCommandBufferBeginInfo commandBufferBeginInfo; // previously initialized

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer, bufferA, bufferB, 1, &bufferCopy);

vkCmdSetEvent(
    commandBuffer, 
    event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT);

VkMemoryBarrier memoryBarrier = {
  VK_STRUCTURE_TYPE_MEMORY_BARRIER,
  nullptr,
  VK_ACCESS_MEMORY_WRITE_BIT,
  VK_ACCESS_MEMORY_READ_BIT
};

vkCmdWaitEvents(
    commandBuffer,
    1,
    &event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    1,
    &memoryBarrier,
    0,
    nullptr,
    0,
    nullptr); // vkCmdWaitEvents returns void - no error to check

vkCmdCopyBuffer(commandBuffer, bufferB, bufferC, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer)) {
  // ... error!
}
VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,
  0,
  0,
  0,
  1,
  &commandBuffer,
  0,
  0,
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    VK_NULL_HANDLE)) {
  // ... error!
}

So to do a similar thing to OpenCL’s event chaining semantics we:

  1. add our buffer A -> buffer B copy command
  2. set an event that will trigger when all previous commands are complete – in our case, the only previous command is the copy buffer command
  3. wait on that event, specifying that all memory writes performed before the wait must be made available, and that reads after the wait can see them
  4. add our buffer B -> buffer C copy command

Now we can be even more explicit with Vulkan and specifically use VK_ACCESS_TRANSFER_READ_BIT and VK_ACCESS_TRANSFER_WRITE_BIT – but I'm using the much more inclusive VK_ACCESS_MEMORY_READ_BIT and VK_ACCESS_MEMORY_WRITE_BIT to make clear what OpenCL is doing implicitly for you as a user.
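
As a sketch of that more explicit variant – reusing the recording from the example above, with a hypothetical transferBarrier in place of the inclusive memoryBarrier – the wait would be scoped to the transfer stage only:

VkMemoryBarrier transferBarrier = {
  VK_STRUCTURE_TYPE_MEMORY_BARRIER,
  nullptr,
  VK_ACCESS_TRANSFER_WRITE_BIT, // writes made by the first copy
  VK_ACCESS_TRANSFER_READ_BIT   // reads made by the second copy
};

vkCmdWaitEvents(
    commandBuffer,
    1,
    &event,
    VK_PIPELINE_STAGE_TRANSFER_BIT, // stage that set the event
    VK_PIPELINE_STAGE_TRANSFER_BIT, // stage that must wait
    1,
    &transferBarrier,
    0,
    nullptr,
    0,
    nullptr);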

Dependencies between multiple cl_command_queue’s / VkCommandBuffer’s

When synchronizing between multiple cl_command_queue’s in OpenCL we use the exact same mechanism as with one queue.

cl_mem bufferA, bufferB, bufferC; // previously created
cl_command_queue queue1, queue2; // previously created

cl_event event;

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue1,
    bufferA,
    bufferB,
    0,
    0,
    42,
    0,
    nullptr,
    &event)) {
  // ... error!
}

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue2,
    bufferB,
    bufferC,
    0,
    0,
    42,
    1,
    &event,
    nullptr)) {
  // ... error!
}

The command queue queue2 will not begin executing its copy buffer command until the copy enqueued on queue1 has completed. Having the same mechanism for creating dependencies within a queue and outwith a queue is a very nice thing from a user perspective – there is one true way to create a synchronization between commands in OpenCL.

In Vulkan, when we want to create a dependency between two VkCommandBuffer's, the easiest way is the semaphore approach I showed above. You can also use a VkEvent that is set at the end of one command buffer and waited on at the beginning of another, as sketched below – but only if both command buffers are submitted to the same queue. If you want to amortize the cost of doing multiple submits to the same queue, the event approach is the one to use.
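
Here's a rough sketch of that event approach, reusing the event, memoryBarrier, bufferCopy, and command buffers from the examples above (and remembering that both command buffers must be submitted to the same queue):

// At the end of recording the first command buffer...
vkCmdCopyBuffer(commandBuffer1, bufferA, bufferB, 1, &bufferCopy);

vkCmdSetEvent(
    commandBuffer1,
    event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT);

// ... and at the start of recording the second command buffer.
vkCmdWaitEvents(
    commandBuffer2,
    1,
    &event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    1,
    &memoryBarrier,
    0,
    nullptr,
    0,
    nullptr);

vkCmdCopyBuffer(commandBuffer2, bufferB, bufferC, 1, &bufferCopy);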

To create dependencies between multiple Vulkan queues, use a semaphore – Vulkan events must not be used to synchronize commands submitted to different queues. Remember that a Vulkan queue can be thought of as an exposition of some physical concurrency in the hardware; in other words, running things on two distinct queues concurrently can lead to a performance improvement.
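
The cross-queue version of the semaphore example is a straightforward tweak – a sketch, assuming hypothetical queue1 and queue2 handles retrieved from compute-capable queue families, and reusing submitInfo1 and submitInfo2 from above:

// Submit the first command buffer to queue1; it signals the
// semaphore when it completes.
if (VK_SUCCESS != vkQueueSubmit(queue1, 1, &submitInfo1, VK_NULL_HANDLE)) {
  // ... error!
}

// Submit the second command buffer to queue2; it waits on the
// semaphore before it begins executing.
if (VK_SUCCESS != vkQueueSubmit(queue2, 1, &submitInfo2, VK_NULL_HANDLE)) {
  // ... error!
}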

So between queues, a semaphore is the only option; within a queue, I recommend it too for the most part, as it is simpler to get right.

The main place where the event approach wins is when you have a long command buffer where, after only a few commands, you can unblock work waiting in another command buffer on the same queue. Setting the event at that point lets the waiting commands begin executing much earlier than a semaphore would allow, since a semaphore only signals once the whole submission has completed.

clEnqueueBarrierWithWaitList vs vkCmdPipelineBarrier

Both OpenCL and Vulkan have a barrier that acts as a combined memory and execution barrier. When you have a pattern whereby N commands must have completed execution before another M commands begin, a barrier is normally the answer.

// N commands before here...

if (CL_SUCCESS != clEnqueueBarrierWithWaitList(
    queue,
    0,
    nullptr,
    nullptr)) {
  // ... error!
}

// M commands after here will only begin once
// the previous N commands have completed!

And the corresponding Vulkan:

VkMemoryBarrier memoryBarrier = {
  VK_STRUCTURE_TYPE_MEMORY_BARRIER,
  nullptr,
  VK_ACCESS_MEMORY_WRITE_BIT,
  VK_ACCESS_MEMORY_READ_BIT
};

// N commands before here...

vkCmdPipelineBarrier(
    commandBuffer,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    0, // no dependency flags
    1,
    &memoryBarrier,
    0,
    nullptr,
    0,
    nullptr); // vkCmdPipelineBarrier returns void - no error to check

// M commands after here will only begin once
// the previous N commands have completed!

What’s next?

After this monstrous dive into porting OpenCL’s synchronization mechanisms to Vulkan, in the next post we’ll look at the differences between OpenCL’s kernels and Vulkan’s pipelines – stay tuned!

16 Jun

OpenCL -> Vulkan: A Porting Guide (#2)

Vulkan is the newest kid on the block when it comes to cross-platform, widely supported GPGPU compute. Vulkan's primacy as the high-performance rendering API powering the latest versions of Android, coupled with Windows and Linux desktop drivers from all major vendors, means that we have a good way to run compute workloads on a wide range of devices.

OpenCL is the venerable old boy of GPGPU these days – having been around since 2009. A huge variety of software projects have made use of OpenCL as their way to run compute workloads, enabling them to speed up their applications.

Given Vulkan’s rising prominence, how does one port from OpenCL to Vulkan?

This is a series of blog posts on how to port from OpenCL to Vulkan:

  1. OpenCL -> Vulkan: A Porting Guide (#1)

In this post, we’ll cover porting from OpenCL’s cl_command_queue to Vulkan’s VkQueue.

cl_command_queue -> VkCommandBuffer and VkQueue

OpenCL made a poor choice when cl_command_queue was designed. A cl_command_queue is an amalgamation of two very distinct things:

  1. A collection of workloads to run on some hardware
  2. A thing that will run various workloads and allow interactions between them

Vulkan broke this into the two constituent parts: for 1 we have a VkCommandBuffer, an encapsulation of one or more commands to run on a device; for 2 we have a VkQueue, the thing that will actually run these commands and allow us to synchronize on the result.

Without diving too deeply, Vulkan's approach allows a selection of commands to be built once and then run multiple times. A huge number of the compute workloads we run on datasets execute the same set of commands thousands of times – and Vulkan allows us to amortise the cost of building up this collection of commands.
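
As a rough sketch of what that amortisation buys us – assuming a command buffer recorded once without the one-time-submit flag, plus the queue, fence, and submitInfo from the examples later in this post – resubmitting the pre-built work is just another vkQueueSubmit:

// Build the command buffer once, then submit the same pre-built
// work as many times as we need.
for (uint32_t i = 0; i < 1000; i++) {
  if (VK_SUCCESS != vkQueueSubmit(queue, 1, &submitInfo, fence)) {
    // ... error!
  }

  // Wait for this submission to complete before reusing the fence.
  if (VK_SUCCESS != vkWaitForFences(
      device, 1, &fence, VK_TRUE, UINT64_MAX)) {
    // ... error!
  }

  if (VK_SUCCESS != vkResetFences(device, 1, &fence)) {
    // ... error!
  }
}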

Back to OpenCL: we use clCreateCommandQueue (pre OpenCL 2.0) / clCreateCommandQueueWithProperties to create this amalgamated 'collection of things I want you to run and a way of running them'. We'll enable CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE as that matches the behaviour of a Vulkan VkQueue (although remember that not all OpenCL devices actually support out of order queues – I'm doing this to let the mental mapping of how Vulkan executes command buffers on queues bake into your mind).

cl_queue_properties queueProperties[3] = {
    CL_QUEUE_PROPERTIES,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
    0
};

cl_int errorcode;

cl_command_queue queue = clCreateCommandQueueWithProperties(
    context,
    device,
    queueProperties,
    &errorcode);

if (CL_SUCCESS != errorcode) {
  // ... error!
}

The corresponding object in Vulkan is the VkQueue – which we get from the device, rather than creating it as we do in OpenCL. This is because a queue in Vulkan is more like a physical aspect of the device than some software construct – this isn't mandated in the specification, but it's a useful mental model to adopt when thinking about Vulkan's queues.

Remember that when we created our VkDevice we requested which queue families we wanted to use with the device? Now to actually get a queue that supports compute, we have to choose one of the queue family indices that supported compute, and get the corresponding VkQueue from that queue family.

VkQueue queue;

uint32_t queueFamilyIndex = UINT32_MAX;

for (uint32_t i = 0; i < queueFamilyProperties.size(); i++) {
  if (VK_QUEUE_COMPUTE_BIT & queueFamilyProperties[i].queueFlags) {
    queueFamilyIndex = i;
    break;
  }
}

if (UINT32_MAX == queueFamilyIndex) {
  // ... error!
}

vkGetDeviceQueue(device, queueFamilyIndex, 0, &queue);

clEnqueue* vs vkCmd*

To actually execute something on a device, OpenCL uses commands that begin with clEnqueue* – such a command will enqueue work onto a command queue and possibly begin executing it. Why possibly? OpenCL is utterly vague about when commands actually begin executing. The specification states that a call to clFlush, clFinish, or clWaitForEvents on an event signalled by a previously enqueued command on a command queue guarantees that the device has actually begun executing. It is entirely valid for an implementation to begin executing work when the clEnqueue* command is called, and equally valid for it to delay until a bunch of clEnqueue* commands are in the queue and the corresponding clFlush/clFinish/clWaitForEvents is called.

cl_mem src, dst; // Two previously created buffers

cl_event event;
if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    src,
    dst,
    0, // src offset
    0, // dst offset
    42, // size in bytes to copy
    0,
    nullptr,
    &event)) {
  // ... error!
}

// If we were going to enqueue more stuff on the command queue,
// but wanted the above command to definitely begin execution,
// we'd call flush here.
if (CL_SUCCESS != clFlush(queue)) {
  // ... error!
}

// We could either call finish...
if (CL_SUCCESS != clFinish(queue)) {
  // ... error!
}

// ... or wait for the event we used!
if (CL_SUCCESS != clWaitForEvents(1, &event)) {
  // ... error!
}

In contrast, Vulkan requires us to record all our commands into a VkCommandBuffer. First we need to create the command buffer.

VkCommandPoolCreateInfo commandPoolCreateInfo = {
  VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO,
  0,
  0,
  queueFamilyIndex
};

VkCommandPool commandPool;

if (VK_SUCCESS != vkCreateCommandPool(
    device,
    &commandPoolCreateInfo,
    0,
    &commandPool)) {
  // ... error!
}

VkCommandBufferAllocateInfo commandBufferAllocateInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
  0,
  commandPool,
  VK_COMMAND_BUFFER_LEVEL_PRIMARY,
  1 // We are creating one command buffer.
};

VkCommandBuffer commandBuffer;

if (VK_SUCCESS != vkAllocateCommandBuffers(
    device,
    &commandBufferAllocateInfo,
    &commandBuffer)) {
  // ... error!
}

Now we have our command buffer with which we can queue up commands to execute on a Vulkan queue.

VkBuffer src, dst; // Two previously created buffers

VkCommandBufferBeginInfo commandBufferBeginInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
  0,
  VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT,
  0
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer,
    &commandBufferBeginInfo)) {
  // ... error!
}

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

vkCmdCopyBuffer(commandBuffer, src, dst, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer)) {
  // ... error!
}

VkFenceCreateInfo fenceCreateInfo = {
  VK_STRUCTURE_TYPE_FENCE_CREATE_INFO,
  0,
  0
};

VkFence fence;

if (VK_SUCCESS != vkCreateFence(
    device,
    &fenceCreateInfo,
    0,
    &fence)) {
  // ... error!
}

VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,
  0,
  0,
  0,
  1,
  &commandBuffer,
  0,
  0,
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    fence)) {
  // ... error!
}

// We can either wait on our commands to complete by fencing...
if (VK_SUCCESS != vkWaitForFences(
    device,
    1,
    &fence,
    VK_TRUE,
    UINT64_MAX)) {
  // ... error!
}

// ... or waiting for the entire queue to have finished...
if (VK_SUCCESS != vkQueueWaitIdle(queue)) {
  // ... error!
}

// ... or even for the entire device to be idle!
if (VK_SUCCESS != vkDeviceWaitIdle(device)) {
  // ... error!
}

Vulkan gives us many more ways to synchronize on the host once our workload is complete. We can specify a VkFence at queue submission to wait on one or more command buffers in that submit, we can wait for the queue to be idle, or even wait for the entire device to be idle! Fences and command buffers can be reused by calling vkResetFences and vkResetCommandBuffer respectively – note that a reset command buffer can be reused for an entirely different set of commands. If you wanted to resubmit the exact same command buffer, you'd have to remove the VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT flag from the VkCommandBufferBeginInfo struct above.
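
As a minimal sketch of that reuse – assuming the command pool had been created with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT, which vkResetCommandBuffer requires (the pool above was created without it):

// Make the fence unsignalled again so it can be reused.
if (VK_SUCCESS != vkResetFences(device, 1, &fence)) {
  // ... error!
}

// Return the command buffer to its initial state; it can now be
// re-recorded with an entirely different set of commands.
if (VK_SUCCESS != vkResetCommandBuffer(commandBuffer, 0)) {
  // ... error!
}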

So a crucial thing to note here – synchronizing on a cl_command_queue is similar to a VkQueue, but the mechanisms are not identical.

We’ll cover these queue synchronization mechanisms in more detail in the next post in the series.

06 Jun

OpenCL -> Vulkan: A Porting Guide (#1)

Vulkan is the newest kid on the block when it comes to cross-platform, widely supported GPGPU compute. Vulkan's primacy as the high-performance rendering API powering the latest versions of Android, coupled with Windows and Linux desktop drivers from all major vendors, means that we have a good way to run compute workloads on a wide range of devices.

OpenCL is the venerable old boy of GPGPU these days – having been around since 2009. A huge variety of software projects have made use of OpenCL as their way to run compute workloads, enabling them to speed up their applications.

Given Vulkan’s rising prominence, how does one port from OpenCL to Vulkan?

This is part 1 of my guide for how things map between the APIs!

cl_platform_id -> VkInstance

In OpenCL, the first thing you do is get the platform identifiers (using clGetPlatformIDs).

// We do not strictly need to initialize this to 0 (as it'll
// be set by clGetPlatformIDs), but given a lot of people do
// not check the error code returns, it's safer to
// 0-initialize.
cl_uint numPlatforms = 0;
if (CL_SUCCESS != clGetPlatformIDs(
    0,
    nullptr,
    &numPlatforms)) {
  // ... error!
}

std::vector<cl_platform_id> platforms(numPlatforms);

if (CL_SUCCESS != clGetPlatformIDs(
    platforms.size(),
    platforms.data(),
    nullptr)) {
  // ... error!
}

Each cl_platform_id is a handle into an individual vendor's OpenCL driver – if you had an AMD and an NVIDIA implementation of OpenCL on your system, you'd get two cl_platform_id's returned.

Vulkan is different here – instead of getting one or more handles to individual vendors implementations, we instead create a single VkInstance (via vkCreateInstance).

const VkApplicationInfo applicationInfo = {
  VK_STRUCTURE_TYPE_APPLICATION_INFO,
  0,
  "MyAwesomeApplication",
  0,
  "",
  0,
  VK_MAKE_VERSION(1, 0, 0)
};
 
const VkInstanceCreateInfo instanceCreateInfo = {
  VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
  0,
  0,
  &applicationInfo,
  0,
  0,
  0,
  0
};
 
VkInstance instance;
if (VK_SUCCESS != vkCreateInstance(
    &instanceCreateInfo,
    0,
    &instance)) {
  // ... error!
}

This single instance allows us to access multiple vendor implementations of the Vulkan API through a single object.

cl_device_id -> VkPhysicalDevice

In OpenCL, you can query one or more cl_device_id's from each cl_platform_id that we previously queried (via clGetDeviceIDs). When querying for devices, we specify a cl_device_type, which lets us ask the driver for its default device (normally a GPU) or for a specific device type. We'll use CL_DEVICE_TYPE_ALL, instructing the driver to return all the devices it knows about so we can choose from them.

cl_uint numDevices = 0;

for (cl_uint i = 0; i < platforms.size(); i++) {
  // We do not strictly need to initialize this to 0 (as it'll
  // be set by clGetDeviceIDs), but given a lot of people do
  // not check the error code returns, it's safer to
  // 0-initialize.
  cl_uint numDevicesForPlatform = 0;

  if (CL_SUCCESS != clGetDeviceIDs(
      platforms[i],
      CL_DEVICE_TYPE_ALL,
      0,
      nullptr,
      &numDevicesForPlatform)) {
    // ... error!
  }

  numDevices += numDevicesForPlatform;
}

std::vector<cl_device_id> devices(numDevices);

// reset numDevices as we'll use it for our insertion offset
numDevices = 0;

for (cl_uint i = 0; i < platforms.size(); i++) {
  cl_uint numDevicesForPlatform = 0;

  if (CL_SUCCESS != clGetDeviceIDs(
      platforms[i],
      CL_DEVICE_TYPE_ALL,
      0,
      nullptr,
      &numDevicesForPlatform)) {
    // ... error!
  }

  if (CL_SUCCESS != clGetDeviceIDs(
      platforms[i],
      CL_DEVICE_TYPE_ALL,
      numDevicesForPlatform,
      devices.data() + numDevices,
      nullptr)) {
    // ... error!
  }

  numDevices += numDevicesForPlatform;
}

The code above is a bit of a mouthful – but it is the easiest way to get every device that the system knows about.

In contrast, since Vulkan gave us a single VkInstance, we query that single instance for all of the VkPhysicalDevice’s it knows about (via vkEnumeratePhysicalDevices). A Vulkan physical device is a link to the actual hardware that the Vulkan code is going to execute on.

uint32_t physicalDeviceCount = 0;

if (VK_SUCCESS != vkEnumeratePhysicalDevices(
    instance,
    &physicalDeviceCount,
    0)) {
  // ... error!
}

std::vector<VkPhysicalDevice> physicalDevices(physicalDeviceCount);

if (VK_SUCCESS != vkEnumeratePhysicalDevices(
    instance,
    &physicalDeviceCount,
    physicalDevices.data())) {
  // ... error!
}

A prominent API design fork can be seen between vkEnumeratePhysicalDevices and clGetDeviceIDs – Vulkan reuses the integer output parameter (the one that lets you query the number of physical devices present) to also pass in the number of physical devices we want filled out. In contrast, OpenCL uses an extra parameter for this. These patterns are repeated throughout both APIs.

cl_context -> VkDevice

Here is where it gets trickier between the APIs. OpenCL has a notion of a context – you can think of this object as your way, as the user, to view and interact with what the system is doing. OpenCL allows multiple devices that belong to a single platform to be shared within a context. In contrast, Vulkan is fixed to a single physical device per its 'context', which Vulkan calls a VkDevice.

To make the porting easier, and because in all honesty I've yet to see any real use-case or benefit from having multiple OpenCL devices in a single context, we'll make our OpenCL code create its cl_context using a single cl_device_id (via clCreateContext).

// One of the devices in our std::vector
cl_device_id device = ...;

cl_int errorcode;

cl_context context = clCreateContext(
    nullptr,
    1,
    &device,
    nullptr,
    nullptr,
    &errorcode);

if (CL_SUCCESS != errorcode) {
  // ... error!
}

The above highlights the single biggest travesty in the OpenCL API – the error code has changed from being something returned from the API call, to an optional pointer parameter at the end of the signature. In API design, I’d say this is rule #1 in how not to mess up an API (If you’re interested, these are two great API talks Designing and Evaluating Reusable Components by Casey Muratori and Hourglass Interfaces for C++ APIs by Stefanus Du Toit).

For Vulkan, when creating our VkDevice object, we specifically enable the features we want to use from the device upfront. The easy way to do this is to first call vkGetPhysicalDeviceFeatures, and then pass the result of this into our create device call, enabling all features that the device supports.

When creating our VkDevice, we need to explicitly request which queues we want to use. OpenCL has no real analogous concept to this – the naive comparison is to compare VkQueue’s against cl_command_queue’s, but I’ll show in a later post that this is a wrong conflation. Suffice to say, for our purposes we’ll query for all queues that support compute functionality, as that is almost what OpenCL is doing behind the scenes in the cl_context.

// One of the physical devices in our std::vector
VkPhysicalDevice physicalDevice = ...;

VkPhysicalDeviceFeatures physicalDeviceFeatures;

vkGetPhysicalDeviceFeatures(
    physicalDevice,
    &physicalDeviceFeatures);

uint32_t queueFamilyPropertiesCount = 0;

vkGetPhysicalDeviceQueueFamilyProperties(
    physicalDevice,
    &queueFamilyPropertiesCount,
    0);

// Create a temporary std::vector to allow us to query for
// all the queue's our physical device supports.
std::vector<VkQueueFamilyProperties> queueFamilyProperties(
    queueFamilyPropertiesCount);

vkGetPhysicalDeviceQueueFamilyProperties(
    physicalDevice,
    &queueFamilyPropertiesCount,
    queueFamilyProperties.data());

uint32_t numQueueFamiliesThatSupportCompute = 0;

for (uint32_t i = 0; i < queueFamilyProperties.size(); i++) {
  if (VK_QUEUE_COMPUTE_BIT &
      queueFamilyProperties[i].queueFlags) {
    numQueueFamiliesThatSupportCompute++;
  }
}

// Create a temporary std::vector to allow us to specify all
// queues on device creation
std::vector<VkDeviceQueueCreateInfo> queueCreateInfos(
    numQueueFamiliesThatSupportCompute);

// Reset so we can re-use as an index
numQueueFamiliesThatSupportCompute = 0;

for (uint32_t i = 0; i < queueFamilyProperties.size(); i++) {
  if (VK_QUEUE_COMPUTE_BIT &
      queueFamilyProperties[i].queueFlags) {
    const float queuePriority = 1.0f;

    const VkDeviceQueueCreateInfo deviceQueueCreateInfo = {
        VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
        0,
        0,
        i,
        1,
        &queuePriority
    };

    queueCreateInfos[numQueueFamiliesThatSupportCompute] =
        deviceQueueCreateInfo;

    numQueueFamiliesThatSupportCompute++;
  }
}

const VkDeviceCreateInfo deviceCreateInfo = {
    VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
    0,
    0,
    static_cast<uint32_t>(queueCreateInfos.size()),
    queueCreateInfos.data(),
    0,
    0,
    0,
    0,
    &physicalDeviceFeatures // enable every feature the device supports
};

VkDevice device;
if (VK_SUCCESS != vkCreateDevice(
    physicalDevice,
    &deviceCreateInfo,
    0,
    &device)) {
  // ... error!
}

Vulkan's almost legendary verbosity strikes here – we're having to write a lot more code than the equivalent in OpenCL to get an almost analogous handle. The plus is that the Vulkan driver can do a lot more upfront allocation, because a much higher proportion of its state is known at creation time – that is the fundamental approach of Vulkan: we trade upfront verbosity for a more efficient application overall.

Ok – so we’ve now got the API to the point where we can think about actually using the plethora of hardware available from these APIs! Stay tuned for the next in the series where I’ll cover porting from OpenCL’s cl_command_queue to Vulkan’s VkQueue.