28 Oct

Slides from my Khronos Munich Chapter talk

I gave a talk on Friday 13th of October 2017 at the Khronos Munich Chapter titled ‘OpenCL to Vulkan: A Porting Guide’. I covered how to port from the OpenCL API to the Vulkan API, some common problems our customers have faced, and how to fix them. The slides are available here.

The talk covered some of the major pitfalls our customers have had in porting OpenCL applications to Vulkan, and also briefly covered the work we did in collaboration with Google and Adobe – clspv.

I hope the slide deck is useful to those of you who couldn’t attend in person.

29 Jun

OpenCL -> Vulkan: A Porting Guide (#3)

Vulkan is the newest kid on the block when it comes to cross-platform, widely supported, GPGPU compute. Vulkan’s primacy as the high performance rendering API powering the latest versions of Android, coupled with Windows and Linux desktop drivers from all major vendors means that we have a good way to run compute workloads on a wide range of devices.

OpenCL is the venerable old boy of GPGPU these days – having been around since 2009. A huge variety of software projects have made use of OpenCL as their way to run compute workloads enabling them to speed up their applications.

Given Vulkan’s rising prominence, how does one port from OpenCL to Vulkan?

This is a series of blog posts on how to port from OpenCL to Vulkan:

  1. OpenCL -> Vulkan: A Porting Guide (#1)
  2. OpenCL -> Vulkan: A Porting Guide (#2)

In this post, we’ll cover the different queue synchronization mechanisms in OpenCL and Vulkan.

clFinish vs vkWaitForFences

In the previous post I explained that an OpenCL queue (cl_command_queue) was an amalgamation of two distinct concepts:

  1. A collection of workloads to run on some hardware
  2. A thing that will run various workloads and allow interactions between them

Whereas Vulkan uses a VkCommandBuffer for 1, and a VkQueue for 2.

One common synchronization users want to do is let a queue execute a bunch of work, and wait for all that work to be done.

In OpenCL, you can wait on all previously submitted commands to a queue by using clFinish.

cl_command_queue queue; // previously created

// submit work to the queue
if (CL_SUCCESS != clFinish(queue)) {
  // ... error!
}

In Vulkan, because a queue is just a thing to run workloads on, we instead have to wait on the command buffer itself to complete. This is done via a VkFence which is specified when submitting work to a VkQueue.

VkCommandBuffer commandBuffer; // previously created
VkFence fence; // previously created

// submit work to the commandBuffer

VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0, // pNext
  0, // waitSemaphoreCount
  0, // pWaitSemaphores
  0, // pWaitDstStageMask
  1, // commandBufferCount
  &commandBuffer,
  0, // signalSemaphoreCount
  0, // pSignalSemaphores
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    fence)) {
  // ... error!
}

if (VK_SUCCESS != vkWaitForFences(
    device,
    1,
    &fence,
    VK_TRUE,
    UINT64_MAX)) {
  // ... error!
}

One thing to note is that you can wait on a Vulkan queue to finish all submitted workloads, but remember the difference between Vulkan queues and OpenCL queues. Vulkan queues are retrieved from a device. If multiple parts of your code (including third party libraries) retrieve the same Vulkan queue and are executing workloads on it, you will end up waiting for someone else's work to complete.

TL;DR – waiting on a queue in Vulkan is not the same as OpenCL.

Dependencies within a cl_command_queue / VkCommandBuffer

Both OpenCL and Vulkan have mechanisms to ensure a command will only begin executing once another command has completed.

Firstly, remember that an OpenCL command queue is in order by default. This means that each command you submit to the queue will only begin executing once the preceding command has completed. While this isn't ideal for performance in many situations, it does let users get up and running quickly and safely.

OpenCL also allows command queues to be out of order. This means that commands submitted to a queue are guaranteed to be dispatched in order, but they may run concurrently and/or complete out of order.

With an out-of-order OpenCL queue, to make one command wait for another to finish before it begins executing, you use a cl_event to create a dependency between the two commands.

cl_mem bufferA, bufferB, bufferC; // previously created

cl_event event;

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    bufferA,
    bufferB,
    0,
    0,
    42,
    0,
    nullptr,
    &event)) {
  // ... error!
}

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    bufferB,
    bufferC,
    0,
    0,
    42,
    1,
    &event,
    nullptr)) {
  // ... error!
}

We can guarantee that even if queue above were an out-of-order queue, the commands would still execute in order, because we expressed the dependency between them.

In Vulkan, queues are out of order, and there is no exactly matching mechanism for making two arbitrary commands depend on one another. Vulkan relies on knowing more about what you are actually trying to do in order to create the right kind of synchronization between commands.

The easiest (though by no means the most performant) way to map OpenCL code with an event dependency between two commands – or OpenCL code using an in-order queue – is to use a separate Vulkan command buffer for each command. While this might seem crude, it allows you to use another of Vulkan's synchronization mechanisms to solve the problem – the semaphore.

VkBuffer bufferA, bufferB, bufferC; // previously created
VkCommandBuffer commandBuffer1; // previously created
VkCommandBuffer commandBuffer2; // previously created
VkCommandBufferBeginInfo commandBufferBeginInfo; // previously initialized

VkSemaphoreCreateInfo semaphoreCreateInfo = {
  VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
  nullptr,
  0
};

VkSemaphore semaphore;

if (VK_SUCCESS != vkCreateSemaphore(
    device,
    &semaphoreCreateInfo,
    nullptr,
    &semaphore)) {
  // ... error!
}

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer1,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer1, bufferA, bufferB, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer1)) {
  // ... error!
}
VkSubmitInfo submitInfo1 = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0, // pNext
  0, // waitSemaphoreCount
  0, // pWaitSemaphores
  0, // pWaitDstStageMask
  1, // commandBufferCount
  &commandBuffer1,
  1, // signalSemaphoreCount
  &semaphore, // signalled when commandBuffer1 completes
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo1,
    VK_NULL_HANDLE)) {
  // ... error!
}

VkPipelineStageFlags pipelineStageFlags =
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer2,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer2, bufferB, bufferC, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer2)) {
  // ... error!
}

VkSubmitInfo submitInfo2 = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0, // pNext
  1, // waitSemaphoreCount
  &semaphore, // wait for commandBuffer1's submission to signal
  &pipelineStageFlags, // stage(s) at which the wait occurs
  1, // commandBufferCount
  &commandBuffer2,
  0, // signalSemaphoreCount
  0, // pSignalSemaphores
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo2,
    VK_NULL_HANDLE)) {
  // ... error!
}

A Vulkan semaphore allows you to express dependencies between command buffers. So by placing each command into a command buffer we can use a semaphore between these command buffers to emulate the OpenCL behaviour of in order queues and arbitrary command dependencies.

As with everything in Vulkan – the way to get performance is to explain to the driver exactly what you intend to do. In our example where we are copying data from buffer A -> buffer B -> buffer C above, we are basically creating a dependency on our usage of buffer B. The copy from buffer B -> buffer C cannot begin until the copy from buffer A -> buffer B has completed. So Vulkan gives us the tools to tell the driver about this dependency explicitly, and we can use them within a single command buffer.

The most analogous approach to the OpenCL example is to use a Vulkan event to encode the dependency.

VkEventCreateInfo eventCreateInfo = {
  VK_STRUCTURE_TYPE_EVENT_CREATE_INFO,
  nullptr,
  0
};

VkEvent event;

if (VK_SUCCESS != vkCreateEvent(
    device,
    &eventCreateInfo,
    nullptr,
    &event)) {
  // ... error!
}

Note that we create the event explicitly with Vulkan, unlike in OpenCL where every clEnqueue* command takes an optional event out-parameter as its last argument.

VkBuffer bufferA, bufferB, bufferC; // previously created
VkCommandBuffer commandBuffer; // previously created
VkCommandBufferBeginInfo commandBufferBeginInfo; // previously initialized

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer, bufferA, bufferB, 1, &bufferCopy);

vkCmdSetEvent(
    commandBuffer, 
    event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT);

VkMemoryBarrier memoryBarrier = {
  VK_STRUCTURE_TYPE_MEMORY_BARRIER,
  nullptr,
  VK_ACCESS_MEMORY_WRITE_BIT,
  VK_ACCESS_MEMORY_READ_BIT
};

// Note: vkCmd* functions record into the command buffer and do not
// return a result to check.
vkCmdWaitEvents(
    commandBuffer,
    1,
    &event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    1,
    &memoryBarrier,
    0,
    nullptr,
    0,
    nullptr);

vkCmdCopyBuffer(commandBuffer, bufferB, bufferC, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer)) {
  // ... error!
}
VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,
  0,
  0,
  0,
  1,
  &commandBuffer,
  0,
  0,
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    VK_NULL_HANDLE)) {
  // ... error!
}

So to do a similar thing to OpenCL’s event chaining semantics we:

  1. add our buffer A -> buffer B copy command
  2. set an event that will trigger when all previous commands are complete, in our case the current set of all previous commands is the one existing copy buffer command
  3. wait for the previous event to complete, specifying that all memory operations that performed a write before this wait must be resolved, and that all read operations after this event can read them
  4. add our buffer B -> buffer C copy command

Now we can be even more explicit with Vulkan and specifically use VK_ACCESS_TRANSFER_READ_BIT and VK_ACCESS_TRANSFER_WRITE_BIT – but I’m using the much more inclusive VK_ACCESS_MEMORY_READ_BIT and VK_ACCESS_MEMORY_WRITE_BIT to be clear what OpenCL will be doing implicitly for you as a user.

Dependencies between multiple cl_command_queue’s / VkCommandBuffer’s

When synchronizing between multiple cl_command_queue’s in OpenCL we use the exact same mechanism as with one queue.

cl_buffer bufferA, bufferB, bufferC; // previously created
cl_command_queue queue1, queue2; // previously created

cl_event event;

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue1,
    bufferA,
    bufferB,
    0,
    0,
    42,
    0,
    nullptr,
    &event)) {
  // ... error!
}

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue2,
    bufferB,
    bufferC,
    0,
    0,
    42,
    1,
    &event,
    nullptr)) {
  // ... error!
}

The second queue, queue2, will not begin executing its copy command until the copy command submitted to queue1 has completed. Having the same mechanism for creating dependencies within a queue and outwith a queue is a very nice thing from a user perspective – there is one true way to synchronize between commands in OpenCL.

In Vulkan, when we want to create a dependency between two VkCommandBuffer's, the easiest way is to use the semaphore approach I showed above. You could also use a VkEvent that is triggered at the end of one command buffer and waited on at the beginning of another. If you want to amortize the cost of doing multiple submits to the same queue, then use the event approach.

You can also use both of these mechanisms to create dependencies between multiple Vulkan queues. Remember that a Vulkan queue can be thought of as an exposition of some physical concurrency in the hardware, or in other words, running things on two distinct queues concurrently can lead to a performance improvement.

I recommend using a semaphore as the mechanism to encode dependencies between queues for the most part as it is simpler to get right.

The main place the event approach wins is when you have a long command buffer where, after only a few commands, you can unblock the concurrently runnable queue to begin execution. In this case you'd be better off using an event, as that will allow the other queue to begin executing much earlier than would otherwise be possible.

clEnqueueBarrierWithWaitList vs vkCmdPipelineBarrier

Both OpenCL and Vulkan have a barrier that acts as a memory and execution barrier. When you have a pattern whereby you have N commands that must have completed execution before another M commands begin, a barrier is normally the answer.

// N commands before here...

if (CL_SUCCESS != clEnqueueBarrierWithWaitList(
    queue,
    0,
    nullptr,
    nullptr)) {
  // ... error!
}

// M commands after here will only begin once
// the previous N commands have completed!

And the corresponding Vulkan:

VkMemoryBarrier memoryBarrier = {
  VK_STRUCTURE_TYPE_MEMORY_BARRIER,
  nullptr,
  VK_ACCESS_MEMORY_WRITE_BIT,
  VK_ACCESS_MEMORY_READ_BIT
};

// N commands before here...

vkCmdPipelineBarrier(
    commandBuffer,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    0, // no dependency flags
    1,
    &memoryBarrier,
    0,
    nullptr,
    0,
    nullptr);

// M commands after here will only begin once
// the previous N commands have completed!

What’s next?

After this monstrous dive into porting OpenCL’s synchronization mechanisms to Vulkan, in the next post we’ll look at the differences between OpenCL’s kernels and Vulkan’s pipelines – stay tuned!

16 Jun

OpenCL -> Vulkan: A Porting Guide (#2)

Vulkan is the newest kid on the block when it comes to cross-platform, widely supported, GPGPU compute. Vulkan’s primacy as the high performance rendering API powering the latest versions of Android, coupled with Windows and Linux desktop drivers from all major vendors means that we have a good way to run compute workloads on a wide range of devices.

OpenCL is the venerable old boy of GPGPU these days – having been around since 2009. A huge variety of software projects have made use of OpenCL as their way to run compute workloads enabling them to speed up their applications.

Given Vulkan’s rising prominence, how does one port from OpenCL to Vulkan?

This is a series of blog posts on how to port from OpenCL to Vulkan:

  1. OpenCL -> Vulkan: A Porting Guide (#1)

In this post, we’ll cover porting from OpenCL’s cl_command_queue to Vulkan’s VkQueue.

cl_command_queue -> VkCommandBuffer and VkQueue

OpenCL made a poor choice when cl_command_queue was designed. A cl_command_queue is an amalgamation of two very distinct things:

  1. A collection of workloads to run on some hardware
  2. A thing that will run various workloads and allow interactions between them

Vulkan broke this into its two constituent parts: for 1 we have a VkCommandBuffer, an encapsulation of one or more commands to run on a device; for 2 we have a VkQueue, the thing that will actually run these commands and allow us to synchronize on the result.

Without diving too deeply, Vulkan’s approach allows for a selection of commands to be built once, and then run multiple times. For a huge number of compute workloads we run on datasets, we’re running the same set of commands thousands of times – and Vulkan allows us to amortise the cost of building up this collection of commands to run.

Back to OpenCL, we use clCreateCommandQueue (for pre 2.0) / clCreateCommandQueueWithProperties to create this amalgamated ‘collection of things I want you to run and a way of running them’. We’ll enable CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE as that is the behaviour of a Vulkan VkQueue (although remember that not all OpenCL devices actually support out of order queues – I’m doing this to allow the mental mapping of how Vulkan executes command buffers on queues to bake into your mind).

cl_queue_properties queueProperties[3] = {
    CL_QUEUE_PROPERTIES,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
    0
};

cl_command_queue queue = clCreateCommandQueueWithProperties(
    context,
    device,
    queueProperties,
    &errorcode);

if (CL_SUCCESS != errorcode) {
 // ... error!
}

The corresponding object in Vulkan is the VkQueue – which we get from the device, rather than creating as OpenCL does. This is because a queue in Vulkan is more like a physical aspect of the device, rather than some software construct – this isn't mandated in the specification, but it's a useful mental model to adopt when thinking about Vulkan's queues.

Remember that when we created our VkDevice we requested which queue families we wanted to use with the device? Now to actually get a queue that supports compute, we have to choose one of the queue family indices that supported compute, and get the corresponding VkQueue from that queue family.

VkQueue queue;

uint32_t queueFamilyIndex = UINT32_MAX;

for (uint32_t i = 0; i < queueFamilyProperties.size(); i++) {
  if (VK_QUEUE_COMPUTE_BIT & queueFamilyProperties[i].queueFlags) {
    queueFamilyIndex = i;
    break;
  }
}

if (UINT32_MAX == queueFamilyIndex) {
  // ... error!
}

vkGetDeviceQueue(device, queueFamilyIndex, 0, &queue);

clEnqueue* vs vkCmd*

To actually execute something on a device, OpenCL uses commands that begin with clEnqueue* – such a command will enqueue work onto a command queue and possibly begin executing it. Why possibly? OpenCL is utterly vague on when commands actually begin executing. The specification states that a call to clFlush, clFinish, or clWaitForEvents on an event that is being signalled by a previously enqueued command on a command queue will guarantee that the device has actually begun executing. It is entirely valid for an implementation to begin executing work when the clEnqueue* command is called, and equally valid for the implementation to delay until a bunch of clEnqueue* commands are in the queue and the corresponding clFlush/clFinish/clWaitForEvents is called.

cl_mem src, dst; // Two previously created buffers

cl_event event;
if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    src,
    dst,
    0, // src offset
    0, // dst offset
    42, // size in bytes to copy
    0,
    nullptr,
    &event)) {
  // ... error!
}

// If we were going to enqueue more stuff on the command queue,
// but wanted the above command to definitely begin execution,
// we'd call flush here.
if (CL_SUCCESS != clFlush(queue)) {
  // ... error!
}

// We could either call finish...
if (CL_SUCCESS != clFinish(queue)) {
  // ... error!
}

// ... or wait for the event we used!
if (CL_SUCCESS != clWaitForEvents(1, &event)) {
  // ... error!
}

In contrast, Vulkan requires us to submit all our commands into a VkCommandBuffer. First we need to create the command buffer.

VkCommandPoolCreateInfo commandPoolCreateInfo = {
  VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO,
  0,
  0,
  queueFamilyIndex
};

VkCommandPool commandPool;

if (VK_SUCCESS != vkCreateCommandPool(
    device,
    &commandPoolCreateInfo,
    0,
    &commandPool)) {
  // ... error!
}

VkCommandBufferAllocateInfo commandBufferAllocateInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
  0,
  commandPool,
  VK_COMMAND_BUFFER_LEVEL_PRIMARY,
  1 // We are creating one command buffer.
};

VkCommandBuffer commandBuffer;

if (VK_SUCCESS != vkAllocateCommandBuffers(
    device,
    &commandBufferAllocateInfo,
    &commandBuffer)) {
  // ... error!
}

Now we have our command buffer with which we can queue up commands to execute on a Vulkan queue.

VkBuffer src, dst; // Two previously created buffers

VkCommandBufferBeginInfo commandBufferBeginInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
  0,
  VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT,
  0
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer,
    &commandBufferBeginInfo)) {
  // ... error!
}

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

vkCmdCopyBuffer(commandBuffer, src, dst, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer)) {
  // ... error!
}

VkFenceCreateInfo fenceCreateInfo = {
  VK_STRUCTURE_TYPE_FENCE_CREATE_INFO,
  0,
  0
};

VkFence fence;

if (VK_SUCCESS != vkCreateFence(
    device,
    &fenceCreateInfo,
    0,
    &fence)) {
  // ... error!
}

VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,
  0,
  0,
  0,
  1,
  &commandBuffer,
  0,
  0,
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    fence)) {
  // ... error!
}

// We can either wait on our commands to complete by fencing...
if (VK_SUCCESS != vkWaitForFences(
    device,
    1,
    &fence,
    VK_TRUE,
    UINT64_MAX)) {
  // ... error!
}

// ... or waiting for the entire queue to have finished...
if (VK_SUCCESS != vkQueueWaitIdle(queue)) {
  // ... error!
}

// ... or even for the entire device to be idle!
if (VK_SUCCESS != vkDeviceWaitIdle(device)) {
  // ... error!
}

Vulkan gives us many more ways to synchronize on the host when our workload is complete. We can specify a VkFence at queue submission to wait on one or more command buffers in that submit, we can wait for the queue to be idle, or even wait for the entire device to be idle! Fences and command buffers can be reused by calling vkResetFences and vkResetCommandBuffer respectively – note that the command buffer can be reused for an entirely different set of commands to be executed. If you wanted to resubmit the exact same command buffer, you'd have to remove the VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT flag from the VkCommandBufferBeginInfo struct above.
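
As a minimal sketch of that reuse (assuming the command buffer above was recorded without VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT, and reusing the device, queue, fence and submitInfo from the snippets above), resubmitting looks something like this:

// Reset the fence back to the unsignalled state...
if (VK_SUCCESS != vkResetFences(device, 1, &fence)) {
  // ... error!
}

// ... resubmit the very same command buffer...
if (VK_SUCCESS != vkQueueSubmit(queue, 1, &submitInfo, fence)) {
  // ... error!
}

// ... and wait for it to complete again.
if (VK_SUCCESS != vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX)) {
  // ... error!
}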

So a crucial thing to note here – synchronizing on a cl_command_queue is similar to a VkQueue, but the mechanisms are not identical.

We’ll cover these queue synchronization mechanisms in more detail in the next post in the series.

06 Jun

OpenCL -> Vulkan: A Porting Guide (#1)

Vulkan is the newest kid on the block when it comes to cross-platform, widely supported, GPGPU compute. Vulkan’s primacy as the high performance rendering API powering the latest versions of Android, coupled with Windows and Linux desktop drivers from all major vendors means that we have a good way to run compute workloads on a wide range of devices.

OpenCL is the venerable old boy of GPGPU these days – having been around since 2009. A huge variety of software projects have made use of OpenCL as their way to run compute workloads enabling them to speed up their applications.

Given Vulkan’s rising prominence, how does one port from OpenCL to Vulkan?

This is part 1 of my guide for how things map between the APIs!

cl_platform_id -> VkInstance

In OpenCL, the first thing you do is get the platform identifiers (using clGetPlatformIDs).

// We do not strictly need to initialize this to 0 (as it'll
// be set by clGetPlatformIDs), but given a lot people do
// not check the error code returns, it's safer to 0
// initialize.
cl_uint numPlatforms = 0;
if (CL_SUCCESS != clGetPlatformIDs(
    0,
    nullptr,
    &numPlatforms)) {
  // ... error!
}

std::vector<cl_platform_id> platforms(numPlatforms);

if (CL_SUCCESS != clGetPlatformIDs(
    platforms.size(),
    platforms.data(),
    nullptr)) {
  // ... error!
}

Each cl_platform_id is a handle into an individual vendor's OpenCL driver – if you had an AMD and an NVIDIA implementation of OpenCL on your system, you'd get two cl_platform_id's returned.

Vulkan is different here – instead of getting one or more handles to individual vendors implementations, we instead create a single VkInstance (via vkCreateInstance).

const VkApplicationInfo applicationInfo = {
  VK_STRUCTURE_TYPE_APPLICATION_INFO,
  0,
  "MyAwesomeApplication",
  0,
  "",
  0,
  VK_MAKE_VERSION(1, 0, 0)
};
 
const VkInstanceCreateInfo instanceCreateInfo = {
  VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
  0,
  0,
  &applicationInfo,
  0,
  0,
  0,
  0
};
 
VkInstance instance;
if (VK_SUCCESS != vkCreateInstance(
    &instanceCreateInfo,
    0,
    &instance)) {
  // ... error!
}

This single instance allows us to access multiple vendor implementations of the Vulkan API through a single object.

cl_device_id -> VkPhysicalDevice

In OpenCL, you can query one or more cl_device_id’s from each cl_platform_id that we previously queried (via clGetDeviceIDs). When querying for a device, we can specify a cl_device_type, where you can basically ask the driver to give you its default device (normally a GPU) or for a specific device type. We’ll use CL_DEVICE_TYPE_ALL, in that we are instructing the driver to return all the devices it knows about, and we can choose from them.

cl_uint numDevices = 0;

for (cl_uint i = 0; i < platforms.size(); i++) {
  // We do not strictly need to initialize this to 0 (as it'll
  // be set by clGetDeviceIDs), but given a lot people do
  // not check the error code returns, it's safer to 0
  // initialize.
  cl_uint numDevicesForPlatform = 0;

  if (CL_SUCCESS != clGetDeviceIDs(
      platforms[i],
      CL_DEVICE_TYPE_ALL,
      0,
      nullptr,
      &numDevicesForPlatform)) {
    // ... error!
  }

  numDevices += numDevicesForPlatform;
}

std::vector<cl_device_id> devices(numDevices);

// reset numDevices as we'll use it for our insertion offset
numDevices = 0;

for (cl_uint i = 0; i < platforms.size(); i++) {
  cl_uint numDevicesForPlatform = 0;

  if (CL_SUCCESS != clGetDeviceIDs(
      platforms[i],
      CL_DEVICE_TYPE_ALL,
      0,
      nullptr,
      &numDevicesForPlatform)) {
    // ... error!
  }

  if (CL_SUCCESS != clGetDeviceIDs(
      platforms[i],
      CL_DEVICE_TYPE_ALL,
      numDevicesForPlatform,
      devices.data() + numDevices,
      nullptr)) {
    // ... error!
  }

  numDevices += numDevicesForPlatform;
}

The code above is a bit of a mouthful – but it is the easiest way to get every device that the system knows about.

In contrast, since Vulkan gave us a single VkInstance, we query that single instance for all of the VkPhysicalDevice’s it knows about (via vkEnumeratePhysicalDevices). A Vulkan physical device is a link to the actual hardware that the Vulkan code is going to execute on.

uint32_t physicalDeviceCount = 0;

if (VK_SUCCESS != vkEnumeratePhysicalDevices(
    instance,
    &physicalDeviceCount,
    0)) {
  // ... error!
}

std::vector<VkPhysicalDevice> physicalDevices(physicalDeviceCount);

if (VK_SUCCESS != vkEnumeratePhysicalDevices(
    instance,
    &physicalDeviceCount,
    physicalDevices.data())) {
  // ... error!
}

A prominent API design fork can be seen between vkEnumeratePhysicalDevices and clGetDeviceIDs – Vulkan reuses the integer return parameter to the function (the parameter that lets you query the number of physical devices present) to also pass into the driver the number of physical devices we want filled out. In contrast, OpenCL uses an extra parameter for this. These patterns are repeated throughout both APIs.

cl_context -> VkDevice

Here is where it gets trickier between the APIs. OpenCL has a notion of a context – you can think of this object as your way, as the user, to view and interact with what the system is doing. OpenCL allows multiple devices that belong to a single platform to be shared within a context. In contrast, Vulkan is fixed to having a single physical device per its 'context', which Vulkan calls a VkDevice.

To make the porting easier, and because in all honesty I've yet to see any real use-case or benefit from having multiple OpenCL devices in a single context, we'll make our OpenCL code create its cl_context using a single cl_device_id (via clCreateContext).

// One of the devices in our std::vector
cl_device_id device = ...;

cl_int errorcode;

cl_context context = clCreateContext(
    nullptr,
    1,
    &device,
    nullptr,
    nullptr,
    &errorcode);

if (CL_SUCCESS != errorcode) {
  // ... error!
}

The above highlights the single biggest travesty in the OpenCL API – the error code has changed from being something returned from the API call, to an optional pointer parameter at the end of the signature. In API design, I’d say this is rule #1 in how not to mess up an API (If you’re interested, these are two great API talks Designing and Evaluating Reusable Components by Casey Muratori and Hourglass Interfaces for C++ APIs by Stefanus Du Toit).

For Vulkan, when creating our VkDevice object, we specifically enable the features we want to use from the device upfront. The easy way to do this is to first call vkGetPhysicalDeviceFeatures, and then pass the result of this into our create device call, enabling all features that the device supports.

When creating our VkDevice, we need to explicitly request which queues we want to use. OpenCL has no real analogous concept to this – the naive comparison is to compare VkQueue’s against cl_command_queue’s, but I’ll show in a later post that this is a wrong conflation. Suffice to say, for our purposes we’ll query for all queues that support compute functionality, as that is almost what OpenCL is doing behind the scenes in the cl_context.

// One of the physical devices in our std::vector
VkPhysicalDevice physicalDevice = ...;

VkPhysicalDeviceFeatures physicalDeviceFeatures;

vkGetPhysicalDeviceFeatures(
    physicalDevice,
    &physicalDeviceFeatures);

uint32_t queueFamilyPropertiesCount = 0;

vkGetPhysicalDeviceQueueFamilyProperties(
    physicalDevice,
    &queueFamilyPropertiesCount,
    0);

// Create a temporary std::vector to allow us to query for
// all the queue's our physical device supports.
std::vector<VkQueueFamilyProperties> queueFamilyProperties(
    queueFamilyPropertiesCount);

vkGetPhysicalDeviceQueueFamilyProperties(
    physicalDevice,
    &queueFamilyPropertiesCount,
    queueFamilyProperties.data());

uint32_t numQueueFamiliesThatSupportCompute = 0;

for (uint32_t i = 0; i < queueFamilyProperties.size(); i++) {
  if (VK_QUEUE_COMPUTE_BIT &
      queueFamilyProperties[i].queueFlags) {
    numQueueFamiliesThatSupportCompute++;
  }
}

// Create a temporary std::vector to allow us to specify all
// queues on device creation
std::vector<VkDeviceQueueCreateInfo> queueCreateInfos(
    numQueueFamiliesThatSupportCompute);

// Reset so we can re-use as an index
numQueueFamiliesThatSupportCompute = 0;

// Note: the queue priority must outlive this loop, as each
// VkDeviceQueueCreateInfo keeps a pointer to it.
const float queuePriority = 1.0f;

for (uint32_t i = 0; i < queueFamilyProperties.size(); i++) {
  if (VK_QUEUE_COMPUTE_BIT &
      queueFamilyProperties[i].queueFlags) {
    const VkDeviceQueueCreateInfo deviceQueueCreateInfo = {
        VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
        0,
        0,
        i,
        1,
        &queuePriority
    };

    queueCreateInfos[numQueueFamiliesThatSupportCompute] =
        deviceQueueCreateInfo;

    numQueueFamiliesThatSupportCompute++;
  }
}

const VkDeviceCreateInfo deviceCreateInfo = {
    VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
    0, // pNext
    0, // flags
    (uint32_t)queueCreateInfos.size(),
    queueCreateInfos.data(),
    0, // enabledLayerCount
    0, // ppEnabledLayerNames
    0, // enabledExtensionCount
    0, // ppEnabledExtensionNames
    &physicalDeviceFeatures // enable everything the device supports
};

VkDevice device;
if (VK_SUCCESS != vkCreateDevice(
    physicalDevice,
    &deviceCreateInfo,
    0,
    &device)) {
  // ... error!
}

Vulkan’s almost legendary verbosity strikes here – we’re having to write a lot more code than the equivalent in OpenCL to get an almost analogous handle. The plus here is that for the Vulkan driver, it can do a lot more upfront allocations because a much higher proportion of its state is known at creation time – that is the fundamental approach of Vulkan, we are trading upfront verbosity for a more efficient application overall.

Ok – so we’ve now got the API to the point where we can think about actually using the plethora of hardware available from these APIs! Stay tuned for the next in the series where I’ll cover porting from OpenCL’s cl_command_queue to Vulkan’s VkQueue.

29 Sep

Introducing YARI-V – an experiment on SPIR-V compression

SPIR-V is a simple binary intermediate language used for graphics shaders and compute kernels. Wearing my work hat (I work at Codeplay Software Ltd.) I have been contributing to the SPIR-V specification since 2014 as one of the authors. SPIR-V's primary goals are (according to me):

  • Have a regular binary structure.
  • Be easily extendable.
  • Be easy to validate for correctness.
  • Be easy to produce from compiler toolchains.
  • Be easy to consume in tools and drivers.

To this end, one of the things that SPIR-V has not prioritised is the size of the resultant binaries. The awesome @aras_p wrote a great summary of the problem (and his tool SMOL-V) on his blog – SPIR-V Compression. The SMOL-V tool is a single C++ header/single C++ source file.

I’m a big fan of single C header libraries, and was curious if I could write a similar tool to his own, written in C, but try to use my knowledge of SPIR-V to get me a better compression ratio. In my previous blog posts ‘spirv-stats – a tool to output statistics of your SPIR-V shader modules‘ and ‘spirv-stats update – exposing more information‘ I tried to get an in-depth look into what is taking up the most space in the SPIR-V shaders that @aras_p was using for testing.

Then I began writing my own tool for compressing SPIR-V shaders that I'm calling YARI-V (a yari is a type of Japanese spear, which seemed appropriate as a sister encoding to SPEAR-V).

In the remainder of this post I’ll walk you through the steps I took to compress the SPIR-V shaders that @aras_p was using for testing, and compare and contrast the result of my own library YARI-V against SMOL-V.

Test Set

I didn’t have handy access to some real world shaders like @aras_p had for his SMOL-V tool – so I simply used the 341 shaders he uses to test SMOL-V against to test YARI-V against. The total size of the uncompressed shaders is 4868.47 kilobytes, and we’ll use a percentage on this size when evaluating the compression attempts that were made.

Varint Encoding

The first thing I thought to do was use the varint encoding from Google's Protocol Buffers for everything. For the uninitiated, SPIR-V is word based – everything is held in 32 bit values. IDs are in the range [1..N); they start at 1 and increment from there as more IDs are required. This means that for small shaders, most IDs in use are going to be small unsigned integer numbers. Let's take a look at the OpDecorate instruction as an example:

field   word count   opcode   <id> target   decoration   literal*
bytes        2          2          4             4        4 each

As we can see, the opcode is made up of:

  • two bytes for the word count
  • two bytes for the opcode
  • four bytes for the <id> to target this decoration with
  • four bytes for the decoration itself (an enumeration of values in the range [0..N))
  • four bytes for each optional literal (some decorations can take other values)

What I did was take the word count and varint encode that (this value is normally very low for opcodes) – the only opcodes that could have a word count greater than 127 (the magic cutoff to fit within 1 byte using varint encoding) are the ones that take strings (like OpString, OpName, OpMemberName). This meant that word count was taking 1 byte instead of 2 in most cases.

Next, I varint encoded the opcode. Most of the opcodes we use are below the 127 cutoff for varint encoding, so we can encode this as 1 byte instead of 2 again. The worst case to fit within 2 bytes using varint encoding is 16383, and our maximum opcode at present is in the 4000 range, so we shouldn’t ever require more bytes than the original encoding by using varint.

Next, the <id>. In small shaders it would be normal for all <id>’s to be less than 127, but even in large shaders most <id>’s are likely to be lower than the 2 byte boundary for varint – the value 16383. Given that <id> took 4 bytes all the time previously, we are saving at least 2 bytes in nearly all the cases we care about.

The decoration used in OpDecorate currently has [0..44) possible values, so this will always fit inside 1 byte of our varint encoding.

And any literal used by the decoration we’ll just varint and hope that it’ll be worth it.
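
For illustration, here is a minimal sketch of that varint scheme (my own illustrative helper, not the actual YARI-V source): each output byte carries 7 bits of payload, and the top bit marks that another byte follows.

#include <stddef.h>
#include <stdint.h>

// Varint-encode a 32-bit word into at most 5 bytes, returning the number
// of bytes written: 1 byte for values below 128, 2 bytes below 16384, etc.
static size_t varint_encode(uint32_t value, uint8_t out[5]) {
  size_t bytes = 0;
  do {
    uint8_t byte = value & 0x7f;
    value >>= 7;
    if (0 != value) {
      byte |= 0x80; // more bytes follow
    }
    out[bytes++] = byte;
  } while (0 != value);
  return bytes;
}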

After doing this for all opcodes in the SPIR-V shaders, we reduced our SPIR-V shader size to:

2279.97 kilobytes (46.8%)

So a pretty healthy start for reducing the size of the binaries!

OpLabel <-> OpNop

So after varint encoding, the next thing I looked at was the output from my spirv-stats tool. It showed that OpLabel was being used 9915 times in our shaders. OpLabel's opcode value is 248 – which means it requires 2 bytes when varint encoded. So I decided to find another opcode whose value was less than 127, and could thus fit within a 1 byte varint value, that wasn't being used within our shaders. OpNop has the value 0 and is never used within our shaders, so I decided to swap the values of these two during encoding, and then swap them back during decoding.

The next thing I noticed was that OpLabel has a constant word count – the number of words it takes is the same for every use of the opcode. Given that this is constant, we don't need to encode the word count at all, and can simply infer its value during decoding.

2260.61 kilobytes 46.43%

This reduced the size by about 19 kilobytes.

Finding Moar Things to Swap

After realising that swapping <id>’s with greater than 127 value with non-used or little-used smaller than 128 values worked, I decided to go through the next set of most used opcodes (as found from spirv-stats) and swap them, and where possible not encode the word count of the opcode if it was a constant.

I first swapped OpFMul <-> OpSourceContinued, and OpFAdd <-> OpSource:

2246.63 kilobytes 46.15%

Then I swapped OpBranch and OpSourceExtension, not encoding the word count of OpBranch:

2236.58 kilobytes 45.94%

Then I swapped OpFSub and OpUndef, not encoding OpFSub’s word count and also not encoding OpFAdd and OpFMul’s word count too:

2217.26 kilobytes 45.54%

In total another 43 kilobytes shaved off the size!

Delta Encoding

SPIR-V uses a compiler intermediate form known as Static Single Assignment (SSA for short), which means the result of each opcode is assigned to an <id> once, and that <id> is never reassigned. This means that once we go over the 127 value boundary for an <id>, we are going to require 2 bytes for every subsequent <id> to be encoded.

For the most part, <id>'s increase linearly through the program, e.g. for the current opcode we can be quite confident that the previous opcode's result <id> was our <id> - 1. Given this fact, I delta encoded our <id> against the previous known <id>. There was a problem though – what if the previous <id> was actually bigger than our <id>? The subtraction would wrap around to a huge unsigned integer, which would take 5 bytes to encode! To get round this, I used a lovely little bit twiddling hack called zig-zag encoding (used in Google's Protocol Buffers, but explained really well here). Zig-zag encoding allows all integers in the range [-64..64) to be encoded using one byte when combined with our varint encoding, meaning that even if the previous <id> was actually larger than our own, we would still hopefully be able to encode the delta to our own <id> in 1 or 2 bytes (rather than a worst case of 5 bytes).
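
A quick sketch of that zig-zag step (again an illustrative helper rather than the YARI-V source, and assuming the usual arithmetic right shift for signed values): small deltas in either direction map to small unsigned values, which the varint encoding above then squeezes into one byte for the range [-64..64).

#include <stdint.h>

// Interleave negative and positive deltas: 0, -1, 1, -2, 2, ... become
// 0, 1, 2, 3, 4, ... so small magnitudes stay small when varint encoded.
static uint32_t zigzag_encode(int32_t delta) {
  return ((uint32_t)delta << 1) ^ (uint32_t)(delta >> 31);
}

// And the inverse, used when decoding the delta back into an <id>.
static int32_t zigzag_decode(uint32_t value) {
  return (int32_t)(value >> 1) ^ -(int32_t)(value & 1);
}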

I also thought I’d try delta encoding our types separately. In general the <id>’s assigned to types are close to each other in the SPIR-V shaders, because types are all declared in one section at the beginning of the shaders. So I thought by delta encoding the types I’d also get a nice little compression.

2111.88 kilobytes 43.38%

So doing this shaved a lovely 106 kilobytes off of YARI-V encoded size.

Never Encode a Constant Word Count

I’d already shown that not encoding the word count where possible would save us at least 1 byte per opcode we could do this for, so I did a pass over all the opcodes in SPIR-V to not encode the word count for all opcodes where the word count was a constant.

2081.48 kilobytes 42.75%

This shaved a further 30 kilobytes off our YARI-V encoded size.

Fake Opcodes

I was a little disappointed that never encoding the constant word count only knocked 30 kilobytes off of our encoded size – then I realised, the most used opcodes in our SPIR-V shaders are all variable length (as per the specification). But are they really? I added some output to spirv-stats to show when OpLoad and OpStore had the optional additional Memory Access literal – and it turns out exactly 0 of our OpLoad's and OpStore's used this! So for our purposes, OpLoad and OpStore had a constant word count, they just didn't know it.

What I did was split OpLoad into two encodings: OpLoad, and a new fake opcode called OpLoadWithMemoryAccess. I set the value for OpLoadWithMemoryAccess to above 500 (the largest SPIR-V opcode value in use at present is in the low 300's, so I hope this is safe enough for the time being), and then when encoding both our OpLoad and OpLoadWithMemoryAccess opcodes their word counts are constant (4 for OpLoad, 5 for OpLoadWithMemoryAccess). Doing this allowed me to save 1-2 bytes for each use of OpLoad (which accounts for 16% of the opcodes in our SPIR-V shaders!)

2032.74 kilobytes 41.75%

Next I did the same for OpStore, making a new fake opcode OpStoreWithMemoryAccess, and not encoding the now constant word count for OpStore and OpStoreWithMemoryAccess.

2004.19 kilobytes 41.17%

In total shaving 77 kilobytes off of our YARI-V encoded size.

More Decorations

OpDecorate is the third most used opcode with 8.28% of the total opcodes. I added some information to spirv-stats to output how many of the decorations had no literals and how many had one literal (none of the opcodes available today have more than one). 71% of the opcodes have no literals, and 29% have one. So I decided to split OpDecorate into three encodings, one that contains a decoration that has no literals, one that contains a decoration that has exactly one literal, and one that has two or more literals (to future proof the encoder). This allowed me to make all of our uses of OpDecorate have a constant word count, meaning we do not need to encode it. I also swapped these new fake opcodes with OpLine and OpExtension so their <id>’s were less than 127.

1996.46 kilobytes 41.01%

Shaving 8 kilobytes off of the YARI-V encoding.

Moar Member Decorations

Given the success of splitting OpDecorate, I decided to do the same with OpMemberDecorate, which is the sixth most used opcode in our SPIR-V shaders. 90% of the uses of OpMemberDecorate had 1 literal, so I decided I'd split it into three encodings (just like I did with OpDecorate): one that contains a decoration that has no literals, one that contains a decoration that has exactly one literal, and one that has two or more literals. I also swapped these new fake opcodes with OpExtInstImport and OpMemoryModel.

I also noticed that I wasn’t delta encoding the <id>’s for the new fake OpDecorate or OpMemberDecorate variants, so I did that too.

1967.65 kilobytes 40.42%

All of this resulted in shaving a further 29 kilobytes off of our YARI-V encoding.

(Non) Initialised Variables

I added a check to spirv-stats to see if any of the OpVariables we were declaring had initialisers – 0 of them did. So I added a separate encoding for an OpVariable that has an initializer, which meant the common case had a constant word count that I could skip encoding.

1949.05 kilobytes 40.03%

I then applied the same logic to OpConstant – all of our constants were using one word for the actual constant (all of our constants were 32 bit integers and floats), so I could split out the encoding for an OpConstant encoding a 64 bit integer or double into a separate opcode, allowing me to not output the word count of our OpConstant's.

1938.48 kilobytes 39.82%

Shaving 29 kilobytes off of our YARI-V encoding.

Access Chains

To get a pointer into a composite (say an array or struct) we use OpAccessChain to work out what we want to load. I added some information to spirv-stats to output the number of indices being used with OpAccessChain. 78% were using one index (say indexing into an array), 19% were using two indices (used if you were indexing into an array of structs), and 2% were using three indices.

I decided to split OpAccessChain into four encodings: one that contains one index, one that contains two indices, one that contains three indices, and one for all other index combinations. I also swapped these new fake opcodes with OpExecutionMode, OpCapability and OpTypeVoid.

1919.76 kilobytes 39.43%

Shaving 19 kilobytes off of our YARI-V encoding.

Everyday I’m Shuffling

OpVectorShuffle takes 3.4% of the opcodes in the SPIR-V shader module, but 6% of the size of the module (it’s a lot of bytes per opcode hit).

The first thing I noticed was that OpVectorShuffle was working on at most two vec4's (the SPIR-V shaders I'm dealing with are used in Vulkan, where 4 element vectors are the maximum). So I decided to split OpVectorShuffle into four encodings: one that contains two components, one that contains three components, one that contains four components, and one for all other component combinations. I also swapped these with gaps in the SPIR-V opcode range at opcode values 8, 13 & 18.

1910.26 kilobytes 39.24%

Only 9 kilobytes shaved, which wasn't so great. My next observation was that, when shuffling two vec4's together, the maximum number of states each component literal could be in was 9, in the range [-1..8) – where -1 denotes that we'd want an undefined result in that component of the vector. I checked, and none of our encodings of OpVectorShuffle were using -1, so given that all of our literals are less than 8, we can use at most 3 bits to encode each literal! I extended the new OpVectorShuffle encodings I had previously made to encode the literals in at most 2 bytes (1 byte for the two literals case, 2 bytes for the three and four cases).

1892.43 kilobytes 38.87%

I next checked how many of our OpVectorShuffle’s were actually doing a swizzle – EG. they were taking the same vector <id> for both vectors, and were only accessing values from the first vector. A whopping 82% of our OpVectorShuffle’s were doing exactly this, so I added some new fake opcodes for OpVectorSwizzle, using 2 bits to encode each literal (in a swizzle at most 4 elements of a vec4 were being shuffled around, which can be encoded in 2 bits).

1874.17 kilobytes 38.50%

Shaving a cool 36 kilobytes off of our YARI-V encoded size.
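
For the swizzle case, the packing described above might look something like this (a sketch with a hypothetical helper name, not the YARI-V source): four component literals, each in the range [0..4), fit into a single byte at 2 bits apiece.

#include <stdint.h>

// Pack four vec4 swizzle components (each 0, 1, 2 or 3) into one byte.
static uint8_t pack_swizzle(const uint32_t components[4]) {
  return (uint8_t)((components[0] & 0x3) |
                   ((components[1] & 0x3) << 2) |
                   ((components[2] & 0x3) << 4) |
                   ((components[3] & 0x3) << 6));
}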

Swapshop

I noticed that OpBranchConditional and OpSelectionMerge were being used enough that requiring 2 bytes to encode their opcodes was silly, so I swapped these with the CL-specific OpTypeEvent and OpTypeDeviceEvent for a further 8 kilobyte reduction in our YARI-V encoded size:

1868.01 kilobytes 38.37%

Composing

OpCompositeExtract and OpCompositeConstruct take a decent amount of space in the SPIR-V binary with 7% of the bytes dedicated to them.

I first split OpCompositeExtract into two encodings; one that has exactly one literal, and one for all other cases:

1859.28 kilobytes 38.19%

Then I split OpCompositeConstruct into four encodings: one that has one constituent, one that has two constituents, one that has three constituents, and one for all other cases:

1855.46 kilobytes 38.11%

Next, I noticed that OpCompositeExtract was being used mostly to lift a scalar from a vector for some scalar calculation. So I detected when OpCompositeExtract was being used with literals in the range [0..4), and added four encodings of OpCompositeExtract; one that assumes the literal is zero, one that assumes it is one, one that assumes it is two, and one that assumes it is three:

1843.40 kilobytes 37.86%

Relaxing Precisely While Decorating

The most used decoration of OpDecorate was RelaxedPrecision – with 66% of the 23770 uses of the opcode encoding that. So I added a new fake opcode for OpDecorateRelaxedPrecision, allowing me to not actually encode the decoration for RelaxedPrecision and skip the unnecessary byte.

1828.03 kilobytes 37.55%

I then used the same logic on OpMemberDecorate. The most used decoration with OpMemberDecorate was for Offset – accounting for 90% of the 14332 uses of the opcode. I added a new fake opcode for OpMemberDecorateOffset, to skip outputting the decoration in this most used case.

1816.17 kilobytes 37.30%

And with this I was really excited because I’d finally beaten @aras_p‘s SMOL-V (his was taking 1837.88 kilobytes for the shaders).

The Big Plot Twist

One thing I hadn’t been keeping an eye on (showing my newbieness to all things compression) was the compression ratio when passing YARI-V into something like zstd. SMOL-V is primarily a data filtering algorithm – it runs on a SPIR-V shader to create SMOL-V such that the SMOL-V is much more easily compressible than the SPIR-V was. My mistake was I was thinking of YARI-V solely as a compression format, and not as a filtering algorithm.

When I tested running zstd at level 20 encoding on YARI-V versus SMOL-V, YARI-V was taking 440 kilobytes compressed to SMOL-V’s 348 kilobytes! Even though the encoding of YARI-V was smaller than SMOL-V’s, SMOL-V was clearly filtering the data such that it made the compressors life easier.

I had to now work out how to increase repetition in my YARI-V encoding to help out compressors.

Delta Encoding More Things

I had previously only delta encoded the result <id> of my opcodes – but <id>'s are used in the body of the opcodes too. I looked at our three most used opcodes and started there.

For OpLoad and OpStore, I delta encoded the <id> that they were loading/storing from/to. This should result in a 1 byte encoding, as most OpLoad’s and OpStore’s are using the result <id> from an OpAccessChain to work out where to load from, and the access chain is usually the instruction immediately before the load or store.

1810.70 kilobytes 37.19%

Unvarinting Things

By using the varint encoding everywhere, I was being a little overzealous. For example, if we were declaring an OpConstant that was a floating point value, as long as the floating point constant was not denormal it would always take 5 bytes to encode instead of the 4 bytes it would have taken without varint encoding. So I added some logic to detect constants that had any bits set that would result in a 4 byte or larger varint encoding, and just memcpy'd these into the YARI-V encoding.

It turns out that 47% of our constants fitted this pattern, which saved one byte per constant encoded.

1805.71 kilobytes 37.09%
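
The size check itself is simple (a sketch with a hypothetical helper, not the YARI-V source – the real encoding also has to let the decoder know which representation was chosen): a varint spends one byte per 7 bits of payload, so any word with a bit set at position 21 or above would cost 4 or 5 varint bytes and is cheaper stored raw.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Emit a 32-bit constant either as a raw 4-byte copy or as a varint,
// whichever is smaller, returning the number of bytes written.
static size_t emit_constant_word(uint32_t word, uint8_t *out) {
  if (word >= (1u << 21)) {
    memcpy(out, &word, sizeof(word)); // raw copy: always exactly 4 bytes
    return 4;
  }
  size_t bytes = 0;
  do {
    uint8_t byte = word & 0x7f;
    word >>= 7;
    out[bytes++] = (uint8_t)(byte | (0 != word ? 0x80 : 0));
  } while (0 != word);
  return bytes; // 1 to 3 bytes
}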

The Case of the Mistaken Delta Encoded Types

I still wasn’t anywhere near where I needed to be when compressed, and I was having trouble working out why I was still so far away. So I did some analysis of number of bytes taken up by delta encoding both our <id>’s and types. It turned out that our types were being delta encoded to 1 or 2 bytes which seemed reasonable at first glance. But then looking more closely, I realised that, since types are declared at the beginning of the SPIR-V shader modules, they mostly had low <id>’s assigned to them. In most cases the <id>’s were less than 127 for our types. This meant that instead of delta encoding them which was giving us roughly 50/50 for 1/2 byte encodings, if I just used varint encoding without delta encoding too, I’m getting nearly 95% of types encoded in 1 byte.

1675.96 kilobytes 34.42%

Shaving a whopping 125 kilobytes off of our resultant YARI-V encoding!

I then noticed that SMOL-V had an option to strip the non-essential <id>’s from the SPIR-V shader modules. This involves removing debug instructions like OpName, OpMemberName, OpLine, etc. that aren’t required to be present for the SPIR-V to function correctly.

I added in my own option on encoding to handle stripping of the debug instructions, which resulted in:

1498.67 kilobytes 32.41%

Which is a further 176 kilobytes smaller than our non-stripped YARI-V encoding, and 130 kilobytes smaller than SMOL-V’s equivalent stripped encoding.

At this stage I thought I was golden, I’d cracked the puzzle and obviously my YARI-V encoding would be smaller than SMOL-V? Oh how I was wrong!

Results

Approach                    Size (kilobytes)   Compression (%)
SPIR-V                      4868.468750        100.000000%
SPIR-V + zstd20              590.573242         12.130575%
SMOL-V                      1837.881836         37.750717%
SMOL-V + zstd20              386.879883          7.946644%
YARI-V                      1675.956055         34.424706%
YARI-V + zstd20              390.577148          8.022587%
SMOL-V(stripped)            1629.115234         33.462580%
SMOL-V(stripped) + zstd20    348.073242          7.149542%
YARI-V(stripped)            1498.666016         30.783108%
YARI-V(stripped) + zstd20    364.057617          7.477867%

Getting down to brass tacks: SMOL-V, with stripping, and then fed through zstd at level 20, is 16 kilobytes smaller than the equivalent YARI-V, stripped and fed through zstd at level 20, even though YARI-V is 130 kilobytes smaller than SMOL-V when comparing the results of YARI-V against SMOL-V directly.

My main suspicion is that my approach of creating new fake opcodes, and thus allowing me to avoid outputting the word count, is probably wrong. Only 9.74% of my opcodes required a word count in the end – but across the 286932 opcodes used in the input SPIR-V shaders this means that 27956 opcodes were using the more expensive approach to encoding our word count.

My other main idea is that it would seem that SMOL-V is creating a binary stream that zstd can more easily work out how to compress – more sequences of bits must be the same in SMOL-V as compared to YARI-V. I think my approach of simply trying to compress the input SPIR-V into as concise a form as I could with YARI-V meant I lost a little of the big picture that this should have been more of a filtering step on the SPIR-V rather than a compression algorithm in its own right.

Future Work

One thing I’d like to look at is if I could remap the <id>’s in the SPIR-V (guarded by an option) such that we could increase the delta encoding success rate. At present our delta encoded IDs take up:

bytes   percentage of opcodes
1       58.180900%
2       27.674441%
3       14.144659%
4        0.000000%
5        0.000000%

At present 58% of the times we delta encode we get a value that will fit within 1 byte of our zig-zagged varint encoding. 27% fits within 2 bytes, and then 14% in 3 bytes. I think the best place to start would be to try and decrease the number of times a 3 byte encoding was required, and try to map <id>’s for locality.

A great example of where this would be useful is with OpConstant’s. OpConstant’s are declared early in the SPIR-V shader module and are therefore generally given a low <id>. But they tend to be used in the body of the functions, which occurs much later on. If an OpConstant was used by an OpFMul, it would be awesome if we could make the <id> of the OpConstant close to the <id> of OpFMul to increase our chances of a 1 byte delta encoding.

Getting YARI-V

YARI-V is available on GitHub licensed under the unlicense. I hope the code is useful to someone, even though I fell short of my aim I very much enjoyed the journey of trying.

29 May

A simple Vulkan Compute example

With all the buzz surrounding Vulkan and its ability to make graphics more shiny/pretty/fast, there is one key thing that seems to have been lost in the ether of information – Vulkan isn't just a graphics API, it supports compute too! Quoting the specification (bold added for effect):

Vulkan is an API (Application Programming Interface) for graphics and compute hardware

And:

This specification defines four types of functionality that queues may support: graphics, compute, transfer, and sparse memory management.

We can see that, through how carefully the language is crafted, Vulkan not only supports compute – there are cases where a Vulkan driver could legally expose only compute.

In this vein, I’ve put together a simple Vulkan compute sample – VkComputeSample. The sample:

  • allocates two buffers
  • fills them with random data
  • creates a compute shader that will memcpy from one buffer to the other
  • then checks that the data copied over successfully

Key Vulkan principles covered:

  • creating a device and queue for compute only
  • allocating memories and buffers from them
  • writing a simple compute shader
  • executing the compute shader
  • getting the results

So without further ado, let us begin.

creating a device and queue for compute only

Vulkan has a ton of boilerplate code you need to use to get ready for action.

First up we need a VkInstance. To get this, we need to look at two of Vulkan’s structs – VkApplicationInfo and VkInstanceCreateInfo:

typedef struct VkApplicationInfo {
    VkStructureType    sType;
    const void*        pNext;
    const char*        pApplicationName;
    uint32_t           applicationVersion;
    const char*        pEngineName;
    uint32_t           engineVersion;
    uint32_t           apiVersion; // care about this
} VkApplicationInfo;

typedef struct VkInstanceCreateInfo {
    VkStructureType             sType;
    const void*                 pNext;
    VkInstanceCreateFlags       flags;
    const VkApplicationInfo*    pApplicationInfo; // care about this
    uint32_t                    enabledLayerCount;
    const char* const*          ppEnabledLayerNames;
    uint32_t                    enabledExtensionCount;
    const char* const*          ppEnabledExtensionNames;
} VkInstanceCreateInfo;

I’ve flagged the only two fields we really need to care about here – apiVersion and pApplicationInfo. The most important is apiVersion, which lets us record in the code exactly which version of the Vulkan specification we wrote our application against.

Why is this important you ask?

  1. It helps future you. You’ll know which version of the specification to look at.
  2. It allows the validation layer to understand which version of Vulkan you think you are interacting with, and potentially flag up any cross version issues between your application and the drivers you are interacting with.

I recommend you always at least provide an apiVersion.

pApplicationInfo is the easier of the two to justify – it needs to point to a valid VkApplicationInfo if you want to specify an apiVersion, which I again highly recommend you do.
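
Putting those two structs together, a minimal sketch of the instance creation might look like the following (the application name here is a placeholder, and I’m targeting Vulkan 1.0 via VK_API_VERSION_1_0):

VkApplicationInfo applicationInfo = {
  VK_STRUCTURE_TYPE_APPLICATION_INFO,
  0,
  "VkComputeSample", // placeholder application name
  0,
  0,                 // no engine name
  0,
  VK_API_VERSION_1_0 // the version of Vulkan we wrote the code against
};

VkInstanceCreateInfo instanceCreateInfo = {
  VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
  0,
  0,
  &applicationInfo,
  0,
  0,
  0,
  0,
};

VkInstance instance;
if (VK_SUCCESS != vkCreateInstance(&instanceCreateInfo, 0, &instance)) {
  // ... error!
}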

Next, we need to get all the physical devices the instance can interact with:

uint32_t physicalDeviceCount = 0;
vkEnumeratePhysicalDevices(instance, &physicalDeviceCount, 0);

VkPhysicalDevice* const physicalDevices = (VkPhysicalDevice*)malloc(
   sizeof(VkPhysicalDevice) * physicalDeviceCount);

vkEnumeratePhysicalDevices(
  instance, &physicalDeviceCount, physicalDevices);

We do this by using a pair of vkEnumeratePhysicalDevices calls – one to get the number of physical devices the instance knows about, and one to fill a newly created array with handles to these physical devices.

For the purposes of the sample, I iterate through these physical devices and run my sample on each of the physical devices present in the system – but for a ‘real-world application’ you’d want to find which device best suits your workload by using vkGetPhysicalDeviceFeatures, vkGetPhysicalDeviceFormatProperties, vkGetPhysicalDeviceImageFormatProperties, vkGetPhysicalDeviceProperties, vkGetPhysicalDeviceQueueFamilyProperties and vkGetPhysicalDeviceMemoryProperties.

For each physical device we need to find a queue family for that physical device which can work for compute:

uint32_t queueFamilyPropertiesCount = 0;
vkGetPhysicalDeviceQueueFamilyProperties(
  physicalDevice, &queueFamilyPropertiesCount, 0);

VkQueueFamilyProperties* const queueFamilyProperties =
  (VkQueueFamilyProperties*)malloc(
    sizeof(VkQueueFamilyProperties) * queueFamilyPropertiesCount);

vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice,
  &queueFamilyPropertiesCount, queueFamilyProperties);

We do this by using a pair of calls to vkGetPhysicalDeviceQueueFamilyProperties, the first to get the number of queue families available, and the second to fill an array of information about our queue families. In each queue family:

typedef struct VkQueueFamilyProperties {
    VkQueueFlags    queueFlags; // care about this
    uint32_t        queueCount;
    uint32_t        timestampValidBits;
    VkExtent3D      minImageTransferGranularity;
} VkQueueFamilyProperties;

We care about the queueFlags member, which specifies what workloads can execute on a particular queue. A naive way to do this would be to find any queue that could handle compute workloads. A better approach would be to find a queue that only handled compute workloads (but you need to ignore the transfer bit, and for our purposes the sparse binding bit too).
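
A sketch of that search over the queueFamilyProperties array we just filled in might look like this (using UINT32_MAX to mean ‘not found yet’):

uint32_t queueFamilyIndex = UINT32_MAX;

for (uint32_t i = 0; i < queueFamilyPropertiesCount; i++) {
  // ignore the transfer and sparse binding bits when classifying the family
  const VkQueueFlags maskedFlags = queueFamilyProperties[i].queueFlags &
    ~(VK_QUEUE_TRANSFER_BIT | VK_QUEUE_SPARSE_BINDING_BIT);

  if ((VK_QUEUE_COMPUTE_BIT & maskedFlags) &&
      !(VK_QUEUE_GRAPHICS_BIT & maskedFlags)) {
    // a compute-only queue family - prefer this one and stop looking
    queueFamilyIndex = i;
    break;
  }

  if ((UINT32_MAX == queueFamilyIndex) && (VK_QUEUE_COMPUTE_BIT & maskedFlags)) {
    // fall back to any queue family that supports compute
    queueFamilyIndex = i;
  }
}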

Once we have a valid index into our queueFamilyProperties array we allocated, we need to keep this index around – it becomes our queue family index used in various other places of the API.

Next up, create the device:

typedef struct VkDeviceQueueCreateInfo {
    VkStructureType             sType;
    const void*                 pNext;
    VkDeviceQueueCreateFlags    flags;
    uint32_t                    queueFamilyIndex; // care about this
    uint32_t                    queueCount;
    const float*                pQueuePriorities;
} VkDeviceQueueCreateInfo;

typedef struct VkDeviceCreateInfo {
    VkStructureType                    sType;
    const void*                        pNext;
    VkDeviceCreateFlags                flags;
    uint32_t                           queueCreateInfoCount; // care about this
    const VkDeviceQueueCreateInfo*     pQueueCreateInfos;    // care about this
    uint32_t                           enabledLayerCount;
    const char* const*                 ppEnabledLayerNames;
    uint32_t                           enabledExtensionCount;
    const char* const*                 ppEnabledExtensionNames;
    const VkPhysicalDeviceFeatures*    pEnabledFeatures;
} VkDeviceCreateInfo;

The queue family index we just worked out goes in the queueFamilyIndex member of our VkDeviceQueueCreateInfo struct, and our VkDeviceCreateInfo has queueCreateInfoCount set to 1, with pQueueCreateInfos pointing at our single VkDeviceQueueCreateInfo struct.
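
A minimal sketch of that device creation (the single queue priority value here is an arbitrary choice):

const float queuePriority = 1.0f;

VkDeviceQueueCreateInfo deviceQueueCreateInfo = {
  VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
  0,
  0,
  queueFamilyIndex, // the queue family index we found above
  1,                // one queue from that family
  &queuePriority,
};

VkDeviceCreateInfo deviceCreateInfo = {
  VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
  0,
  0,
  1,
  &deviceQueueCreateInfo,
  0,
  0,
  0,
  0,
  0,
};

VkDevice device;
if (VK_SUCCESS != vkCreateDevice(
    physicalDevice, &deviceCreateInfo, 0, &device)) {
  // ... error!
}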

Lastly we get our device’s queue using:

VkQueue queue;
vkGetDeviceQueue(device, queueFamilyIndex, 0, &queue);

Et voilà, we have our device, we have our queue, and we are done (with getting our device and queue at least).

allocating memories and buffers from them

To allocate buffers for use in our compute shader, we first have to allocate memory that backs the buffer – the physical location of the buffer on the device. Vulkan supports many different memory types, so we need to query for a memory type that matches our requirements. We do this via a call to vkGetPhysicalDeviceMemoryProperties, and then find a memory type that has the properties we require and is big enough for our uses:

VkPhysicalDeviceMemoryProperties properties; // previously queried via vkGetPhysicalDeviceMemoryProperties
VkDeviceSize memorySize; // whatever size of memory we require

uint32_t memoryTypeIndex = UINT32_MAX;

for (uint32_t k = 0; k < properties.memoryTypeCount; k++) {
  const VkMemoryType memoryType = properties.memoryTypes[k];

  if ((VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT & memoryType.propertyFlags)
  && (VK_MEMORY_PROPERTY_HOST_COHERENT_BIT & memoryType.propertyFlags)
  && (memorySize < properties.memoryHeaps[memoryType.heapIndex].size)) {
    // found our memory type!
    memoryTypeIndex = k;
    break;
  }
}

If we know how big a memory we require, we can find an index into our VkPhysicalDeviceMemoryProperties struct that has the properties we require set and is big enough. For the sample I’m using memory that is host visible and host coherent (for ease of sample writing).

With the memory type index we found above we can allocate a memory:

typedef struct VkMemoryAllocateInfo {
    VkStructureType    sType;
    const void*        pNext;
    VkDeviceSize       allocationSize;
    uint32_t           memoryTypeIndex; // care about this
} VkMemoryAllocateInfo;

We need to care about the memoryTypeIndex – which we’ll set to the index we worked out from VkPhysicalDeviceMemoryProperties before.

For the sample, I allocate one memory, and then subdivide it into two buffers. We create two storage buffers (using VK_BUFFER_USAGE_STORAGE_BUFFER_BIT), and since the buffers will only ever be accessed from our single queue family, the sharing mode is VK_SHARING_MODE_EXCLUSIVE. Lastly we need to specify which queue families these buffers will be used with – in our case it’s the one queueFamilyIndex we discovered at the start.
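
A rough sketch of the allocation and buffer creation described above (assuming bufferSize is half of memorySize, and using in_buffer and out_buffer as the names that appear in the binding code below):

VkMemoryAllocateInfo memoryAllocateInfo = {
  VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
  0,
  memorySize,      // the size we searched the memory types with
  memoryTypeIndex, // the memory type index we found above
};

VkDeviceMemory memory;
if (VK_SUCCESS != vkAllocateMemory(device, &memoryAllocateInfo, 0, &memory)) {
  // ... error!
}

const VkDeviceSize bufferSize = memorySize / 2;

VkBufferCreateInfo bufferCreateInfo = {
  VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
  0,
  0,
  bufferSize,
  VK_BUFFER_USAGE_STORAGE_BUFFER_BIT,
  VK_SHARING_MODE_EXCLUSIVE,
  1,
  &queueFamilyIndex,
};

VkBuffer in_buffer;
if (VK_SUCCESS != vkCreateBuffer(device, &bufferCreateInfo, 0, &in_buffer)) {
  // ... error!
}

VkBuffer out_buffer;
if (VK_SUCCESS != vkCreateBuffer(device, &bufferCreateInfo, 0, &out_buffer)) {
  // ... error!
}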

The link between our memories and our buffers is vkBindBufferMemory:

vkBindBufferMemory(device, in_buffer, memory, 0);
vkBindBufferMemory(device, out_buffer, memory, bufferSize);

The crucial parameter for us to use the one memory for two buffers is the last one – memoryOffset. For our second buffer we set it to begin after the first buffer has ended. Since we are creating storage buffers, we need to be sure that our memoryOffset is a multiple of the minStorageBufferOffsetAlignment member of the VkPhysicalDeviceLimits struct. For the purposes of the sample, we choose a memory size that is a large power of two, satisfying the alignment requirements on our target platforms.

The last thing we can do is fill the memory with some initial random data. To do this we map the memory, write to it, and unmap, prior to using the memory in any queue:

VkDeviceSize memorySize; // whatever size of memory we require

int32_t *payload;
if (VK_SUCCESS != vkMapMemory(
    device, memory, 0, memorySize, 0, (void **)&payload)) {
  // ... error!
}

for (uint32_t k = 0; k < memorySize / sizeof(int32_t); k++) {
  payload[k] = rand();
}

vkUnmapMemory(device, memory);

And that is it – we have our memory and buffers, filled with data and ready for use later.

writing a simple compute shader

My job with Codeplay is to work on the Vulkan specification with the Khronos group. My real passion within this is making compute awesome. I spend a good amount of my time working on Vulkan compute but also on SPIR-V for Vulkan. I’ve never been a happy user of GLSL compute shaders – and luckily now I don’t have to use them!

For the purposes of the sample, I’ve hand written a little compute shader to copy from a storage buffer (set = 0, binding = 0) to another storage buffer (set = 0, binding = 1). As to the details of my approach, I’ll leave that to a future blog post (it’d be a lengthy sidetrack for this post I fear).

To create a compute pipeline that we can execute with, we first create a shader module with vkCreateShaderModule. Next we need a descriptor set layout using vkCreateDescriptorSetLayout, with the following structs:

VkDescriptorSetLayoutBinding descriptorSetLayoutBindings[2] = {
  {0, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 1, VK_SHADER_STAGE_COMPUTE_BIT, 0},
  {1, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 1, VK_SHADER_STAGE_COMPUTE_BIT, 0}
};

VkDescriptorSetLayoutCreateInfo descriptorSetLayoutCreateInfo = {
  VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
  0, 0, 2, descriptorSetLayoutBindings
};

We are describing the bindings within the set we are using for our compute shader, namely we have two descriptors in the set, both of which are storage buffers being used in a compute shader.
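
The call itself is then simply (keeping hold of the resulting descriptorSetLayout handle for the next step):

VkDescriptorSetLayout descriptorSetLayout;
if (VK_SUCCESS != vkCreateDescriptorSetLayout(
    device, &descriptorSetLayoutCreateInfo, 0, &descriptorSetLayout)) {
  // ... error!
}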

We then use vkCreatePipelineLayout to create our pipeline layout:

typedef struct VkPipelineLayoutCreateInfo {
    VkStructureType                 sType;
    const void*                     pNext;
    VkPipelineLayoutCreateFlags     flags;
    uint32_t                        setLayoutCount; // care about this
    const VkDescriptorSetLayout*    pSetLayouts;    // care about this
    uint32_t                        pushConstantRangeCount;
    const VkPushConstantRange*      pPushConstantRanges;
} VkPipelineLayoutCreateInfo;

Since we have only one descriptor set, we set setLayoutCount to 1, and pSetLayouts to the descriptor set layout (the one with our two bindings) that we created before.
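
A sketch of that call, using the descriptorSetLayout we created above:

VkPipelineLayoutCreateInfo pipelineLayoutCreateInfo = {
  VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO,
  0,
  0,
  1,
  &descriptorSetLayout,
  0,
  0,
};

VkPipelineLayout pipelineLayout;
if (VK_SUCCESS != vkCreatePipelineLayout(
    device, &pipelineLayoutCreateInfo, 0, &pipelineLayout)) {
  // ... error!
}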

And then lastly we use vkCreateComputePipelines to create our compute pipeline:

VkComputePipelineCreateInfo computePipelineCreateInfo = {
  VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO,
  0, 0,
  {
    VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO,
    0, 0, VK_SHADER_STAGE_COMPUTE_BIT, shader_module, "f", 0
  },
  pipelineLayout, 0, 0
};

Our shader module has a single entry point called “f”, and it is a compute shader. We also need the pipeline layout we just created, and et voilà – we have our compute pipeline ready to execute with.

executing the compute shader

To execute a compute shader we need to:

  1. Create a descriptor set (from a descriptor pool) with two storage buffer descriptors – one for each binding in the compute shader.
  2. Update the descriptor set, pointing each binding at one of the VkBuffer’s we created earlier via a VkDescriptorBufferInfo.
  3. Create a command pool with our queue family index.
  4. Allocate a command buffer from the command pool (when we begin it we’ll use VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT, as we aren’t resubmitting the buffer in our sample).
  5. Begin the command buffer.
  6. Bind our compute pipeline.
  7. Bind our descriptor set at the VK_PIPELINE_BIND_POINT_COMPUTE.
  8. Dispatch a compute shader for each element of our buffer.
  9. End the command buffer.
  10. And submit it to the queue (see the sketch below)!
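
A rough sketch of steps 5 to 10 – it assumes the commandBuffer, pipeline, pipelineLayout and descriptorSet handles have been created as described in steps 1 to 4, and that the compute shader uses a local workgroup size of 1 (so we dispatch one workgroup per buffer element):

VkCommandBufferBeginInfo commandBufferBeginInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
  0,
  VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT,
  0,
};

if (VK_SUCCESS != vkBeginCommandBuffer(commandBuffer, &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);

vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE,
  pipelineLayout, 0, 1, &descriptorSet, 0, 0);

vkCmdDispatch(commandBuffer, (uint32_t)(bufferSize / sizeof(int32_t)), 1, 1);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer)) {
  // ... error!
}

VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,
  0,
  0,
  0,
  1,
  &commandBuffer,
  0,
  0,
};

if (VK_SUCCESS != vkQueueSubmit(queue, 1, &submitInfo, 0)) {
  // ... error!
}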

getting the results

To get the results from a submitted command buffer, the coarse way is to use vkQueueWaitIdle – wait for all command buffers submitted to a queue to complete. For our purposes we are submitting a single command buffer to a single queue and waiting for it to complete, so it is the perfect tool for our sample – but broadly speaking you are better off chaining dependent submissions together with VkSemaphore’s, and using a VkFence only on the final submission of the workload to ensure execution has completed.
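
For completeness, the wait itself is just:

if (VK_SUCCESS != vkQueueWaitIdle(queue)) {
  // ... error!
}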

Once we’ve waited on the queue, we simply map the memory and check that the second half of the memory matches the first half – i.e. the memcpy of the elements succeeded:

int32_t *payload;
if (VK_SUCCESS != vkMapMemory(
    device, memory, 0, memorySize, 0, (void **)&payload)) {
  // ... error!
}

for (uint32_t k = 0, e = bufferSize / sizeof(int32_t); k < e; k++) {
  assert(payload[k + e] == payload[k]);
}

And we are done! We have written our first memcpy sample in Vulkan compute shaders.

fin

The sample is dirty in ‘real-world application’ terms – it doesn’t free any of the Vulkan objects that need to be freed on completion. TL;DR: one of the drivers I am testing on loves to segfault on perfectly valid code (and yes, for any IHVs reading this, I have already flagged it up with the relevant vendor!).

But for the compute lovers among my readership, I hope the above gives you a good overview of how to put together a simple Vulkan compute sample – yes, there are many hoops to jump through to get something executing, but the sheer level of control the Vulkan API gives you far outweighs the few extra lines of code required.

The full sample is available at the GitHub gist here.

Stay tuned for more Vulkan compute examples to come in future posts!