With all the buzz surrounding Vulkan and its ability to make graphics more shiny/pretty/fast, there is one key thing that seems to have been lost in the ether of information - Vulkan isn’t just a graphics API, it supports compute too! Quoting the specification:

Vulkan is an API (Application Programming Interface) for graphics and compute hardware

And:

This specification defines four types of functionality that queues may support: graphics, compute, transfer, and sparse memory management.

The language here is carefully crafted: not only is Vulkan allowed to support compute, there are cases where a Vulkan driver could expose only compute.

In this vein, I’ve put together a simple Vulkan compute sample - VkComputeSample. The sample:

  • allocates two buffers
  • fills them with random data
  • creates a compute shader that will memcpy from one buffer to the other
  • then checks that the data copied over successfully

Key Vulkan principles covered:

  • creating a device and queue for compute only
  • allocating memories and buffers from them
  • writing a simple compute shader
  • executing the compute shader
  • getting the results

So without further ado, let us begin.

creating a device and queue for compute only

Vulkan has a ton of boilerplate code you need to use to get ready for action.

First up we need a VkInstance. To get this, we need to look at two of Vulkan’s structs - VkApplicationInfo and VkInstanceCreateInfo:

typedef struct VkApplicationInfo {
    VkStructureType    sType;
    const void*        pNext;
    const char*        pApplicationName;
    uint32_t           applicationVersion;
    const char*        pEngineName;
    uint32_t           engineVersion;
    uint32_t           apiVersion; // care about this
} VkApplicationInfo;

typedef struct VkInstanceCreateInfo {
    VkStructureType             sType;
    const void*                 pNext;
    VkInstanceCreateFlags       flags;
    const VkApplicationInfo*    pApplicationInfo; // care about this
    uint32_t                    enabledLayerCount;
    const char* const*          ppEnabledLayerNames;
    uint32_t                    enabledExtensionCount;
    const char* const*          ppEnabledExtensionNames;
} VkInstanceCreateInfo;

I’ve flagged the only two fields we really need to care about here - apiVersion and pApplicationInfo. The most important is apiVersion: it lets us write our application against a specific version of the Vulkan specification, and record exactly which version that was within the code.

Why is this important you ask?

  1. It helps future you. You’ll know which version of the specification to look at.
  2. It allows the validation layer to understand which version of Vulkan you think you are interacting with, and potentially flag up any cross version issues between your application and the drivers you are interacting with.

I recommend you always at least provide an apiVersion.

pApplicationInfo is easy to justify - you need it to point to a valid VkApplicationInfo if you want to specify an apiVersion, which again I highly recommend you do.
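Putting the two structs together, creating the instance looks something like this (a minimal sketch targeting Vulkan 1.0, with no layers or extensions enabled and error checking omitted):

const VkApplicationInfo applicationInfo = {
  VK_STRUCTURE_TYPE_APPLICATION_INFO, 0,
  "VkComputeSample", 0, 0, 0,
  VK_MAKE_VERSION(1, 0, 0) // the apiVersion we wrote the application against
};

const VkInstanceCreateInfo instanceCreateInfo = {
  VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO, 0, 0,
  &applicationInfo, // pApplicationInfo, carrying our apiVersion
  0, 0, 0, 0        // no layers or extensions
};

VkInstance instance;
vkCreateInstance(&instanceCreateInfo, 0, &instance);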

Next, we need to get all the physical devices the instance can interact with:

uint32_t physicalDeviceCount = 0;
vkEnumeratePhysicalDevices(instance, &physicalDeviceCount, 0);

VkPhysicalDevice* const physicalDevices = (VkPhysicalDevice*)malloc(
   sizeof(VkPhysicalDevice) * physicalDeviceCount);

vkEnumeratePhysicalDevices(
  instance, &physicalDeviceCount, physicalDevices);

We do this by using a pair of vkEnumeratePhysicalDevices calls - one to get the number of physical devices the instance knows about, and one to fill a newly created array with handles to these physical devices.

For the purposes of the sample, I iterate through these physical devices and run my sample on each of the physical devices present in the system - but for a ‘real-world application’ you’d want to find which device best suits your workload by using vkGetPhysicalDeviceFeatures, vkGetPhysicalDeviceFormatProperties, vkGetPhysicalDeviceImageFormatProperties, vkGetPhysicalDeviceProperties, vkGetPhysicalDeviceQueueFamilyProperties and vkGetPhysicalDeviceMemoryProperties.

For each physical device we need to find a queue family for that physical device which can work for compute:

uint32_t queueFamilyPropertiesCount = 0;
vkGetPhysicalDeviceQueueFamilyProperties(
  physicalDevice, &queueFamilyPropertiesCount, 0);

VkQueueFamilyProperties* const queueFamilyProperties =
  (VkQueueFamilyProperties*)malloc(
    sizeof(VkQueueFamilyProperties) * queueFamilyPropertiesCount);

vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice,
  &queueFamilyPropertiesCount, queueFamilyProperties);

We do this by using a pair of calls to vkGetPhysicalDeviceQueueFamilyProperties, the first to get the number of queue families available, and the second to fill an array of information about our queue families. In each queue family:

typedef struct VkQueueFamilyProperties {
    VkQueueFlags    queueFlags; // care about this
    uint32_t        queueCount;
    uint32_t        timestampValidBits;
    VkExtent3D      minImageTransferGranularity;
} VkQueueFamilyProperties;

We care about the queueFlags member, which specifies what workloads can execute on a particular queue. A naive approach would be to find any queue family that can handle compute workloads. A better approach is to find a queue family that only handles compute workloads (though you need to ignore the transfer bit, and for our purposes the sparse binding bit too).

Once we have a valid index into our queueFamilyProperties array we allocated, we need to keep this index around - it becomes our queue family index used in various other places of the API.
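A sketch of that search - first preferring a compute-only queue family (masking off the transfer and sparse binding bits as described above), then falling back to any family that supports compute:

uint32_t queueFamilyIndex = UINT32_MAX;

// First pass: look for a compute-only queue family.
for (uint32_t i = 0; i < queueFamilyPropertiesCount; i++) {
  const VkQueueFlags maskedFlags =
    ~(VK_QUEUE_TRANSFER_BIT | VK_QUEUE_SPARSE_BINDING_BIT) &
    queueFamilyProperties[i].queueFlags;

  if (!(VK_QUEUE_GRAPHICS_BIT & maskedFlags) &&
      (VK_QUEUE_COMPUTE_BIT & maskedFlags)) {
    queueFamilyIndex = i;
    break;
  }
}

// Second pass: settle for any queue family that supports compute.
if (UINT32_MAX == queueFamilyIndex) {
  for (uint32_t i = 0; i < queueFamilyPropertiesCount; i++) {
    if (VK_QUEUE_COMPUTE_BIT & queueFamilyProperties[i].queueFlags) {
      queueFamilyIndex = i;
      break;
    }
  }
}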

Next up, create the device:

typedef struct VkDeviceQueueCreateInfo {
    VkStructureType             sType;
    const void*                 pNext;
    VkDeviceQueueCreateFlags    flags;
    uint32_t                    queueFamilyIndex; // care about this
    uint32_t                    queueCount;
    const float*                pQueuePriorities;
} VkDeviceQueueCreateInfo;

typedef struct VkDeviceCreateInfo {
    VkStructureType                    sType;
    const void*                        pNext;
    VkDeviceCreateFlags                flags;
    uint32_t                           queueCreateInfoCount; // care about this
    const VkDeviceQueueCreateInfo*     pQueueCreateInfos;    // care about this
    uint32_t                           enabledLayerCount;
    const char* const*                 ppEnabledLayerNames;
    uint32_t                           enabledExtensionCount;
    const char* const*                 ppEnabledExtensionNames;
    const VkPhysicalDeviceFeatures*    pEnabledFeatures;
} VkDeviceCreateInfo;

The queue family index we just worked out goes in our VkDeviceQueueCreateInfo struct’s queueFamilyIndex member, and our VkDeviceCreateInfo has queueCreateInfoCount set to 1, with pQueueCreateInfos set to the address of our single VkDeviceQueueCreateInfo struct.
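In code, a minimal version of that (one queue of default priority, with no layers, extensions, or features enabled) might look like:

const float queuePriority = 1.0f;

const VkDeviceQueueCreateInfo deviceQueueCreateInfo = {
  VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO, 0, 0,
  queueFamilyIndex, // the queue family index we just found
  1, &queuePriority
};

const VkDeviceCreateInfo deviceCreateInfo = {
  VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO, 0, 0,
  1, &deviceQueueCreateInfo, // our single queue create info
  0, 0, 0, 0, 0
};

VkDevice device;
vkCreateDevice(physicalDevice, &deviceCreateInfo, 0, &device);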

Lastly we get our device’s queue using:

VkQueue queue;
vkGetDeviceQueue(device, queueFamilyIndex, 0, &queue);

Et voilà, we have our device, we have our queue, and we are done (with getting our device and queue at least).

allocating memories and buffers from them

To allocate buffers for use in our compute shader, we first have to allocate the memory that backs the buffer - the physical location of the buffer for the device. Vulkan supports many different memory types, so we need to query for a memory type that matches our requirements. We do this with a call to vkGetPhysicalDeviceMemoryProperties, then find a memory type that has the properties we require and is big enough for our uses:

VkDeviceSize memorySize; // whatever size of memory we require

uint32_t memoryTypeIndex = VK_MAX_MEMORY_TYPES;

for (uint32_t k = 0; k < properties.memoryTypeCount; k++) {
  const VkMemoryType memoryType = properties.memoryTypes[k];

  if ((VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT & memoryType.propertyFlags)
  && (VK_MEMORY_PROPERTY_HOST_COHERENT_BIT & memoryType.propertyFlags)
  && (memorySize < properties.memoryHeaps[memoryType.heapIndex].size)) {
    // found our memory type!
    memoryTypeIndex = k;
    break;
  }
}

If we know how big a memory we require, we can find an index in our VkPhysicalDeviceMemoryProperties struct that has the properties we require set, and is big enough. For the sample I’m using memory that is host visible and host coherent (for ease of sample writing).

With the memory type index we found above we can allocate a memory:

typedef struct VkMemoryAllocateInfo {
    VkStructureType    sType;
    const void*        pNext;
    VkDeviceSize       allocationSize;
    uint32_t           memoryTypeIndex; // care about this
} VkMemoryAllocateInfo;

We need to care about the memoryTypeIndex - which we’ll set to the index we worked out from VkPhysicalDeviceMemoryProperties before.
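The allocation itself is then a single call - a sketch assuming the memorySize and memoryTypeIndex we worked out above:

const VkMemoryAllocateInfo memoryAllocateInfo = {
  VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO, 0,
  memorySize,     // the size of memory we require
  memoryTypeIndex // the index we found above
};

VkDeviceMemory memory;
vkAllocateMemory(device, &memoryAllocateInfo, 0, &memory);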

For the sample, I allocate one memory, and then subdivide it into two buffers. We create two storage buffers (using VK_BUFFER_USAGE_STORAGE_BUFFER_BIT), and since we do not intend to use overlapping regions of memory for the buffers, our sharing mode is VK_SHARING_MODE_EXCLUSIVE. Lastly we need to specify which queue families these buffers will be used with - in our case it’s the one queueFamilyIndex we discovered at the start.
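A sketch of creating the two buffers, each occupying half of the memory (in_buffer and out_buffer match the names used in the bind calls below):

const VkDeviceSize bufferSize = memorySize / 2;

const VkBufferCreateInfo bufferCreateInfo = {
  VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO, 0, 0,
  bufferSize,
  VK_BUFFER_USAGE_STORAGE_BUFFER_BIT,
  VK_SHARING_MODE_EXCLUSIVE,
  1, &queueFamilyIndex // the one queue family that will use these buffers
};

VkBuffer in_buffer;
vkCreateBuffer(device, &bufferCreateInfo, 0, &in_buffer);

VkBuffer out_buffer;
vkCreateBuffer(device, &bufferCreateInfo, 0, &out_buffer);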

The link between our memories and our buffers is vkBindBufferMemory:

vkBindBufferMemory(device, in_buffer, memory, 0);
vkBindBufferMemory(device, out_buffer, memory, bufferSize);

The crucial parameter for us to use the one memory for two buffers is the last one - memoryOffset. For our second buffer we set it to begin after the first buffer has ended. Since we are creating storage buffers, we need to be sure that our memoryOffset is a multiple of the minStorageBufferOffsetAlignment member of the VkPhysicalDeviceLimits struct. For the purposes of the sample, we choose a memory size that is a large power of two, satisfying the alignment requirements on our target platforms.

The last thing to do is fill the memory with some initial random data. To do this we map the memory, write to it, and unmap it, prior to using the memory in any queue:

VkDeviceSize memorySize; // whatever size of memory we require

int32_t *payload;
vkMapMemory(device, memory, 0, memorySize, 0, (void **)&payload);

for (uint32_t k = 0; k < memorySize / sizeof(int32_t); k++) {
  payload[k] = rand();
}

vkUnmapMemory(device, memory);

And that is it, we have our memory and buffers ready for use later.

writing a simple compute shader

My job with Codeplay is to work on the Vulkan specification with the Khronos Group. My real passion within this is making compute awesome. I spend a good amount of my time working on Vulkan compute, but also on SPIR-V for Vulkan. I’ve never been a happy user of GLSL compute shaders - and luckily now I don’t have to use them!

For the purposes of the sample, I’ve hand written a little compute shader to copy from a storage buffer (set = 0, binding = 0) to another storage buffer (set = 0, binding = 1). As to the details of my approach, I’ll leave that to a future blog post (it’d be a lengthy sidetrack for this post I fear).

To create a compute pipeline that we can execute with, we first create a shader module with vkCreateShaderModule.
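A sketch of that call, assuming code points at our hand-written SPIR-V binary and codeSize is its length in bytes (both names are placeholders of mine, not from the sample):

const VkShaderModuleCreateInfo shaderModuleCreateInfo = {
  VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO, 0, 0,
  codeSize, // size of the SPIR-V binary, in bytes
  code      // const uint32_t* pointing at the SPIR-V binary
};

VkShaderModule shader_module;
vkCreateShaderModule(device, &shaderModuleCreateInfo, 0, &shader_module);

Next we need a descriptor set layout using vkCreateDescriptorSetLayout, with the following structs: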

VkDescriptorSetLayoutBinding descriptorSetLayoutBindings[2] = {
  {0, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 1, VK_SHADER_STAGE_COMPUTE_BIT, 0},
  {1, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 1, VK_SHADER_STAGE_COMPUTE_BIT, 0}
};

VkDescriptorSetLayoutCreateInfo descriptorSetLayoutCreateInfo = {
  VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
  0, 0, 2, descriptorSetLayoutBindings
};

We are describing the bindings within the set we are using for our compute shader, namely we have two descriptors in the set, both of which are storage buffers being used in a compute shader.
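Followed by the call that consumes them:

VkDescriptorSetLayout descriptorSetLayout;
vkCreateDescriptorSetLayout(
  device, &descriptorSetLayoutCreateInfo, 0, &descriptorSetLayout);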

We then use vkCreatePipelineLayout to create our pipeline layout:

typedef struct VkPipelineLayoutCreateInfo {
    VkStructureType                 sType;
    const void*                     pNext;
    VkPipelineLayoutCreateFlags     flags;
    uint32_t                        setLayoutCount; // care about this
    const VkDescriptorSetLayout*    pSetLayouts;    // care about this
    uint32_t                        pushConstantRangeCount;
    const VkPushConstantRange*      pPushConstantRanges;
} VkPipelineLayoutCreateInfo;

Since we have only one descriptor set, we set setLayoutCount to 1 and pSetLayouts to the descriptor set layout we created before for our two bindings.
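As a sketch:

const VkPipelineLayoutCreateInfo pipelineLayoutCreateInfo = {
  VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO, 0, 0,
  1, &descriptorSetLayout, // our single descriptor set layout
  0, 0                     // no push constant ranges
};

VkPipelineLayout pipelineLayout;
vkCreatePipelineLayout(device, &pipelineLayoutCreateInfo, 0, &pipelineLayout);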

And then lastly we use vkCreateComputePipelines to create our compute pipeline:

VkComputePipelineCreateInfo computePipelineCreateInfo = {
  VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO,
  0, 0,
  {
    VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO,
    0, 0, VK_SHADER_STAGE_COMPUTE_BIT, shader_module, "f", 0
  },
  pipelineLayout, 0, 0
};
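And the call itself - a sketch using no pipeline cache and a single create info:

VkPipeline pipeline;
vkCreateComputePipelines(
  device, VK_NULL_HANDLE, 1, &computePipelineCreateInfo, 0, &pipeline);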

Our shader module has one entry point called “f”, and it is a compute shader. We also need the pipeline layout we just created, and et voilà - we have our compute pipeline ready to execute with.

executing the compute shader

To execute a compute shader we need to do the following (a condensed sketch of all ten steps follows the list):

  1. Create a descriptor set with a VkDescriptorBufferInfo for each of our buffers (one per binding in the compute shader).
  2. Update the descriptor set to set the bindings of both of the VkBuffer’s we created earlier.
  3. Create a command pool with our queue family index.
  4. Allocate a command buffer from the command pool.
  5. Begin the command buffer (using VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT, as we aren’t resubmitting the buffer in our sample).
  6. Bind our compute pipeline.
  7. Bind our descriptor set at the VK_PIPELINE_BIND_POINT_COMPUTE.
  8. Dispatch a compute shader for each element of our buffer.
  9. End the command buffer.
  10. And submit it to the queue!
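Here is that condensed sketch, assuming the device, queue, pipeline, pipelineLayout, descriptorSetLayout, buffers, bufferSize, and queueFamilyIndex from earlier (error checking omitted):

// 1 & 2: create and update the descriptor set.
const VkDescriptorPoolSize descriptorPoolSize = {
  VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 2
};

const VkDescriptorPoolCreateInfo descriptorPoolCreateInfo = {
  VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO, 0, 0,
  1, 1, &descriptorPoolSize
};

VkDescriptorPool descriptorPool;
vkCreateDescriptorPool(device, &descriptorPoolCreateInfo, 0, &descriptorPool);

const VkDescriptorSetAllocateInfo descriptorSetAllocateInfo = {
  VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO, 0,
  descriptorPool, 1, &descriptorSetLayout
};

VkDescriptorSet descriptorSet;
vkAllocateDescriptorSets(device, &descriptorSetAllocateInfo, &descriptorSet);

const VkDescriptorBufferInfo in_descriptorBufferInfo = {
  in_buffer, 0, VK_WHOLE_SIZE
};
const VkDescriptorBufferInfo out_descriptorBufferInfo = {
  out_buffer, 0, VK_WHOLE_SIZE
};

const VkWriteDescriptorSet writeDescriptorSets[2] = {
  {VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET, 0, descriptorSet, 0, 0, 1,
   VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 0, &in_descriptorBufferInfo, 0},
  {VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET, 0, descriptorSet, 1, 0, 1,
   VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 0, &out_descriptorBufferInfo, 0}
};

vkUpdateDescriptorSets(device, 2, writeDescriptorSets, 0, 0);

// 3 & 4: create the command pool and allocate a command buffer.
const VkCommandPoolCreateInfo commandPoolCreateInfo = {
  VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO, 0, 0, queueFamilyIndex
};

VkCommandPool commandPool;
vkCreateCommandPool(device, &commandPoolCreateInfo, 0, &commandPool);

const VkCommandBufferAllocateInfo commandBufferAllocateInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO, 0,
  commandPool, VK_COMMAND_BUFFER_LEVEL_PRIMARY, 1
};

VkCommandBuffer commandBuffer;
vkAllocateCommandBuffers(device, &commandBufferAllocateInfo, &commandBuffer);

// 5-9: record the commands.
const VkCommandBufferBeginInfo commandBufferBeginInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO, 0,
  VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT, 0
};

vkBeginCommandBuffer(commandBuffer, &commandBufferBeginInfo);
vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE,
  pipelineLayout, 0, 1, &descriptorSet, 0, 0);
vkCmdDispatch(commandBuffer,
  (uint32_t)(bufferSize / sizeof(int32_t)), 1, 1); // one workgroup per element
vkEndCommandBuffer(commandBuffer);

// 10: submit to the queue.
const VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO, 0, 0, 0, 0, 1, &commandBuffer, 0, 0
};

vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);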

getting the results

To get the results from a submitted command buffer, the coarse way is to use vkQueueWaitIdle - wait for all command buffers submitted to a queue to complete. For our purposes we are submitting one command buffer to one queue and waiting for it to complete, so it is the perfect tool for our sample - but broadly speaking you are better off chaining dependent submissions together with VkSemaphore’s, and using a VkFence on only the final submission of the workload to ensure execution has completed.
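For our sample, that boils down to a single call:

vkQueueWaitIdle(queue);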

Once we’ve waited on the queue, we simply map the memory and check that the first half of the buffer equals the second half - i.e. the memcpy of the elements succeeded:

int32_t *payload;
vkMapMemory(device, memory, 0, memorySize, 0, (void **)&payload);

for (uint32_t k = 0, e = bufferSize / sizeof(int32_t); k < e; k++) {
  assert(payload[k + e] == payload[k]);
}

And we are done! We have written our first memcpy sample in Vulkan compute shaders.

fin

The sample is dirty in ‘real-world application’ terms - it doesn’t free any of the Vulkan objects that need to be freed on completion. The short version of why: one of the drivers I am testing on loves to segfault on perfectly valid code (and yes, for any IHVs reading this, I have already flagged it with the relevant vendor!).

But for the purposes of explaining an easy Vulkan compute sample to all the compute lovers among my readership, I hope the above gives you a good overview of exactly how to do that. Yes, there are many hoops to jump through to get something executing, but the sheer level of control the Vulkan API gives you far outweighs the few extra lines of code we need.

The full sample is available at the GitHub gist here.

Stay tuned for more Vulkan compute examples to come in future posts!