29 May

A simple Vulkan Compute example

With all the buzz surrounding Vulkan and its ability to make graphics more shiny/pretty/fast, there is one key thing that seems to have been lost in the ether of information – Vulkan isn’t just a graphics API, it supports compute too! Quoting the specification (bold added for effect):

Vulkan is an API (Application Programming Interface) for graphics and compute hardware

And:

This specification defines four types of functionality that queues may support: graphics, compute, transfer, and sparse memory management.

We can see from how carefully the language is crafted that Vulkan not only supports compute, it even allows for cases where a Vulkan driver could expose only compute.

In this vein, I’ve put together a simple Vulkan compute sample – VkComputeSample. The sample:

  • allocates two buffers
  • fills them with random data
  • creates a compute shader that will memcpy from one buffer to the other
  • then checks that the data copied over successfully

Key Vulkan principles covered:

  • creating a device and queue for compute only
  • allocating memories and buffers from them
  • writing a simple compute shader
  • executing the compute shader
  • getting the results

So without further ado, let us begin.

creating a device and queue for compute only

Vulkan has a ton of boilerplate code you need to use to get ready for action.

First up we need a VkInstance. To get this, we need to look at two of Vulkan’s structs – VkApplicationInfo and VkInstanceCreateInfo:

typedef struct VkApplicationInfo {
    VkStructureType    sType;
    const void*        pNext;
    const char*        pApplicationName;
    uint32_t           applicationVersion;
    const char*        pEngineName;
    uint32_t           engineVersion;
    uint32_t           apiVersion; // care about this
} VkApplicationInfo;

typedef struct VkInstanceCreateInfo {
    VkStructureType             sType;
    const void*                 pNext;
    VkInstanceCreateFlags       flags;
    const VkApplicationInfo*    pApplicationInfo; // care about this
    uint32_t                    enabledLayerCount;
    const char* const*          ppEnabledLayerNames;
    uint32_t                    enabledExtensionCount;
    const char* const*          ppEnabledExtensionNames;
} VkInstanceCreateInfo;

I’ve flagged the only two fields we really need to care about here – apiVersion and pApplicationInfo. The most important is apiVersion, which lets us write an application against the current Vulkan specification and record, in the code itself, exactly which version of Vulkan we wrote our application against.

Why is this important you ask?

  1. It helps future you. You’ll know which version of the specification to look at.
  2. It allows the validation layer to understand which version of Vulkan you think you are interacting with, and potentially flag up any cross version issues between your application and the drivers you are interacting with.

I recommend you always at least provide an apiVersion.

pApplicationInfo is the easier of the two to justify – it must point to a valid VkApplicationInfo if you want to specify an apiVersion, which again I highly recommend you do.
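Putting the two structs to work, a minimal sketch of instance creation might look like the following (the application name and versions here are placeholders – use your own, and whichever apiVersion you actually target):

VkApplicationInfo applicationInfo = {
  VK_STRUCTURE_TYPE_APPLICATION_INFO,
  0,                       // pNext
  "VkComputeSample",       // pApplicationName (placeholder)
  0,                       // applicationVersion
  0,                       // pEngineName
  0,                       // engineVersion
  VK_MAKE_VERSION(1, 0, 0) // apiVersion - the version we wrote against
};

VkInstanceCreateInfo instanceCreateInfo = {
  VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
  0, 0,
  &applicationInfo, // the field we care about
  0, 0, 0, 0
};

VkInstance instance;
vkCreateInstance(&instanceCreateInfo, 0, &instance);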

Next, we need to get all the physical devices the instance can interact with:

uint32_t physicalDeviceCount = 0;
vkEnumeratePhysicalDevices(instance, &physicalDeviceCount, 0);

VkPhysicalDevice* const physicalDevices = (VkPhysicalDevice*)malloc(
   sizeof(VkPhysicalDevice) * physicalDeviceCount);

vkEnumeratePhysicalDevices(
  instance, &physicalDeviceCount, physicalDevices);

We do this by using a pair of vkEnumeratePhysicalDevices calls – one to get the number of physical devices the instance knows about, and one to fill a newly created array with handles to these physical devices.

For the purposes of the sample, I iterate through these physical devices and run my sample on each of the physical devices present in the system – but for a ‘real-world application’ you’d want to find which device best suits your workload by using vkGetPhysicalDeviceFeatures, vkGetPhysicalDeviceFormatProperties, vkGetPhysicalDeviceImageFormatProperties, vkGetPhysicalDeviceProperties, vkGetPhysicalDeviceQueueFamilyProperties and vkGetPhysicalDeviceMemoryProperties.

For each physical device we need to find a queue family for that physical device which can work for compute:

uint32_t queueFamilyPropertiesCount = 0;
vkGetPhysicalDeviceQueueFamilyProperties(
  physicalDevice, &queueFamilyPropertiesCount, 0);

VkQueueFamilyProperties* const queueFamilyProperties =
  (VkQueueFamilyProperties*)malloc(
    sizeof(VkQueueFamilyProperties) * queueFamilyPropertiesCount);

vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice,
  &queueFamilyPropertiesCount, queueFamilyProperties);

We do this by using a pair of calls to vkGetPhysicalDeviceQueueFamilyProperties, the first to get the number of queue families available, and the second to fill an array of information about our queue families. In each queue family:

typedef struct VkQueueFamilyProperties {
    VkQueueFlags    queueFlags; // care about this
    uint32_t        queueCount;
    uint32_t        timestampValidBits;
    VkExtent3D      minImageTransferGranularity;
} VkQueueFamilyProperties;

We care about the queueFlags member, which specifies what workloads can execute on a particular queue. A naive approach would be to accept any queue that can handle compute workloads. A better approach is to find a queue that only handles compute workloads (though you need to ignore the transfer bit, and for our purposes the sparse binding bit too).
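As a sketch, the selection might look something like this – preferring a compute-only queue family, and falling back to any family that supports compute:

uint32_t queueFamilyIndex = UINT32_MAX;

for (uint32_t k = 0; k < queueFamilyPropertiesCount; k++) {
  // mask off the bits we don't care about for the comparison
  const VkQueueFlags maskedFlags = ~(VK_QUEUE_TRANSFER_BIT |
    VK_QUEUE_SPARSE_BINDING_BIT) & queueFamilyProperties[k].queueFlags;

  if (!(VK_QUEUE_GRAPHICS_BIT & maskedFlags) &&
      (VK_QUEUE_COMPUTE_BIT & maskedFlags)) {
    queueFamilyIndex = k; // a compute-only queue family - ideal!
    break;
  }
}

// failing that, fall back to any queue family that supports compute
if (UINT32_MAX == queueFamilyIndex) {
  for (uint32_t k = 0; k < queueFamilyPropertiesCount; k++) {
    if (VK_QUEUE_COMPUTE_BIT & queueFamilyProperties[k].queueFlags) {
      queueFamilyIndex = k;
      break;
    }
  }
}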

Once we have a valid index into the queueFamilyProperties array we allocated, we need to keep that index around – it becomes our queue family index, used in various other places in the API.

Next up, create the device:

typedef struct VkDeviceQueueCreateInfo {
    VkStructureType             sType;
    const void*                 pNext;
    VkDeviceQueueCreateFlags    flags;
    uint32_t                    queueFamilyIndex; // care about this
    uint32_t                    queueCount;
    const float*                pQueuePriorities;
} VkDeviceQueueCreateInfo;

typedef struct VkDeviceCreateInfo {
    VkStructureType                    sType;
    const void*                        pNext;
    VkDeviceCreateFlags                flags;
    uint32_t                           queueCreateInfoCount; // care about this
    const VkDeviceQueueCreateInfo*     pQueueCreateInfos;    // care about this
    uint32_t                           enabledLayerCount;
    const char* const*                 ppEnabledLayerNames;
    uint32_t                           enabledExtensionCount;
    const char* const*                 ppEnabledExtensionNames;
    const VkPhysicalDeviceFeatures*    pEnabledFeatures;
} VkDeviceCreateInfo;

The queue family index we just worked out goes in our VkDeviceQueueCreateInfo struct’s queueFamilyIndex member, and our VkDeviceCreateInfo will have a queueCreateInfoCount of 1, with pQueueCreateInfos set to the address of our single VkDeviceQueueCreateInfo struct.
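Putting that together, a minimal sketch of device creation:

const float queuePriority = 1.0f;

VkDeviceQueueCreateInfo deviceQueueCreateInfo = {
  VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
  0, 0,
  queueFamilyIndex, // the index we found above
  1,                // one queue
  &queuePriority
};

VkDeviceCreateInfo deviceCreateInfo = {
  VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
  0, 0,
  1, // queueCreateInfoCount
  &deviceQueueCreateInfo,
  0, 0, 0, 0, 0
};

VkDevice device;
vkCreateDevice(physicalDevice, &deviceCreateInfo, 0, &device);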

Lastly we get our device’s queue using:

VkQueue queue;
vkGetDeviceQueue(device, queueFamilyIndex, 0, &queue);

Et voilà, we have our device, we have our queue, and we are done (with getting our device and queue at least).

allocating memories and buffers from them

To allocate buffers for use in our compute shader, we first have to allocate the memory that backs the buffer – the physical location of the buffer on the device. Vulkan supports many different memory types, so we need to query for the one that matches our requirements. We do this with a call to vkGetPhysicalDeviceMemoryProperties, and then find a memory type that has the properties we require and is big enough for our uses:

const VkDeviceSize memorySize; // whatever size of memory we require

VkPhysicalDeviceMemoryProperties properties;
vkGetPhysicalDeviceMemoryProperties(physicalDevice, &properties);

uint32_t memoryTypeIndex = VK_MAX_MEMORY_TYPES;

for (uint32_t k = 0; k < properties.memoryTypeCount; k++) {
  const VkMemoryType memoryType = properties.memoryTypes[k];

  if ((VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT & memoryType.propertyFlags)
  && (VK_MEMORY_PROPERTY_HOST_COHERENT_BIT & memoryType.propertyFlags)
  && (memorySize < properties.memoryHeaps[memoryType.heapIndex].size)) {
    // found our memory type!
    memoryTypeIndex = k;
    break;
  }
}

If we know how much memory we require, we can find an index into the memoryTypes array of our VkPhysicalDeviceMemoryProperties struct that has the properties we require set and whose backing heap is big enough. For the sample I’m using memory that is host visible and host coherent (for ease of sample writing).

With the memory type index we found above we can allocate a memory:

typedef struct VkMemoryAllocateInfo {
    VkStructureType    sType;
    const void*        pNext;
    VkDeviceSize       allocationSize;
    uint32_t           memoryTypeIndex; // care about this
} VkMemoryAllocateInfo;

We need to care about the memoryTypeIndex – which we’ll set to the index we worked out from VkPhysicalDeviceMemoryProperties before.
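The allocation itself is then a single call – a sketch, reusing the memoryTypeIndex we found above:

VkMemoryAllocateInfo memoryAllocateInfo = {
  VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
  0,
  memorySize,     // allocationSize
  memoryTypeIndex // the index we found above
};

VkDeviceMemory memory;
vkAllocateMemory(device, &memoryAllocateInfo, 0, &memory);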

For the sample, I allocate one memory, and then subdivide it into two buffers. We create two storage buffers (using VK_BUFFER_USAGE_STORAGE_BUFFER_BIT), and since we do not intend to use overlapping regions of memory for the buffers our sharing mode is VK_SHARING_MODE_EXCLUSIVE. Lastly we need to specify which queue families these buffers will be used with – in our case it’s the one queueFamilyIndex we discovered at the start.
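As a sketch, the buffer creation might look like this (assuming, as the sample does, that each buffer takes half of memorySize):

const VkDeviceSize bufferSize = memorySize / 2;

VkBufferCreateInfo bufferCreateInfo = {
  VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
  0, 0,
  bufferSize,
  VK_BUFFER_USAGE_STORAGE_BUFFER_BIT,
  VK_SHARING_MODE_EXCLUSIVE,
  1,                // one queue family
  &queueFamilyIndex // the queue family index we found at the start
};

VkBuffer in_buffer;
vkCreateBuffer(device, &bufferCreateInfo, 0, &in_buffer);

VkBuffer out_buffer;
vkCreateBuffer(device, &bufferCreateInfo, 0, &out_buffer);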

The link between our memories and our buffers is vkBindBufferMemory:

vkBindBufferMemory(device, in_buffer, memory, 0);
vkBindBufferMemory(device, out_buffer, memory, bufferSize);

The crucial parameter for us to use the one memory for two buffers is the last one – memoryOffset. For our second buffer we set it to begin after the first buffer has ended. Since we are creating storage buffers, we need to be sure that our memoryOffset is a multiple of the minStorageBufferOffsetAlignment member of the VkPhysicalDeviceLimits struct. For the purposes of the sample, we choose a memory size that is a large power of two, satisfying the alignment requirements on our target platforms.

The last thing we can do is fill the memory with some initial random data. To do this we map the memory, write to it, and unmap, prior to using the memory in any queue:

VkDeviceSize memorySize; // whatever size of memory we require

int32_t *payload;
vkMapMemory(device, memory, 0, memorySize, 0, (void **)&payload);

for (uint32_t k = 0; k < memorySize / sizeof(int32_t); k++) {
  payload[k] = rand();
}

vkUnmapMemory(device, memory);

And that is it – our memory and buffers are ready for use later.

writing a simple compute shader

My job with Codeplay is to work on the Vulkan specification with the Khronos group. My real passion within this is making compute awesome. I spend a good amount of my time working on Vulkan compute but also on SPIR-V for Vulkan. I’ve never been a happy user of GLSL compute shaders – and luckily now I don’t have to use them!

For the purposes of the sample, I’ve hand written a little compute shader to copy from a storage buffer (set = 0, binding = 0) to another storage buffer (set = 0, binding = 1). As to the details of my approach, I’ll leave that to a future blog post (it’d be a lengthy sidetrack for this post I fear).
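Turning that hand-written SPIR-V into something Vulkan can use is the job of vkCreateShaderModule. A minimal sketch, assuming the SPIR-V binary lives in a (hypothetical) shader array that is shaderSize bytes long:

VkShaderModuleCreateInfo shaderModuleCreateInfo = {
  VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO,
  0, 0,
  shaderSize, // size of the SPIR-V binary, in bytes
  shader      // pointer to the SPIR-V words (a const uint32_t*)
};

VkShaderModule shader_module;
vkCreateShaderModule(device, &shaderModuleCreateInfo, 0, &shader_module);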

To create a compute pipeline that we can execute with, we first create a shader module with vkCreateShaderModule (as sketched above). Next we need a descriptor set layout, created using vkCreateDescriptorSetLayout with the following structs:

VkDescriptorSetLayoutBinding descriptorSetLayoutBindings[2] = {
  {0, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 1, VK_SHADER_STAGE_COMPUTE_BIT, 0},
  {1, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 1, VK_SHADER_STAGE_COMPUTE_BIT, 0}
};

VkDescriptorSetLayoutCreateInfo descriptorSetLayoutCreateInfo = {
  VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
  0, 0, 2, descriptorSetLayoutBindings
};

We are describing the bindings within the set we are using for our compute shader: namely, we have two descriptors in the set, both of which are storage buffers being used in a compute shader.
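With those structs filled in, the create call itself is short:

VkDescriptorSetLayout descriptorSetLayout;
vkCreateDescriptorSetLayout(device, &descriptorSetLayoutCreateInfo, 0, &descriptorSetLayout);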

We then use vkCreatePipelineLayout to create our pipeline layout:

typedef struct VkPipelineLayoutCreateInfo {
    VkStructureType                 sType;
    const void*                     pNext;
    VkPipelineLayoutCreateFlags     flags;
    uint32_t                        setLayoutCount; // care about this
    const VkDescriptorSetLayout*    pSetLayouts;    // care about this
    uint32_t                        pushConstantRangeCount;
    const VkPushConstantRange*      pPushConstantRanges;
} VkPipelineLayoutCreateInfo;

Since we have only one descriptor set, we set setLayoutCount to 1, and pSetLayouts to the descriptor set layout we created for our two bindings before.
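As a sketch, that looks like:

VkPipelineLayoutCreateInfo pipelineLayoutCreateInfo = {
  VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO,
  0, 0,
  1, &descriptorSetLayout, // our single descriptor set layout
  0, 0
};

VkPipelineLayout pipelineLayout;
vkCreatePipelineLayout(device, &pipelineLayoutCreateInfo, 0, &pipelineLayout);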

And then lastly we use vkCreateComputePipelines to create our compute pipeline:

VkComputePipelineCreateInfo computePipelineCreateInfo = {
  VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO,
  0, 0,
  {
    VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO,
    0, 0, VK_SHADER_STAGE_COMPUTE_BIT, shader_module, "f", 0
  },
  pipelineLayout, 0, 0
};
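That struct is then handed to the create call (passing a null pipeline cache, as we have no use for one in the sample):

VkPipeline pipeline;
vkCreateComputePipelines(device, 0, 1, &computePipelineCreateInfo, 0, &pipeline);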

Our shader module has a single entry point called “f”, and it is a compute shader. We also provide the pipeline layout we just created, and et voilà – we have our compute pipeline ready to execute with.

executing the compute shader

To execute a compute shader we need to:

  1. Create a descriptor set that has one VkDescriptorBufferInfo for each of our buffers (one per binding in the compute shader).
  2. Update the descriptor set to set the bindings of both of the VkBuffer’s we created earlier.
  3. Create a command pool with our queue family index.
  4. Allocate a command buffer from the command pool.
  5. Begin the command buffer, with VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT since we aren’t resubmitting the buffer in our sample (the flag is part of VkCommandBufferBeginInfo).
  6. Bind our compute pipeline.
  7. Bind our descriptor set at the VK_PIPELINE_BIND_POINT_COMPUTE.
  8. Dispatch a compute shader for each element of our buffer.
  9. End the command buffer.
  10. And submit it to the queue! (A condensed sketch of all ten steps follows below.)
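Spelled out in code, a condensed sketch of those ten steps might look like the following – reusing the device, queue, buffers, queueFamilyIndex, descriptorSetLayout, pipelineLayout and pipeline from earlier, with all error handling omitted:

// steps 1 and 2 - create a descriptor set for our two buffers, and
// point its two bindings at them
VkDescriptorPoolSize descriptorPoolSize = {
  VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 2
};

VkDescriptorPoolCreateInfo descriptorPoolCreateInfo = {
  VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO,
  0, 0,
  1, // maxSets
  1, &descriptorPoolSize
};

VkDescriptorPool descriptorPool;
vkCreateDescriptorPool(device, &descriptorPoolCreateInfo, 0, &descriptorPool);

VkDescriptorSetAllocateInfo descriptorSetAllocateInfo = {
  VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO,
  0, descriptorPool, 1, &descriptorSetLayout
};

VkDescriptorSet descriptorSet;
vkAllocateDescriptorSets(device, &descriptorSetAllocateInfo, &descriptorSet);

VkDescriptorBufferInfo in_descriptorBufferInfo = {in_buffer, 0, VK_WHOLE_SIZE};
VkDescriptorBufferInfo out_descriptorBufferInfo = {out_buffer, 0, VK_WHOLE_SIZE};

VkWriteDescriptorSet writeDescriptorSets[2] = {
  {VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET, 0, descriptorSet, 0, 0, 1,
   VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 0, &in_descriptorBufferInfo, 0},
  {VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET, 0, descriptorSet, 1, 0, 1,
   VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 0, &out_descriptorBufferInfo, 0}
};

vkUpdateDescriptorSets(device, 2, writeDescriptorSets, 0, 0);

// steps 3 and 4 - create a command pool, and allocate our command buffer
VkCommandPoolCreateInfo commandPoolCreateInfo = {
  VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO,
  0, 0, queueFamilyIndex
};

VkCommandPool commandPool;
vkCreateCommandPool(device, &commandPoolCreateInfo, 0, &commandPool);

VkCommandBufferAllocateInfo commandBufferAllocateInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
  0, commandPool, VK_COMMAND_BUFFER_LEVEL_PRIMARY, 1
};

VkCommandBuffer commandBuffer;
vkAllocateCommandBuffers(device, &commandBufferAllocateInfo, &commandBuffer);

// steps 5 through 9 - record the command buffer
VkCommandBufferBeginInfo commandBufferBeginInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
  0, VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT, 0
};

vkBeginCommandBuffer(commandBuffer, &commandBufferBeginInfo);

vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);

vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE,
  pipelineLayout, 0, 1, &descriptorSet, 0, 0);

// one workgroup per int32_t element of the buffer (our local size is 1)
vkCmdDispatch(commandBuffer, (uint32_t)(bufferSize / sizeof(int32_t)), 1, 1);

vkEndCommandBuffer(commandBuffer);

// step 10 - submit it to the queue!
VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0, 0, 0, 0, 1, &commandBuffer, 0, 0
};

vkQueueSubmit(queue, 1, &submitInfo, 0);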

getting the results

To get the results from a submitted command buffer, the coarse way is to use vkQueueWaitIdle – it waits for all command buffers submitted to a queue to complete. For our purposes we are submitting one command buffer to one queue and waiting for it to complete, so it is the perfect tool for our sample – but broadly speaking you are better off chaining dependent submissions together with VkSemaphores, and using a VkFence only on the final submission of the workload to ensure execution has completed.

Once we’ve waited on the queue, we simply map the memory and check that the first half of the buffer equals the second half – i.e. that the memcpy of the elements succeeded:

int32_t *payload;
vkMapMemory(device, memory, 0, memorySize, 0, (void **)&payload);

for (uint32_t k = 0, e = bufferSize / sizeof(int32_t); k < e; k++) {
  assert(payload[k + e] == payload[k]);
}

And we are done! We have written our first memcpy sample in Vulkan compute shaders.

fin

The sample is dirty in ‘real-world application’ terms – it doesn’t free any of the Vulkan objects that should be freed on completion. The TL;DR of why: one of the drivers I am testing on loves to segfault on perfectly valid code (and yes, for any IHVs reading this, I have already flagged it up with the relevant vendor!).

But for the purposes of explaining an easy Vulkan compute sample to all the compute lovers among my readership, I hope the above gives you a good overview of exactly how to do that. Yes, there are many hoops to jump through to get something executing, but the sheer level of control the Vulkan API gives you far outweighs the few extra lines of code we need.

The full sample is available at the GitHub gist here.

Stay tuned for more Vulkan compute examples to come in future posts!

22 thoughts on “A simple Vulkan Compute example”

  1. Nice. This is super interesting stuff.

    How about an example of compiling an OpenCL kernel to SPIR-V and then loading it at runtime?

    A suggestion for a future blog post might be to cover the differences between the OpenCL and Vulkan runtimes. For example, how should devs enforce OpenCL-style host-declared event wait lists in Vulkan? What can you do in OpenCL that you can’t do in Vulkan — both easily and/or not at all?

    • I’ll file your very good comments away for future blog post ideas for sure!

The OpenCL kernel -> SPIR-V one is… interesting. It’s non-obvious at an initial glance why it wouldn’t ‘just work’, so I think I could dedicate a post to explaining exactly why this is an issue for sure.

And the differences between OpenCL and Vulkan are rather nuanced, so that will deserve another post for sure.

      Thanks!

  2. Hello,
    I am currently in an image processing class and so far we have been using a single thread to implement things like heat flow for image correction.
    I followed your blog and code and was able to get it working along with my modifications for my class.
I tested the speed at which my single threaded heat flow runs versus my Vulkan version, and my Vulkan version is more than 10 times slower.

    I’m still not quite sure what I’m doing, but I suspect that my reduced efficiency is either due to my compute shader or the way I submit my commands.

    If you could spare some time, I have put my code in a repository:

    https://github.com/MichaelMitchellM/VulkanComputeNumericalPDEs

    To quickly navigate to the queue submission area, ctrl + F SUSPECT

    Thank you for your time,
    Michael Mitchell

    • So one thing at a first glance is that in your compute shader you didn’t specify a local work group size. You want to query VkPhysicalDeviceLimits for maxComputeWorkGroupSize and maxComputeWorkGroupInvocations and set it accordingly. On AMD a local work group size of 64 seems mostly ok, and on NVIDIA 32 is ok – but you’ll need to experiment for your own code of course! Remember to divide the global work size you specify in vkCmdDispatch by the local work group size you set. I didn’t want to add extra logic to my already rather large ‘simple’ example to query them for my own case 🙂
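      As a rough sketch: if your shader declared layout(local_size_x = 64) in; then the host side dispatch for a totalInvocations-sized problem (a hypothetical name) becomes:

      vkCmdDispatch(commandBuffer, totalInvocations / 64, 1, 1);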

  3. Very interesting. Thank you for writing this. I spent quite a while looking for something like this a few weeks ago, but yours appears to be the first one. My question is, though, why would someone choose to use this over, say, CUDA and OpenCL, which are so much easier to use? It’s all extremely low level. Almost as verbose as kernel code.

    • The biggest reason is control – with Vulkan you get much finer-grained control over exactly what your CPU and GPU are doing.

      Vulkan is a huge win for workloads that are going to be submitted multiple times – you can express this to the driver much more succinctly, and you will see performance improvements as a result (look at VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT for how to control this behaviour).

  4. ‘A better approach would be to find a queue that only handled compute workloads (but you need to ignore the transfer bit and for our purposes the sparse binding bit too)’. I do not understand what you mean here by ignoring the transfer bit and the sparse binding bit. Also, can you explain what sparse binding is?

    • So the transfer bit signifies that a queue can perform some form of DMA move operations – copying data to and from a device. AMD’s driver exposes a queue for this that will asynchronously copy data to and from the GPU while the main queue performs graphics work. What I was trying to say here was that you probably want to try to find a queue that has the compute bit set, but not the graphics bit set – a queue dedicated to compute only. Failing that, you could fall back to any queue that exposes compute.

      The sparse bit basically signifies that the device supports sparse memory – you can read more about the functionality here – Sparse Memory Example.

      For our purposes we do not need sparse memory so we can effectively ignore the bit.

  5. So, it boils down to 600+ LOC just to copy a buffer 🙁
    I wish the APIs could provide a better abstraction. This was one of the problems OpenCL had, and it might be the reason that it didn’t get picked up by the community.

    • On the face of it – you are correct. Lots of lines, just does a memcpy.

      But you’ve got to see the big picture of this – you or anyone else in the community can make whatever API abstraction on top of Vulkan you want to meet your needs.

      Vulkan maps in a very real 1:1 way with how the hardware is actually going to execute your code – the biggest complaint we had for ages about all the existing APIs was that they were too high level and didn’t provide enough control. Vulkan is the answer to this – sure, it won’t be perfect for everyone’s needs (it is purposefully verbose) but it gives you control.

  6. Hi Neil,
    First of all great article/tutorial !

    Now, you have written the compute shader code in SPIR-V format. Can you post or send a link to the GLSL version of the compute shader?

    • I used https://github.com/KhronosGroup/SPIRV-Cross to decompile the SPIR-V shader code. Here it is:

      #version 430
      layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

      layout(binding = 0, std430) buffer _9
      {
        int data[16384];
      } inArray;

      layout(binding = 1, std430) buffer _9
      {
        int data[16384];
      } outArray;

      void main()
      {
        outArray.data[gl_GlobalInvocationID.x] = inArray.data[gl_GlobalInvocationID.x];
      }

      • Previous shader code won’t compile.
        It will give you errors like these:
        Warning, version 430 is not yet complete; most version-specific features are present, but some are missing.
        ERROR: computeShader.comp:9: ‘_9’ : Cannot reuse block name within the same interface: buffer
        ERROR: 1 compilation errors. No code generated.

        Linked compute stage:

        ERROR: Linking compute stage: Missing entry point: Each stage requires one “void main()” entry point

        SPIR-V is not generated for failed compile or link

        The following code will compile just fine:

        #version 430
        layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

        layout(binding = 0, std430) buffer a
        {
          int data[16384];
        } inArray;

        layout(binding = 1, std430) buffer b
        {
          int data[16384];
        } outArray;

        void main()
        {
          outArray.data[gl_GlobalInvocationID.x] = inArray.data[gl_GlobalInvocationID.x];
        }

        • Also don’t forget to set pipelineShaderStageCreateInfo.pName = “main”;
          This is the name of the function that gets executed.

  7. Say I want to write a matrix multiplication thing. Would using storage buffers be the way to go, or using textures or images, as was the case before compute shaders came into existence? By the way, thanks for the excellent tutorial!

    • Use storage buffers for sure! I’d personally only do the texture/image lark if I was planning on converting a compute shader -> fragment shader, or I was pipelining a compute shader with some graphics shaders!

      • Thanks for answering! Currently I have this implementation. I was wondering, would using one over the other be faster or better in some sense?

        #version 450
        #extension GL_ARB_separate_shader_objects : enable

        layout (local_size_x = 512) in;

        layout(binding = 0) readonly buffer matrix_a {
          vec4 A[];
        };
        layout(binding = 1) readonly buffer matrix_b_transpose {
          vec4 B_t[];
        };
        layout(binding = 2) buffer matrix_c {
          float C[];
        };

        // C = A*B, C=M*N, A=M*K, B = K*N
        // K should be divisible by 4, i.e. zero padded to both A and B
        layout (binding = 3) uniform UBO
        {
          int M;
          int N;
          int K;
        } ubo;

        void main()
        {
          uint index = gl_GlobalInvocationID.x;
          if (index >= ubo.M*ubo.N)
            return;
          int row = int(index/ubo.N);
          int column = int(mod(index,ubo.N));

          float sum = 0.;
          int a_row_start_index = ubo.K/4 * row;
          int b_column_start_index = ubo.K/4 * column;

          for(int i = 0; i < ubo.K/4; i++){
            sum += dot(A[a_row_start_index + i], B_t[b_column_start_index + i]);
          }
          C[index] = sum;
        }

        • Honestly with any of these things I just compile it, use spirv-dis to disassemble the SPIR-V binary, and have a look at the opcodes. You’ll get a pretty good idea whether the code is optimal or not then!

  8. Sorry, but I am out of my depth here. How can one see the opcodes and know? Can you please provide an explanation or any references? And thanks a lot for your replies, I truly very much appreciate them.

    • Ah, so basically if you had two different approaches and you weren’t sure which would be more efficient, the obvious first thing to do is just to run the code and see which runs faster over N iterations.

      Then, what I tend to do is use spirv-dis on the .spv files, and look at the opcodes and see if I can notice anything silly that the compiler hasn’t been able to do in either case.

  9. You may have the meaning of VK_SHARING_MODE_EXCLUSIVE slightly wrong. As far as I can tell from the spec it’s indicating that the resource is not shared among queues, not about overlapping resource storage.
