07 Sep

I’m speaking at the Munich Khronos Chapter Meeting 13th October 2017

Previously I had begun a series of blog posts detailing how to port applications from OpenCL -> Vulkan.

  1. OpenCL -> Vulkan: A Porting Guide (#1)
  2. OpenCL -> Vulkan: A Porting Guide (#2)
  3. OpenCL -> Vulkan: A Porting Guide (#3)

Instead of continuing this blog series, I’m converting the entire contents into a slide deck, and will be presenting it at the Munich Khronos Chapter meeting on the 13th of October 2017.

So please come along and watch me, and the other great speakers, talk about some fun things you can do with Vulkan!

Look forward to seeing y’all there.

29 Jun

OpenCL -> Vulkan: A Porting Guide (#3)

Vulkan is the newest kid on the block when it comes to cross-platform, widely supported, GPGPU compute. Vulkan’s primacy as the high performance rendering API powering the latest versions of Android, coupled with Windows and Linux desktop drivers from all major vendors means that we have a good way to run compute workloads on a wide range of devices.

OpenCL is the venerable old boy of GPGPU these days – having been around since 2009. A huge variety of software projects have made use of OpenCL as their way to run compute workloads enabling them to speed up their applications.

Given Vulkan’s rising prominence, how does one port from OpenCL to Vulkan?

This is a series of blog posts on how to port from OpenCL to Vulkan:

  1. OpenCL -> Vulkan: A Porting Guide (#1)
  2. OpenCL -> Vulkan: A Porting Guide (#2)

In this post, we’ll cover the different queue synchronization mechanisms in OpenCL and Vulkan.

clFinish vs vkWaitForFences

In the previous post I explained that an OpenCL queue (cl_command_queue) was an amalgamation of two distinct concepts:

  1. A collection of workloads to run on some hardware
  2. A thing that will run various workloads and allow interactions between them

Whereas Vulkan uses a VkCommandBuffer for 1, and a VkQueue for 2.

One common synchronization pattern is to have a queue execute a bunch of work, and then wait for all of that work to be done.

In OpenCL, you can wait on all previously submitted commands to a queue by using clFinish.

cl_command_queue queue; // previously created

// submit work to the queue
if (CL_SUCCESS != clFinish(queue)) {
  // ... error!
}

In Vulkan, because a queue is just a thing to run workloads on, we instead have to wait on the command buffer itself to complete. This is done via a VkFence which is specified when submitting work to a VkQueue.

VkDevice device; // previously created
VkQueue queue; // previously created
VkCommandBuffer commandBuffer; // previously created
VkFence fence; // previously created

// submit work to the commandBuffer

VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,              // pNext
  0,              // wait semaphore count
  0,              // wait semaphores
  0,              // wait dst stage mask
  1,              // command buffer count
  &commandBuffer, // command buffers
  0,              // signal semaphore count
  0,              // signal semaphores
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    fence)) {
  // ... error!
}

if (VK_SUCCESS != vkWaitForFences(
    device,
    1,
    &fence,
    VK_TRUE,
    UINT64_MAX)) {
  // ... error!
}

One thing to note is that you can wait on a Vulkan queue to finish all submitted workloads, but remember the difference between Vulkan queues and OpenCL queues. Vulkan queues are retrieved from a device. If multiple parts of your code (including third party libraries) retrieve the same Vulkan queue and are executing workloads on it, you will end up waiting for someone else’s work to complete.
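
For completeness, waiting on the entire queue to drain is a single call – but as noted above, this waits for every submission to that queue, not just your own:

if (VK_SUCCESS != vkQueueWaitIdle(queue)) {
  // ... error!
}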

TL;DR – waiting on a queue in Vulkan is not the same as waiting on a queue in OpenCL.

Dependencies within a cl_command_queue / VkCommandBuffer

Both OpenCL and Vulkan have mechanisms to ensure a command will only begin executing once another command has completed.

Firstly, remember that an OpenCL command queue is in order by default. This means that when you submit commands to an OpenCL command queue, each command will only begin executing once the preceding command has completed. While this isn’t ideal for performance in a number of situations, it does let users get up and running safely and quickly.
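
For reference, this default behaviour is what you get when you pass no properties at queue creation – a minimal sketch, assuming the context and device created in the earlier posts:

cl_int errorcode;

// Passing nullptr for the properties gives us the default,
// in-order, command queue.
cl_command_queue inOrderQueue = clCreateCommandQueueWithProperties(
    context,
    device,
    nullptr,
    &errorcode);

if (CL_SUCCESS != errorcode) {
  // ... error!
}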

OpenCL also allows command queues to be out of order. This means that commands submitted to a queue are guaranteed to be dispatched in order, but they may run concurrently and/or complete out of order.

With an out of order OpenCL queue, to make one command wait for another to complete before it begins executing, you use a cl_event to create a dependency between the two commands.

cl_command_queue queue; // previously created
cl_mem bufferA, bufferB, bufferC; // previously created

cl_event event;

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    bufferA,
    bufferB,
    0,
    0,
    42,
    0,
    nullptr,
    &event)) {
  // ... error!
}

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    bufferB,
    bufferC,
    0,
    0,
    42,
    1,
    &event,
    nullptr)) {
  // ... error!
}

We can guarantee that if queue above was an out of order queue, the commands would still be executed in order because we expressed the dependency between both commands.

In Vulkan, queues are out of order. There is also no exactly matching mechanism to get two arbitrary commands to depend on one another. Vulkan relies on more knowledge of what you are actually trying to do to create the right kind of synchronization between commands.

The easiest (though by no means the most performant) way to map OpenCL code that has an event dependency between two commands – or OpenCL code that used an in order queue – is to put each command in its own Vulkan command buffer. While this might seem crude, it’ll allow you to use another of Vulkan’s synchronization mechanisms to solve the problem – the semaphore.

VkDevice device; // previously created
VkQueue queue; // previously created
VkBuffer bufferA, bufferB, bufferC; // previously created
VkCommandBuffer commandBuffer1; // previously created
VkCommandBuffer commandBuffer2; // previously created
VkCommandBufferBeginInfo commandBufferBeginInfo; // previously initialized

VkSemaphoreCreateInfo semaphoreCreateInfo = {
  VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
  nullptr,
  0
};

VkSemaphore semaphore;

if (VK_SUCCESS != vkCreateSemaphore(
    device,
    &semaphoreCreateInfo,
    nullptr,
    &semaphore)) {
  // ... error!
}

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer1,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer1, bufferA, bufferB, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer1)) {
  // ... error!
}
VkSubmitInfo submitInfo1 = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,               // pNext
  0,               // wait semaphore count
  0,               // wait semaphores
  0,               // wait dst stage mask
  1,               // command buffer count
  &commandBuffer1, // command buffers
  1,               // signal semaphore count
  &semaphore,      // signal our semaphore when the commands complete
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo1,
    VK_NULL_HANDLE)) {
  // ... error!
}

VkPipelineStageFlags pipelineStageFlags =
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer2,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer2, bufferB, bufferC, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer2)) {
  // ... error!
}

VkSubmitInfo submitInfo2 = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,                   // pNext
  1,                   // wait semaphore count
  &semaphore,          // wait on our semaphore before executing
  &pipelineStageFlags, // wait dst stage mask
  1,                   // command buffer count
  &commandBuffer2,     // command buffers
  0,                   // signal semaphore count
  0,                   // signal semaphores
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo2,
    VK_NULL_HANDLE)) {
  // ... error!
}

A Vulkan semaphore allows you to express dependencies between command buffers. So by placing each command into a command buffer we can use a semaphore between these command buffers to emulate the OpenCL behaviour of in order queues and arbitrary command dependencies.

As with everything in Vulkan – the way to get performance is to explain to the driver exactly what you intend to do. In our example above where we are copying data from buffer A -> buffer B -> buffer C, we are really creating a dependency on our usage of buffer B. The copy from buffer B -> buffer C cannot begin until the copy from buffer A -> buffer B has completed. Vulkan gives us the tools to tell the driver about this dependency explicitly, and we can use them within a single command buffer.

The most analogous approach to the OpenCL example is to use a Vulkan event to encode the dependency.

VkEventCreateInfo eventCreateInfo = {
  VK_STRUCTURE_TYPE_EVENT_CREATE_INFO,
  nullptr,
  0
};

VkEvent event;

if (VK_SUCCESS != vkCreateEvent(
    device,
    &eventCreateInfo,
    nullptr,
    &event)) {
  // ... error!
}

Note that we create the event explicitly with Vulkan, unlike in OpenCL where any clEnqueue* command has an optional out_event parameter as the last parameter.

VkBuffer bufferA, bufferB, bufferC; // previously created
VkCommandBuffer commandBuffer; // previously created
VkQueue queue; // previously created
VkCommandBufferBeginInfo commandBufferBeginInfo; // previously initialized

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer,
    &commandBufferBeginInfo)) {
  // ... error!
}

vkCmdCopyBuffer(commandBuffer, bufferA, bufferB, 1, &bufferCopy);

vkCmdSetEvent(
    commandBuffer, 
    event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT);

VkMemoryBarrier memoryBarrier = {
  VK_STRUCTURE_TYPE_MEMORY_BARRIER,
  nullptr,
  VK_ACCESS_MEMORY_WRITE_BIT,
  VK_ACCESS_MEMORY_READ_BIT
};

vkCmdWaitEvents(
    commandBuffer,
    1,
    &event,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    1,
    &memoryBarrier,
    0,
    nullptr,
    0,
    nullptr);

vkCmdCopyBuffer(commandBuffer, bufferB, bufferC, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer)) {
  // ... error!
}
VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,
  0,
  0,
  0,
  1,
  &commandBuffer,
  0,
  0,
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    VK_NULL_HANDLE)) {
  // ... error!
}

So to do a similar thing to OpenCL’s event chaining semantics we:

  1. add our buffer A -> buffer B copy command
  2. set an event that will trigger when all previous commands are complete – in our case the set of previous commands is the single copy buffer command
  3. wait for the previous event to complete, specifying that all memory operations that performed a write before this wait must be resolved, and that all read operations after this event can read them
  4. add our buffer B -> buffer C copy command

Now we can be even more explicit with Vulkan and specifically use VK_ACCESS_TRANSFER_READ_BIT and VK_ACCESS_TRANSFER_WRITE_BIT – but I’m using the much more inclusive VK_ACCESS_MEMORY_READ_BIT and VK_ACCESS_MEMORY_WRITE_BIT to be clear what OpenCL will be doing implicitly for you as a user.

Dependencies between multiple cl_command_queue’s / VkCommandBuffer’s

When synchronizing between multiple cl_command_queue’s in OpenCL we use the exact same mechanism as with one queue.

cl_buffer bufferA, bufferB, bufferC; // previously created
cl_command_queue queue1, queue2; // previously created

cl_event event;

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue1,
    bufferA,
    bufferB,
    0,
    0,
    42,
    0,
    nullptr,
    &event)) {
  // ... error!
}

if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue2,
    bufferB,
    bufferC,
    0,
    0,
    42,
    1,
    &event,
    nullptr)) {
  // ... error!
}

The copy buffer command submitted to queue2 will not begin executing until the copy buffer command submitted to queue1 has completed. Having the same mechanism for creating dependencies within a queue and outwith a queue is a very nice thing from a user perspective – there is one true way to create synchronization between commands in OpenCL.

In Vulkan, when we want to create a dependency between two VkCommandBuffer’s, the easiest way is to use the semaphore approach I showed above. You could also use a VkEvent that is triggered at the end of one command buffer and waited on at the beginning of another. If you want to amortize the cost of doing multiple submits to the same queue, then use the event approach.

You can also use both of these mechanisms to create dependencies between multiple Vulkan queues. Remember that a Vulkan queue can be thought of as an exposition of some physical concurrency in the hardware, or in other words, running things on two distinct queues concurrently can lead to a performance improvement.

I recommend using a semaphore as the mechanism to encode dependencies between queues for the most part as it is simpler to get right.

The main place the event approach wins is when you have a long command buffer, where after only a few commands you can unblock the concurrently runnable queue to begin execution. In this case you’d be better off using an event, as that will enable the other queue to begin executing much earlier than would otherwise be possible.

clEnqueueBarrierWithWaitList vs vkCmdPipelineBarrier

Both OpenCL and Vulkan have a barrier that acts as a memory and execution barrier. When you have a pattern whereby you have N commands that must have completed execution before another M commands begin, a barrier is normally the answer.

// N commands before here...

if (CL_SUCCESS != clEnqueueBarrierWithWaitList(
    queue,
    0,
    nullptr,
    nullptr)) {
  // ... error!
}

// M commands after here will only begin once
// the previous N commands have completed!

And the corresponding Vulkan:

VkMemoryBarrier memoryBarrier = {
  VK_STRUCTURE_TYPE_MEMORY_BARRIER,
  nullptr,
  VK_ACCESS_MEMORY_WRITE_BIT,
  VK_ACCESS_MEMORY_READ_BIT
};

// N commands before here...

vkCmdPipelineBarrier(
    commandBuffer,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    0, // dependency flags
    1,
    &memoryBarrier,
    0,
    nullptr,
    0,
    nullptr);

// M commands after here will only begin once
// the previous N commands have completed!

What’s next?

After this monstrous dive into porting OpenCL’s synchronization mechanisms to Vulkan, in the next post we’ll look at the differences between OpenCL’s kernels and Vulkan’s pipelines – stay tuned!

16 Jun

OpenCL -> Vulkan: A Porting Guide (#2)

Vulkan is the newest kid on the block when it comes to cross-platform, widely supported, GPGPU compute. Vulkan’s primacy as the high performance rendering API powering the latest versions of Android, coupled with Windows and Linux desktop drivers from all major vendors means that we have a good way to run compute workloads on a wide range of devices.

OpenCL is the venerable old boy of GPGPU these days – having been around since 2009. A huge variety of software projects have made use of OpenCL as their way to run compute workloads enabling them to speed up their applications.

Given Vulkan’s rising prominence, how does one port from OpenCL to Vulkan?

This is a series of blog posts on how to port from OpenCL to Vulkan:

  1. OpenCL -> Vulkan: A Porting Guide (#1)

In this post, we’ll cover porting from OpenCL’s cl_command_queue to Vulkan’s VkQueue.

cl_command_queue -> VkCommandBuffer and VkQueue

OpenCL made a poor choice when cl_command_queue was designed. A cl_command_queue is an amalgamation of two very distinct things:

  1. A collection of workloads to run on some hardware
  2. A thing that will run various workloads and allow interactions between them

Vulkan broke this into its two constituent parts: for 1 we have a VkCommandBuffer, an encapsulation of one or more commands to run on a device, and for 2 we have a VkQueue, the thing that will actually run these commands and allow us to synchronize on the result.

Without diving too deeply, Vulkan’s approach allows for a selection of commands to be built once, and then run multiple times. For a huge number of compute workloads we run on datasets, we’re running the same set of commands thousands of times – and Vulkan allows us to amortise the cost of building up this collection of commands to run.

Back to OpenCL, we use clCreateCommandQueue (for pre 2.0) / clCreateCommandQueueWithProperties to create this amalgamated ‘collection of things I want you to run and a way of running them’. We’ll enable CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE as that is the behaviour of a Vulkan VkQueue (although remember that not all OpenCL devices actually support out of order queues – I’m doing this to allow the mental mapping of how Vulkan executes command buffers on queues to bake into your mind).

cl_queue_properties queueProperties[3] = {
    CL_QUEUE_PROPERTIES,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
    0
};

cl_int errorcode;

cl_command_queue queue = clCreateCommandQueueWithProperties(
    context,
    device,
    queueProperties,
    &errorcode);

if (CL_SUCCESS != errorcode) {
  // ... error!
}

The corresponding object in Vulkan is the VkQueue – which we get from the device, rather than creating as OpenCL does. This is because a queue in Vulkan is more like a physical aspect of the device, rather than some software construct – this isn’t mandated in the specification, but it’s a useful mental model to adopt when thinking about Vulkan’s queues.

Remember that when we created our VkDevice we requested which queue families we wanted to use with the device? Now to actually get a queue that supports compute, we have to choose one of the queue family indices that supported compute, and get the corresponding VkQueue from that queue family.

VkQueue queue;

uint32_t queueFamilyIndex = UINT32_MAX;

for (uint32_t i = 0; i < queueFamilyProperties.size(); i++) {
  if (VK_QUEUE_COMPUTE_BIT & queueFamilyProperties[i].queueFlags) {
    queueFamilyIndex = i;
    break;
  }
}

if (UINT32_MAX == queueFamilyIndex) {
  // ... error!
}

vkGetDeviceQueue(device, queueFamilyIndex, 0, &queue);

clEnqueue* vs vkCmd*

To actually execute something on a device, OpenCL uses commands that begin with clEnqueue* – these commands will enqueue work onto a command queue and possibly begin executing it. Why possibly? OpenCL is utterly vague on when commands actually begin executing. The specification states that a call to clFlush, clFinish, or clWaitForEvents on an event that is being signalled by a previously enqueued command on a command queue will guarantee that the device has actually begun executing. It is entirely valid for an implementation to begin executing work when the clEnqueue* command is called, and equally valid for the implementation to delay until a bunch of clEnqueue* commands are in the queue and the corresponding clFlush/clFinish/clWaitForEvents is called.

cl_mem src, dst; // Two previously created buffers

cl_event event;
if (CL_SUCCESS != clEnqueueCopyBuffer(
    queue,
    src,
    dst,
    0, // src offset
    0, // dst offset
    42, // size in bytes to copy
    0,
    nullptr,
    &event)) {
  // ... error!
}

// If we were going to enqueue more stuff on the command queue,
// but wanted the above command to definitely begin execution,
// we'd call flush here.
if (CL_SUCCESS != clFlush(queue)) {
  // ... error!
}

// We could either call finish...
if (CL_SUCCESS != clFinish(queue)) {
  // ... error!
}

// ... or wait for the event we used!
if (CL_SUCCESS != clWaitForEvents(1, &event)) {
  // ... error!
}

In contrast, Vulkan requires us to submit all our commands into a VkCommandBuffer. First we need to create the command buffer.

VkCommandPoolCreateInfo commandPoolCreateInfo = {
  VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO,
  0,
  0,
  queueFamilyIndex
};

VkCommandPool commandPool;

if (VK_SUCCESS != vkCreateCommandPool(
    device,
    &commandPoolCreateInfo,
    0,
    &commandPool)) {
  // ... error!
}

VkCommandBufferAllocateInfo commandBufferAllocateInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
  0,
  commandPool,
  VK_COMMAND_BUFFER_LEVEL_PRIMARY,
  1 // We are creating one command buffer.
};

VkCommandBuffer commandBuffer;

if (VK_SUCCESS != vkAllocateCommandBuffers(
    device,
    &commandBufferAllocateInfo,
    &commandBuffer)) {
  // ... error!
}

Now we have our command buffer with which we can queue up commands to execute on a Vulkan queue.

VkBuffer src, dst; // Two previously created buffers

VkCommandBufferBeginInfo commandBufferBeginInfo = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
  0,
  VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT,
  0
};

if (VK_SUCCESS != vkBeginCommandBuffer(
    commandBuffer,
    &commandBufferBeginInfo)) {
  // ... error!
}

VkBufferCopy bufferCopy = {
  0, // src offset
  0, // dst offset
  42 // size in bytes to copy
};

vkCmdCopyBuffer(commandBuffer, src, dst, 1, &bufferCopy);

if (VK_SUCCESS != vkEndCommandBuffer(commandBuffer)) {
  // ... error!
}

VkFenceCreateInfo fenceCreateInfo = {
  VK_STRUCTURE_TYPE_FENCE_CREATE_INFO,
  0,
  0
};

VkFence fence;

if (VK_SUCCESS != vkCreateFence(
    device,
    &fenceCreateInfo,
    0,
    &fence)) {
  // ... error!
}

VkSubmitInfo submitInfo = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,
  0,              // pNext
  0,              // wait semaphore count
  0,              // wait semaphores
  0,              // wait dst stage mask
  1,              // command buffer count
  &commandBuffer, // command buffers
  0,              // signal semaphore count
  0,              // signal semaphores
};

if (VK_SUCCESS != vkQueueSubmit(
    queue,
    1,
    &submitInfo,
    fence)) {
  // ... error!
}

// We can either wait on our commands to complete by fencing...
if (VK_SUCCESS != vkWaitForFences(
    device,
    1,
    &fence,
    VK_TRUE,
    UINT64_MAX)) {
  // ... error!
}

// ... or waiting for the entire queue to have finished...
if (VK_SUCCESS != vkQueueWaitIdle(queue)) {
  // ... error!
}

// ... or even for the entire device to be idle!
if (VK_SUCCESS != vkDeviceWaitIdle(device)) {
  // ... error!
}

Vulkan gives us many more ways to synchronize on the host for when our workload is complete. We can specify a VkFence at queue submission to wait on one or more command buffers in that submit, we can wait for the queue to be idle, or even wait for the entire device to be idle! Fences and command buffers can be reused by calling vkResetFences and vkResetCommandBuffer respectively – note that the command buffer can be reused for an entirely different set of commands to be executed. If you wanted to resubmit the exact same command buffer, you’d have to remove the VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT flag in the VkCommandBufferBeginInfo struct above.
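
As a minimal sketch (reusing the device, fence, and command buffer from above), resetting both for reuse looks something like this – note that resetting an individual command buffer requires the pool to have been created with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT:

// Put the fence back into the unsignaled state so it can be used
// for another submission.
if (VK_SUCCESS != vkResetFences(device, 1, &fence)) {
  // ... error!
}

// Reset the command buffer so a new set of commands can be
// recorded into it (flags of 0 keeps the allocations around).
if (VK_SUCCESS != vkResetCommandBuffer(commandBuffer, 0)) {
  // ... error!
}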

So a crucial thing to note here – synchronizing on a cl_command_queue is similar to a VkQueue, but the mechanisms are not identical.

We’ll cover these queue synchronization mechanisms in more detail in the next post in the series.

06 Jun

OpenCL -> Vulkan: A Porting Guide (#1)

Vulkan is the newest kid on the block when it comes to cross-platform, widely supported, GPGPU compute. Vulkan’s primacy as the high performance rendering API powering the latest versions of Android, coupled with Windows and Linux desktop drivers from all major vendors means that we have a good way to run compute workloads on a wide range of devices.

OpenCL is the venerable old boy of GPGPU these days – having been around since 2009. A huge variety of software projects have made use of OpenCL as their way to run compute workloads enabling them to speed up their applications.

Given Vulkan’s rising prominence, how does one port from OpenCL to Vulkan?

This is part 1 of my guide for how things map between the APIs!

cl_platform_id -> VkInstance

In OpenCL, the first thing you do is get the platform identifiers (using clGetPlatformIDs).

// We do not strictly need to initialize this to 0 (as it'll
// be set by clGetPlatformIDs), but given a lot people do
// not check the error code returns, it's safer to 0
// initialize.
cl_uint numPlatforms = 0;
if (CL_SUCCESS != clGetPlatformIDs(
    0,
    nullptr,
    &numPlatforms)) {
  // ... error!
}

std::vector<cl_platform_id> platforms(numPlatforms);

if (CL_SUCCESS != clGetPlatformIDs(
    platforms.size(),
    platforms.data(),
    nullptr)) {
  // ... error!
}

Each cl_platform_id is a handle into an individual vendor’s OpenCL driver – if you had an AMD and an NVIDIA implementation of OpenCL on your system, you’d get two cl_platform_id’s returned.

Vulkan is different here – instead of getting one or more handles to individual vendors’ implementations, we instead create a single VkInstance (via vkCreateInstance).

const VkApplicationInfo applicationInfo = {
  VK_STRUCTURE_TYPE_APPLICATION_INFO,
  0,
  "MyAwesomeApplication",
  0,
  "",
  0,
  VK_MAKE_VERSION(1, 0, 0)
};
 
const VkInstanceCreateInfo instanceCreateInfo = {
  VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
  0,
  0,
  &applicationInfo,
  0,
  0,
  0,
  0
};
 
VkInstance instance;
if (VK_SUCCESS != vkCreateInstance(
    &instanceCreateInfo,
    0,
    &instance)) {
  // ... error!
}

This single instance allows us to access multiple vendor implementations of the Vulkan API through a single object.

cl_device_id -> VkPhysicalDevice

In OpenCL, you can query one or more cl_device_id’s from each cl_platform_id that we previously queried (via clGetDeviceIDs). When querying for a device, we can specify a cl_device_type, where you can basically ask the driver to give you its default device (normally a GPU) or for a specific device type. We’ll use CL_DEVICE_TYPE_ALL, in that we are instructing the driver to return all the devices it knows about, and we can choose from them.

cl_uint numDevices = 0;

for (cl_uint i = 0; i < platforms.size(); i++) {
  // We do not strictly need to initialize this to 0 (as it'll
  // be set by clGetDeviceIDs), but given a lot people do
  // not check the error code returns, it's safer to 0
  // initialize.
  cl_uint numDevicesForPlatform = 0;

  if (CL_SUCCESS != clGetDeviceIDs(
      platforms[i],
      CL_DEVICE_TYPE_ALL,
      0,
      nullptr,
      &numDevicesForPlatform)) {
    // ... error!
  }

  numDevices += numDevicesForPlatform;
}

std::vector<cl_device_id> devices(numDevices);

// reset numDevices as we'll use it for our insertion offset
numDevices = 0;

for (cl_uint i = 0; i < platforms.size(); i++) {
  cl_uint numDevicesForPlatform = 0;

  if (CL_SUCCESS != clGetDeviceIDs(
      platforms[i],
      CL_DEVICE_TYPE_ALL,
      0,
      nullptr,
      &numDevicesForPlatform)) {
    // ... error!
  }

  if (CL_SUCCESS != clGetDeviceIDs(
      platforms[i],
      CL_DEVICE_TYPE_ALL,
      numDevicesForPlatform,
      devices.data() + numDevices,
      nullptr)) {
    // ... error!
  }

  numDevices += numDevicesForPlatform;
}

The code above is a bit of a mouthful – but it is the easiest way to get every device that the system knows about.

In contrast, since Vulkan gave us a single VkInstance, we query that single instance for all of the VkPhysicalDevice’s it knows about (via vkEnumeratePhysicalDevices). A Vulkan physical device is a link to the actual hardware that the Vulkan code is going to execute on.

uint32_t physicalDeviceCount = 0;

if (VK_SUCCESS != vkEnumeratePhysicalDevices(
    instance,
    &physicalDeviceCount,
    0)) {
  // ... error!
}

std::vector<VkPhysicalDevice> physicalDevices(physicalDeviceCount);

if (VK_SUCCESS != vkEnumeratePhysicalDevices(
    instance,
    &physicalDeviceCount,
    physicalDevices.data())) {
  // ... error!
}

A prominent API design fork can be seen between vkEnumeratePhysicalDevices and clGetDeviceIDs – Vulkan reuses the integer return parameter to the function (the parameter that lets you query the number of physical devices present) to also pass into the driver the number of physical devices we want filled out. In contrast, OpenCL uses an extra parameter for this. These patterns are repeated throughout both APIs.

cl_context -> VkDevice

Here is where it gets trickier between the APIs. OpenCL has a notion of a context – you can think of this object as your way, as the user, to view and interact with what the system is doing. OpenCL allows multiple devices that belong to a single platform to be shared within a context. In contrast, Vulkan is fixed to having a single physical device per its ‘context’, which Vulkan calls a VkDevice.

To make the porting easier, and because in all honesty I’ve yet to see any real use-case or benefit from having multiple OpenCL devices in a single context, we’ll make our OpenCL code create its cl_context using a single cl_device_id (via clCreateContext).

// One of the devices in our std::vector
cl_device_id device = ...;

cl_int errorcode;

cl_context context = clCreateContext(
    nullptr,
    1,
    &device,
    nullptr,
    nullptr,
    &errorcode);

if (CL_SUCCESS != errorcode) {
  // ... error!
}

The above highlights the single biggest travesty in the OpenCL API – the error code has changed from being something returned from the API call, to an optional pointer parameter at the end of the signature. In API design, I’d say this is rule #1 in how not to mess up an API (If you’re interested, these are two great API talks Designing and Evaluating Reusable Components by Casey Muratori and Hourglass Interfaces for C++ APIs by Stefanus Du Toit).

For Vulkan, when creating our VkDevice object, we specifically enable the features we want to use from the device upfront. The easy way to do this is to first call vkGetPhysicalDeviceFeatures, and then pass the result of this into our create device call, enabling all features that the device supports.

When creating our VkDevice, we need to explicitly request which queues we want to use. OpenCL has no real analogous concept to this – the naive comparison is to compare VkQueue’s against cl_command_queue’s, but I’ll show in a later post that this is a wrong conflation. Suffice to say, for our purposes we’ll query for all queues that support compute functionality, as that is almost what OpenCL is doing behind the scenes in the cl_context.

// One of the physical devices in our std::vector
VkPhysicalDevice physicalDevice = ...;

VkPhysicalDeviceFeatures physicalDeviceFeatures;

vkGetPhysicalDeviceFeatures(
    physicalDevice,
    &physicalDeviceFeatures);

uint32_t queueFamilyPropertiesCount = 0;

vkGetPhysicalDeviceQueueFamilyProperties(
    physicalDevice,
    &queueFamilyPropertiesCount,
    0);

// Create a temporary std::vector to allow us to query for
// all the queues our physical device supports.
std::vector<VkQueueFamilyProperties> queueFamilyProperties(
    queueFamilyPropertiesCount);

vkGetPhysicalDeviceQueueFamilyProperties(
    physicalDevice,
    &queueFamilyPropertiesCount,
    queueFamilyProperties.data());

uint32_t numQueueFamiliesThatSupportCompute = 0;

for (uint32_t i = 0; i < queueFamilyProperties.size(); i++) {
  if (VK_QUEUE_COMPUTE_BIT &
      queueFamilyProperties[i].queueFlags) {
    numQueueFamiliesThatSupportCompute++;
  }
}

// Create a temporary std::vector to allow us to specify all
// queues on device creation
std::vector<VkDeviceQueueCreateInfo> queueCreateInfos(
    numQueueFamiliesThatSupportCompute);

// Reset so we can re-use as an index
numQueueFamiliesThatSupportCompute = 0;

for (uint32_t i = 0; i < queueFamilyProperties.size(); i++) {
  if (VK_QUEUE_COMPUTE_BIT &
      queueFamilyProperties[i].queueFlags) {
    const float queuePriority = 1.0f;

    const VkDeviceQueueCreateInfo deviceQueueCreateInfo = {
        VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
        0,
        0,
        i,
        1,
        &queuePriority
    };

    queueCreateInfos[numQueueFamiliesThatSupportCompute] =
        deviceQueueCreateInfo;

    numQueueFamiliesThatSupportCompute++;
  }
}

const VkDeviceCreateInfo deviceCreateInfo = {
    VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
    0,                                 // pNext
    0,                                 // flags
    (uint32_t)queueCreateInfos.size(), // queue create info count
    queueCreateInfos.data(),           // queue create infos
    0,                                 // enabled layer count
    0,                                 // enabled layer names
    0,                                 // enabled extension count
    0,                                 // enabled extension names
    &physicalDeviceFeatures            // enable every supported feature
};

VkDevice device;
if (VK_SUCCESS != vkCreateDevice(
    physicalDevice,
    &deviceCreateInfo,
    0,
    &device)) {
  // ... error!
}

Vulkan’s almost legendary verbosity strikes here – we’re having to write a lot more code than the equivalent in OpenCL to get an almost analogous handle. The plus here is that for the Vulkan driver, it can do a lot more upfront allocations because a much higher proportion of its state is known at creation time – that is the fundamental approach of Vulkan, we are trading upfront verbosity for a more efficient application overall.

Ok – so we’ve now got the API to the point where we can think about actually using the plethora of hardware available from these APIs! Stay tuned for the next in the series where I’ll cover porting from OpenCL’s cl_command_queue to Vulkan’s VkQueue.

11 Mar

Adding JSON 5 to json.h

I’ve added JSON 5 support to my json.h library.

For those not in the know, JSON 5 (http://json5.org/) is a modern update to the JSON standard, including some cool features like unquoted keys, single quoted keys and strings, hexadecimal numbers, Infinity and NaN numbers, and C style comments!

In keeping with the design of my lib – each of the features can be turned on individually if you don’t want the full shebang, or you can just add json_parse_flags_allow_json5 to enable the entire feature set.

The GitHub pull request brings in the functionality, and it is merged into master too!

16 Oct

Adding loops (MPC -> LLVM for the Neil Language #5)

This is part of a series, the first four parts of the series can be found at:

  1. Hooking up MPC & LLVM
  2. Cleaning up the parser
  3. Adding type identifiers
  4. Adding branching

In this post, we’ll cover how to add loops to our little toy language I’m calling Neil – Not Exactly an Intermediate Language.

To keep things simple, I’ve decided to add loops of the form:

while (<expression> <comparison operator> <expression>) {
  <statement>*
}

Grammar Changes

We need to add a new kind of statement to the grammar, one for our while loops:

stmt : \"return\" <lexp>? ';' 
     | <ident> '(' <ident>? (',' <ident>)* ')' ';' 
     | <typeident> ('=' <lexp>)? ';' 
     | <ident> '=' <lexp> ';' 
     | \"if\" '(' <bexp> ')' '{' <stmt>* '}'
     | \"while\" '(' <bexp> ')' '{' <stmt>* '}' ;

And with this one change, because we already handled boolean expressions in the additions for branching, we can handle our loops.

How to Handle Loops

Loops are basically branching – the only caveat is that we are going to branch backwards to previous, already executed, basic blocks.

(Figure: basic blocks for a while loop – ‘entry’ conditionally branches to ‘while_body’ or ‘while_merge’, and ‘while_body’ conditionally branches back to itself or on to ‘while_merge’.)

For every while statement we create two new basic blocks. Whatever basic block we are in (in the above example one called ‘entry’) will then conditionally enter the loop by branching either to the ‘while_body’ block (that will contain any statements within the while loop), or by branching to the ‘while_merge’ basic block. Within the body of the loop, the ‘while_body’ basic block will then conditionally (based on the bexp part of the grammar change) loop back to itself, or to the ‘while_merge’. This means that all loops converge as the loop finishes – they will always execute ‘while_merge’ whether the loop is entered or not.

Handling Whiles

To handle while statements (a rough LLVM-C sketch follows the list):

  • we get an LLVMValueRef for the boolean expression – using LLVMBuildICmp or LLVMBuildFCmp to do so
  • once we have our expression, we increment the scope as all symbols need to be in the new scope level
  • we create two new basic blocks, one for ‘while_body’ and one for ‘while_merge’
  • we use LLVMBuildCondBr to branch, based on the LLVMValueRef for the condition, to either ‘while_body’ or ‘while_merge’
  • we then set the LLVMBuilderRef that we are using to build in the ‘while_body’ basic block
  • then we lower the statements in the while statement (which will all be placed within the ‘while_body’ basic block)
  • and after all statements in the while statement have been processed, we re-evaluate the boolean expression for the while loop, then use LLVMBuildCondBr to conditionally branch to ‘while_merge’, or back to ‘while_body’ if the while loop had more iterations required
  • and lastly set the LLVMBuilderRef to add any new statements into the ‘while_merge’ basic block
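
A minimal sketch of those steps using the LLVM-C API – names like lowerBexp, lowerStmts, builder, function, and whileNode are hypothetical stand-ins for the real lowering code, not the actual Neil sources:

// Evaluate the loop condition in the current basic block.
LLVMValueRef condition = lowerBexp(builder, whileNode);

LLVMBasicBlockRef whileBody =
    LLVMAppendBasicBlock(function, "while_body");
LLVMBasicBlockRef whileMerge =
    LLVMAppendBasicBlock(function, "while_merge");

// Conditionally enter the loop, or skip it entirely.
LLVMBuildCondBr(builder, condition, whileBody, whileMerge);

// Lower the body of the loop into 'while_body'.
LLVMPositionBuilderAtEnd(builder, whileBody);
lowerStmts(builder, whileNode);

// Re-evaluate the condition and either loop back or exit.
condition = lowerBexp(builder, whileNode);
LLVMBuildCondBr(builder, condition, whileBody, whileMerge);

// Any statements after the while go into 'while_merge'.
LLVMPositionBuilderAtEnd(builder, whileMerge);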

And it really is that simple! All the changes we made previously to handle if statements meant that this was a really easy change to add to the language.

Result

Now our simple example looks like so:

i32 foo(i32 x) {
  i32 y = x * 5;
  while (y > 13) {
    if (y < 4) { i32 z = x; y = z; }
    y = y + 42;
  }
  return y;
}
i32 main() {
  return foo(13);
}

And turns into the following LLVM IR:

define i32 @foo(i32 %x) {
entry:
  %y = alloca i32
  %0 = mul i32 %x, 5
  store i32 %0, i32* %y
  %1 = load i32, i32* %y
  %2 = icmp sgt i32 %1, 13
  br i1 %2, label %while_body, label %while_merge

while_body:                     ; preds = %if_merge, %entry
  %3 = load i32, i32* %y
  %4 = icmp slt i32 %3, 4
  br i1 %4, label %if_true, label %if_merge

while_merge:                    ; preds = %if_merge, %entry
  %5 = load i32, i32* %y
  ret i32 %5

if_true:                        ; preds = %while_body
  %z = alloca i32
  store i32 %x, i32* %z
  %6 = load i32, i32* %z
  store i32 %6, i32* %y
  br label %if_merge

if_merge:                       ; preds = %if_true, %while_body
  %7 = load i32, i32* %y
  %8 = add i32 %7, 42
  store i32 %8, i32* %y
  %9 = load i32, i32* %y
  %10 = icmp sgt i32 %9, 13
  br i1 %10, label %while_body, label %while_merge
}

define i32 @main() {
entry:
  %0 = call i32 @foo(i32 13)
  ret i32 %0
}

You can check out the full GitHub pull request for the feature here.

In the next post, we’ll look into how we can add support for pointers to the language, stay tuned!


06 Oct

Adding branching (MPC -> LLVM for the Neil Language #4)

This is part of a series, the first three parts of the series can be found at:

  1. Hooking up MPC & LLVM
  2. Cleaning up the parser
  3. Adding type identifiers

In this post, we’ll cover how to add branching support to our little toy language I’m calling Neil – Not Exactly an Intermediate Language.

To keep things simple, I’ve decided to add branching of the form:

if (<expression> <comparison operator> <expression>) {
  <statement>*
}

With the following caveats:

  • we will not support else branches
  • we will only support <, <=, >, >=, == and != comparison operators

With that in mind, let’s add it to Neil!

Grammar Changes

We need a new type of expression for our grammar – a boolean expression. This is an expression that evaluates to boolean (using a comparison operator).

bexp : <lexp>
       ('>' | '<' | \">=\" | \"<=\" | \"!=\" | \"==\")
       <lexp> ;                                               

Our boolean expression (bexp in the grammar) consists of a left expression (lexp), followed by one of the possible six supported comparison operators, followed by another lexp.

Now, we can modify statements in the grammar to add a new statement type for if statements.

stmt : \"return\" <lexp>? ';' 
     | <ident> '(' <ident>? (',' <ident>)* ')' ';' 
     | <typeident> ('=' <lexp>)? ';' 
     | <ident> '=' <lexp> ';' 
     | \"if\" '(' <bexp> ')' '{' <stmt>* '}' ;

And that is all the changes we need to the grammar.

How to Handle Branches

When handling branches, we are going to follow a really simple approach.

For every if statement, we’ll create two new basic blocks. Whatever basic block we are currently in will then conditionally branch between them: in the true case it will branch to the ‘if_true’ block, and otherwise to the ‘if_merge’ block. Within the ‘if_true’ block, when it has completed its conditional statements, it will always branch to the ‘if_merge’ block on exit. This has the really nice property that at the end of every branching sequence we always converge to exactly one active basic block for future statements.

Changing the Symbol Table

One thing of note is that in our Neil language, identifiers can be declared at any place a statement could be. This means that we are allowed to create variables within the if statement. The problem is that at present our symbol table assumes that all symbols declared within a function will be active for the duration of the function. We need to change our symbol table to be aware of the scope that a symbol currently inhabits. A really simple way to do this is to track which scope level a symbol was declared within.

  • when we enter a new function or if statement, we need to increment the scope
  • when we exit the function or if statement, we need to decrement the scope, and remove all symbols associated with that scope
  • when inserting symbols into the symbol table, they will be inserted at the current scope level

For simplicity, I use a std::vector of std::map’s, each map in the vector corresponding to a scope level. Then, when we are looking for a symbol we first look in the last element of the std::vector for the symbol, before iterating backwards through the vector. This allows us to reference symbols in higher scope levels too.
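
A minimal sketch of such a scoped symbol table (my own illustration, not the actual Neil sources) could look like:

#include <map>
#include <string>
#include <vector>

#include <llvm-c/Core.h> // for LLVMValueRef

// Each element of the outer vector is one scope level, mapping a
// symbol name to the LLVMValueRef that backs it.
static std::vector<std::map<std::string, LLVMValueRef>> symbolTable;

static void pushScope() { symbolTable.push_back({}); }

// Popping a scope removes every symbol declared within it.
static void popScope() { symbolTable.pop_back(); }

static void insertSymbol(const std::string &name, LLVMValueRef value) {
  symbolTable.back()[name] = value;
}

// Look in the innermost scope first, then walk outwards so symbols
// in higher scope levels can still be referenced.
static LLVMValueRef findSymbol(const std::string &name) {
  for (auto scope = symbolTable.rbegin(); scope != symbolTable.rend();
       ++scope) {
    auto found = scope->find(name);
    if (found != scope->end()) {
      return found->second;
    }
  }
  return nullptr; // not found
}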

Handling Ifs

To handle if statements:

  • we get an LLVMValueRef for the boolean expression – using LLVMBuildICmp or LLVMBuildFCmp to do so
  • once we have our expression, we increment the scope as all symbols need to be in the new scope level
  • we create two new basic blocks, one for ‘if_true’ and one for ‘if_merge’
  • we use LLVMBuildCondBr to branch, based on the LLVMValueRef for the condition, to either ‘if_true’ or ‘if_merge’
  • we then set the LLVMBuilderRef that we are using to build in the ‘if_true’ basic block
  • then we lower the statements in the if statement (which will all be placed within the ‘if_true’ basic block)
  • and after all statements in the if statement have been processed, we use LLVMBuildBr to unconditionally branch to ‘if_merge’
  • and lastly set the LLVMBuilderRef to add any new statements into the ‘if_merge’ basic block

And that’s it! It’s actually quite a simple set of steps to get branches working when you break it down into the constituent parts.

Result

Now our simple example looks like so:

i32 foo(i32 x) {
  i32 y = x * 5;
  if (y < 4) { i32 z = x; y = z; }
  y = y + 42;
  return y;
}
i32 main() {
  return foo(13);
}

And turns into the following LLVM IR:

define i32 @foo(i32 %x) {
entry:
  %y = alloca i32
  %0 = mul i32 %x, 5
  store i32 %0, i32* %y
  %1 = load i32, i32* %y
  %2 = icmp slt i32 %1, 4
  br i1 %2, label %if_true, label %if_merge

if_true:                                    ; preds = %entry
  %z = alloca i32
  store i32 %x, i32* %z
  %3 = load i32, i32* %z
  store i32 %3, i32* %y
  br label %if_merge

if_merge:                                   ; preds = %if_true, %entry
  %4 = load i32, i32* %y
  %5 = add i32 %4, 42
  store i32 %5, i32* %y
  %6 = load i32, i32* %y
  ret i32 %6
}

define i32 @main() {
entry:
  %0 = call i32 @foo(i32 13)
  ret i32 %0
}

You can check out the full GitHub pull request for the feature here.

In the next post, we’ll look into how we can add the other form of useful branching to the language – loops! Stay tuned.

29 Sep

Introducing YARI-V – an experiment on SPIR-V compression

SPIR-V is a simple binary intermediate language used for graphics shaders and compute kernels. Wearing my work hat (I work at Codeplay Software Ltd.) I have been contributing to the SPIR-V specification since 2014 as one of the authors. SPIR-V’s primary goals are (according to me):

  • Have a regular binary structure.
  • Be easily extendable.
  • Be easy to validate for correctness.
  • Be easy to produce from compiler toolchains.
  • Be easy to consume in tools and drivers.

To this end, one of the things that SPIR-V has not prioritised is the size of the resultant binaries. The awesome @aras_p wrote a great summary of the problem (and his tool SMOL-V) on his blog – SPIR-V Compression. The SMOL-V tool is a single C++ header/single C++ source file.

I’m a big fan of single C header libraries, and was curious if I could write a similar tool to his own, written in C, but try to use my knowledge of SPIR-V to get me a better compression ratio. In my previous blog posts ‘spirv-stats – a tool to output statistics of your SPIR-V shader modules‘ and ‘spirv-stats update – exposing more information‘ I tried to get an in-depth look into what is taking up the most space in the SPIR-V shaders that @aras_p was using for testing.

Then I began writing my own tool for compressing SPIR-V shaders that I’m calling YARI-V (a yari is a type of Japanese spear, which seemed appropriate as a sister encoding to SPEAR-V).

In the remainder of this post I’ll walk you through the steps I took to compress the SPIR-V shaders that @aras_p was using for testing, and compare and contrast the result of my own library YARI-V against SMOL-V.

Test Set

I didn’t have handy access to real world shaders like @aras_p had for his SMOL-V tool – so I simply used the same 341 shaders he tests SMOL-V with to test YARI-V against. The total size of the uncompressed shaders is 4868.47 kilobytes, and we’ll use a percentage of this size when evaluating each compression attempt.

Varint Encoding

The first thing I thought to do was use the varint encoding used in Google’s Protocol Buffers for everything. For the uninitiated, SPIR-V is word based – everything is held in 32 bit values. IDs are in the range [1..N), should start at 1, and increment from there as more IDs are required. This means that for small shaders, most IDs in use are going to be small unsigned integer numbers. Let’s take a look at the OpDecorate instruction as an example:

        word count   opcode   <id> target   decoration   literal*
bytes   2            2        4             4            4 (each)

As we can see, the opcode is made up of:

  • two bytes for the word count
  • two bytes for the opcode
  • four bytes for the <id> to target this decoration with
  • four bytes for the decoration itself (an enumeration of values in the range [0..N))
  • four bytes for each optional literal (some decorations can take other values)

What I did was take the word count and varint encode that (this value is normally very low for opcodes) – the only opcodes that could have a word count greater than 127 (the magic cutoff to fit within 1 byte using varint encoding) are the ones that take strings (like OpString, OpName, OpMemberName). This meant that word count was taking 1 byte instead of 2 in most cases.

Next, I varint encoded the opcode. Most of the opcodes we use are below the 127 cutoff for varint encoding, so we can encode this as 1 byte instead of 2 again. The worst case to fit within 2 bytes using varint encoding is 16383, and our maximum opcode at present is in the 4000 range, so we shouldn’t ever require more bytes than the original encoding by using varint.

Next, the <id>. In small shaders it would be normal for all <id>’s to be less than 127, but even in large shaders most <id>’s are likely to be lower than the 2 byte boundary for varint – the value 16383. Given that <id> took 4 bytes all the time previously, we are saving at least 2 bytes in nearly all the cases we care about.

The decoration used in OpDecorate currently has [0..44) possible values, so this will always fit inside 1 byte of our varint encoding.

And any literal used by the decoration we’ll just varint and hope that it’ll be worth it.
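
For reference, the varint scheme itself is tiny – a rough sketch (my own, not the actual YARI-V source) of encoding a single 32 bit value:

#include <stdint.h>

// Varint encode a 32 bit value: 7 bits of payload per byte, with
// the top bit set on every byte except the last. Values below 128
// take 1 byte, values below 16384 take 2 bytes, and the worst case
// is 5 bytes. Returns the number of bytes written to out.
static unsigned varint_encode(uint32_t value, uint8_t *out) {
  unsigned bytes = 0;
  do {
    const uint8_t byte = value & 0x7f;
    value >>= 7;
    out[bytes++] = value ? (byte | 0x80) : byte;
  } while (value);
  return bytes;
}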

After doing this for all opcodes in the SPIR-V shaders, we reduced our SPIR-V shader size to:

2279.97 kilobytes (46.8%)

So a pretty healthy start for reducing the size of the binaries!

OpLabel <-> OpNop

So after varint encoding, the next thing I looked at was the output from my spirv-stats tool. It showed that OpLabel was being used 9915 times in our shaders. OpLabel’s opcode value is 248 – which means it requires 2 bytes when varint encoded. So I decided to find another opcode whose value was less than 128 (and could thus fit within a 1 byte varint value) that wasn’t being used within our shaders. OpNop has the value 0 and is never used within our shaders, so I decided to swap the values of these two during encoding, and then swap them back during decoding.

The next thing I noticed was that OpLabel has a constant word count – the number of words it takes is the same for every use of the opcode. Given that this is constant, we can skip encoding the word count entirely, and simply infer its constant value during decoding.

2260.61 kilobytes 46.43%

This reduced the size by about 19 kilobytes.
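
The swap itself is trivial – roughly (using the SpvOp values from the standard Khronos spirv.h header), and since it is its own inverse the decoder applies exactly the same transformation:

#include "spirv.h" // the Khronos SPIR-V C header

// OpNop is 0 and OpLabel is 248, so swapping them lets the very
// common OpLabel opcode fit in a single varint byte.
static uint32_t swap_opcode(uint32_t opcode) {
  if (SpvOpLabel == opcode) return SpvOpNop;
  if (SpvOpNop == opcode) return SpvOpLabel;
  return opcode;
}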

Finding Moar Things to Swap

After realising that swapping opcodes whose values were 128 or greater with unused (or rarely used) opcodes whose values were below 128 worked, I decided to go through the next set of most used opcodes (as found from spirv-stats) and swap them, and where possible not encode the word count of the opcode if it was a constant.

I first swapped OpFMul <-> OpSourceContinued, and OpFAdd with OpSource:

2246.63 kilobytes 46.15%

Then I swapped OpBranch and OpSourceExtension, not encoding the word count of OpBranch:

2236.58 kilobytes 45.94%

Then I swapped OpFSub and OpUndef, not encoding OpFSub’s word count and also not encoding OpFAdd and OpFMul’s word count too:

2217.26 kilobytes 45.54%

In total another 43 kilobytes shaved off the size!

Delta Encoding

SPIR-V uses a compiler intermediate form known as Single-Static-Assignment (SSA for short) which means the results of opcodes are assigned to an <id> once, and that <id> is never reassigned to. This means that once we go over the 127 value boundary for an <id>, we are going to require 2 bytes for every subsequent <id> to be encoded.

For the most part, <id>’s will be linearly increasing through the length of the program, e.g. for the current opcode we can be quite confident that the previous opcode had the <id> of our <id> – 1. Given this fact, I delta encoded our <id> against the previous known <id>. There was a problem though – what if the previous <id> used was actually bigger than our <id>? This would result in the subtraction creating a large unsigned integer number, which would take 5 bytes to encode! To get round this, I used a lovely little bit twiddling hack called zig-zag encoding (used in Google’s Protocol Buffers, but explained really well here). Zig-zag encoding allows all integers in the range [-64..64) to be encoded using one byte when combined with our varint encoding, meaning that even if the previous <id> was actually larger than our own, we would still hopefully be able to encode the delta from it to our own <id> in 1 or 2 bytes (rather than a worst case of 5 bytes).
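
Zig-zag encoding is a one-liner – a rough sketch for 32 bit deltas:

#include <stdint.h>

// Zig-zag encode a signed delta so small negative and positive
// values both map to small unsigned values:
//   0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
// The result is then varint encoded as before.
static uint32_t zigzag_encode(int32_t delta) {
  return ((uint32_t)delta << 1) ^ (uint32_t)(delta >> 31);
}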

I also thought I’d try delta encoding our types separately. In general the <id>’s assigned to types are close to each other in the SPIR-V shaders, because types are all declared in one section at the beginning of the shaders. So I thought by delta encoding the types I’d also get a nice little compression.

2111.88 kilobytes 43.38%

So doing this shaved a lovely 106 kilobytes off of YARI-V encoded size.

Never Encode a Constant Word Count

I’d already shown that not encoding the word count where possible would save us at least 1 byte per opcode we could do this for, so I did a pass over all the opcodes in SPIR-V to not encode the word count for all opcodes where the word count was a constant.

2081.48 kilobytes 42.75%

This shaved a further 30 kilobytes off our YARI-V encoded size.

Fake Opcodes

I was a little disappointed that never encoding the constant word count only knocked 30 kilobytes off of our encoded size – then I realised, the most used opcodes in our SPIR-V shaders are all variable length (as per the specification). But are they really? I added some output to spirv-stats to show when OpLoad and OpStore had the optional additional Memory Access literal – and it turns out exactly 0 of our OpLoad’s and OpStore’s used it! So for our purposes, OpLoad and OpStore had a constant word count, they just didn’t know it.

What I did was split OpLoad into two encodings, OpLoad, and a new fake opcode called OpLoadWithMemoryAccess. I set the value for OpLoadWithMemoryAccess to above 500 (the largest SPIR-V opcode value in use at present is in the low 300’s, so I hope this is safe enough for the time being), and then when encoding both our OpLoad and OpLoadWithMemoryAccess opcodes their word counts are constant (4 for OpLoad, 5 for OpLoadWithMemoryAccess). Doing this allowed me to save 1-2 bytes for each use of OpLoad (which accounts for 16% of the opcodes in our SPIR-V shaders!)

2032.74 kilobytes 41.75%

Next I did the same for OpStore, making a new fake opcode OpStoreWithMemoryAccess, and not encoding the now constant word count for OpStore and OpStoreWithMemoryAccess.

2004.19 kilobytes 41.17%

In total shaving 77 kilobytes off of our YARI-V encoded size.

More Decorations

OpDecorate is the third most used opcode with 8.28% of the total opcodes. I added some information to spirv-stats to output how many of the decorations had no literals and how many had one literal (none of the decorations available today take more than one). 71% of the uses have no literals, and 29% have one. So I decided to split OpDecorate into three encodings: one for a decoration that has no literals, one for a decoration that has exactly one literal, and one for decorations with two or more literals (to future proof the encoder). This allowed me to make all of our uses of OpDecorate have a constant word count, meaning we do not need to encode it. I also swapped these new fake opcodes with OpLine and OpExtension so their opcode values were less than 128.

1996.46 kilobytes 41.01%

Shaving 8 kilobytes off of the YARI-V encoding.

Moar Member Decorations

Given the success of splitting OpDecorate, I decided to do the same with OpMemberDecorate, which is the sixth most used opcode in our SPIR-V shaders. 90% of the uses of OpMemberDecorate had 1 literal, so I split it into three encodings (just like I did with OpDecorate): one for a decoration that has no literals, one for a decoration that has exactly one literal, and one for decorations with two or more literals. I also swapped these new fake opcodes with OpExtInstImport and OpMemoryModel.

I also noticed that I wasn’t delta encoding the <id>’s for the new fake OpDecorate or OpMemberDecorate variants, so I did that too.

1967.65 kilobytes 40.42%

All of this resulted in shaving a further 29 kilobytes off of our YARI-V encoding.

(Non) Initialised Variables

I added a check to spirv-stats to see if any of the OpVariables we were declaring had initialisers – 0 of them did. So I added a separate encoding for any OpVariable that has an initializer, which meant I could skip encoding the word count for the common, initializer-less, case.

1949.05 kilobytes 40.03%

I then applied the same logic to OpConstant – all of our constants were using one word for the actual constant (all of our constants were 32 bit integers and floats), so I could split out the encoding of an OpConstant holding a 64 bit integer or double into a separate opcode, allowing me to not output the word count of our OpConstant’s.

1938.48 kilobytes 39.82%

Shaving 29 kilobytes off of our YARI-V encoding.

Access Chains

To get a pointer into a composite (say an array or struct) we use OpAccessChain to work out what we want to load. I added some information to spirv-stats to output the number of indices being used with OpAccessChain. 78% were using one index (say indexing into an array), 19% were using two indices (used if you were indexing into an array of structs), and 2% were using three indices.

I decided to split OpAccessChain into four encodings: one that contains one index, one that contains two indices, one that contains three indices, and one for all other index combinations. I also swapped these new fake opcodes with OpExecutionMode, OpCapability and OpTypeVoid.

1919.76 kilobytes 39.43%

Shaving 19 kilobytes off of our YARI-V encoding.

Everyday I’m Shuffling

OpVectorShuffle takes 3.4% of the opcodes in the SPIR-V shader module, but 6% of the size of the module (it’s a lot of bytes per opcode hit).

The first thing I noticed was that OpVectorShuffle was working on at most two vec4’s (the SPIR-V shaders I’m dealing with are used in Vulkan, where 4 element vectors are the maximum). So I decided to split OpVectorShuffle into four encodings: one that contains two components, one that contains three components, one that contains four components, and one for all other component combinations. I also swapped these with gaps in the SPIR-V opcode range at the 8, 13 & 18 opcode values.

1910.26 kilobytes 39.24%

Only 9 kilobytes shaved, which wasn’t so great. My next observation was that, when shuffling two vec4’s together, the maximum number of states each component literal could be in was 9, in the range [-1..8) – where -1 denotes that we want an undefined result in that component of the vector. I checked, and none of our encodings of OpVectorShuffle were using -1, so given that all of our literals are less than 8, we can use at most 3 bits to encode each literal! I extended the new OpVectorShuffle encodings I had previously made to encode the literals in at most 2 bytes (1 byte for the two literal case, 2 bytes for the three and four cases).

1892.43 kilobytes 38.87%

I next checked how many of our OpVectorShuffle’s were actually doing a swizzle – e.g. they were taking the same vector <id> for both vectors, and were only accessing values from the first vector. A whopping 82% of our OpVectorShuffle’s were doing exactly this, so I added some new fake opcodes for OpVectorSwizzle, using 2 bits to encode each literal (in a swizzle at most 4 elements of a vec4 are being shuffled around, which can be encoded in 2 bits).

1874.17 kilobytes 38.50%

Shaving a cool 36 kilobytes off of our YARI-V encoded size.
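
The swizzle detection might look something like the sketch below; it isn’t the actual YARI-V code, but the word offsets follow the SPIR-V layout of OpVectorShuffle, and the < 4 check leans on the vec4-only observation above:

#include <cstdint>

// words points at an OpVectorShuffle instruction:
// [0] word count/opcode, [1] result type, [2] result id,
// [3] vector 1 <id>, [4] vector 2 <id>, [5..] component literals.
inline bool isSwizzle(const uint32_t *words) {
  const uint32_t wordCount = words[0] >> 16;
  if (words[3] != words[4]) return false; // must name the same vector twice
  for (uint32_t i = 5; i < wordCount; i++) {
    // Only read from the first vector. A full implementation would compare
    // against that vector's component count; < 4 matches these vec4 shaders.
    if (words[i] >= 4) return false;
  }
  return true;
}

// In a swizzle every literal fits in 2 bits, so up to 4 literals pack
// into a single byte.
inline uint8_t packSwizzleLiterals(const uint32_t *words) {
  const uint32_t wordCount = words[0] >> 16;
  uint8_t packed = 0;
  for (uint32_t i = 5; i < wordCount; i++) {
    packed |= static_cast<uint8_t>((words[i] & 0x3u) << (2 * (i - 5)));
  }
  return packed;
}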

Swapshop

I noticed that OpBranchConditional and OpSelectionMerge were being used enough that requiring 2 bytes to encode their opcodes was silly, so I swapped these with the CL-specific OpTypeEvent and OpTypeDeviceEvent for a further 6 kilobyte reduction in our YARI-V encoded size:

1868.01 kilobytes 38.37%

Composing

OpCompositeExtract and OpCompositeConstruct take a decent amount of space in the SPIR-V binary with 7% of the bytes dedicated to them.

I first split OpCompositeExtract into two encodings; one that has exactly one literal, and one for all other cases:

1859.28 kilobytes 38.19%

Then I split OpCompositeConstruct into four encodings; one that has one constituent, one that has two constituents, one that has three constituents, and one for all other cases:

1855.46 kilobytes 38.11%

Next, I noticed that OpCompositeExtract was being used mostly to lift a scalar from a vector for some scalar calculation. So I detected when OpCompositeExtract was being used with literals in the range [0..4), and added four encodings of OpCompositeExtract; one that assumes the literal is zero, one that assumes it is one, one that assumes it is two, and one that assumes it is three:

1843.40 kilobytes 37.86%
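
One possible shape for those literal-specialised encodings is sketched below – the helper and the base slot value are mine, purely for illustration:

#include <cstdint>
#include <optional>

// Hypothetical base slot for the four specialised encodings.
constexpr uint32_t kCompositeExtract0 = 100;

// words points at an OpCompositeExtract instruction:
// [0] word count/opcode, [1] result type, [2] result id,
// [3] composite <id>, [4..] literals. Returns a specialised fake opcode
// when there is exactly one literal and it is 0, 1, 2 or 3.
inline std::optional<uint32_t> specialiseCompositeExtract(const uint32_t *words) {
  const uint32_t wordCount = words[0] >> 16;
  if (wordCount == 5 && words[4] < 4) {
    return kCompositeExtract0 + words[4]; // literal baked into the opcode
  }
  return std::nullopt; // fall back to the general encoding
}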

Relaxing Precisely While Decorating

The most used decoration of OpDecorate was RelaxedPrecision – with 66% of the 23770 uses of the opcode encoding that. So I added a new fake opcode for OpDecorateRelaxedPrecision, allowing me to not actually encode the decoration for RelaxedPrecision and skip the unnecessary byte.

1828.03 kilobytes 37.55%

I then used the same logic on OpMemberDecorate. The most used decoration with OpMemberDecorate was for Offset – accounting for 90% of the 14332 uses of the opcode. I added a new fake opcode for OpMemberDecorateOffset, to skip outputting the decoration in this most used case.

1816.17 kilobytes 37.30%

And with this I was really excited because I’d finally beaten @aras_p‘s SMOL-V (his was taking 1837.88 kilobytes for the shaders).

The Big Plot Twist

One thing I hadn’t been keeping an eye on (showing my newbieness to all things compression) was the compression ratio when passing YARI-V into something like zstd. SMOL-V is primarily a data filtering algorithm – it runs on a SPIR-V shader to create SMOL-V such that the SMOL-V is much more easily compressible than the SPIR-V was. My mistake was I was thinking of YARI-V solely as a compression format, and not as a filtering algorithm.

When I tested running zstd at level 20 encoding on YARI-V versus SMOL-V, YARI-V was taking 440 kilobytes compressed to SMOL-V’s 348 kilobytes! Even though the encoding of YARI-V was smaller than SMOL-V’s, SMOL-V was clearly filtering the data such that it made the compressors life easier.
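
For reference, measuring the compressed size with the zstd C API at level 20 looks roughly like this – a sketch rather than the exact test harness behind these numbers:

#include <zstd.h>

#include <cstdint>
#include <cstdio>
#include <vector>

// Returns the zstd level 20 compressed size of an encoded buffer.
size_t compressedSize(const std::vector<uint8_t> &encoded) {
  std::vector<uint8_t> dst(ZSTD_compressBound(encoded.size()));
  const size_t size = ZSTD_compress(dst.data(), dst.size(), encoded.data(),
                                    encoded.size(), 20 /* compression level */);
  if (ZSTD_isError(size)) {
    std::fprintf(stderr, "zstd error: %s\n", ZSTD_getErrorName(size));
    return 0;
  }
  return size;
}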

I had to now work out how to increase repetition in my YARI-V encoding to help out compressors.

Delta Encoding More Things

I had previously only delta encoded the result <id> of my opcodes – but <id>’s are used in the body of the opcodes too. I looked at our three most used opcodes and started there.

For OpLoad and OpStore, I delta encoded the <id> that they were loading/storing from/to. This should result in a 1 byte encoding, as most OpLoad’s and OpStore’s are using the result <id> from an OpAccessChain to work out where to load from, and the access chain is usually the instruction immediately before the load or store.

1810.70 kilobytes 37.19%
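
The delta + zig-zag + varint scheme itself looks roughly like the sketch below; the helper names are mine, and taking the delta against the previously seen <id> is just one reasonable choice:

#include <cstdint>
#include <vector>

// Zig-zag maps small negative values onto small positive ones
// (0, -1, 1, -2, 2, ... become 0, 1, 2, 3, 4, ...).
inline uint32_t zigzag(int32_t v) {
  return (static_cast<uint32_t>(v) << 1) ^ static_cast<uint32_t>(v >> 31);
}

// 7 bits of payload per byte, top bit set while more bytes follow.
inline void emitVarint(uint32_t v, std::vector<uint8_t> &out) {
  while (v >= 0x80u) {
    out.push_back(static_cast<uint8_t>((v & 0x7Fu) | 0x80u));
    v >>= 7;
  }
  out.push_back(static_cast<uint8_t>(v));
}

// e.g. encoding the pointer <id> of an OpLoad against the result <id> of
// the OpAccessChain immediately before it usually yields a tiny delta.
inline void emitDeltaId(uint32_t id, uint32_t previousId,
                        std::vector<uint8_t> &out) {
  emitVarint(zigzag(static_cast<int32_t>(id) - static_cast<int32_t>(previousId)),
             out);
}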

Unvarinting Things

I was being a little over-zealous with my use of varint encoding everywhere. For example, if we were declaring an OpConstant that was a floating point value, then as long as the constant was not denormal it would always take 5 bytes to encode, instead of the 4 bytes it would have taken had we not used varint encoding. So I added some logic to detect constants that had any bits set that would result in a 4 byte or larger varint encoding, and just memcpy’d these into the YARI-V encoding.

It turns out that 47% of our constants fitted this pattern, which saved one byte per constant encoded.

1805.71 kilobytes 37.09%
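
A sketch of that check is below. How the decoder tells the two forms apart (presumably another fake opcode) is left out, and the helper names are mine:

#include <cstdint>
#include <cstring>
#include <vector>

// Same varint helper as in the earlier delta encoding sketch.
inline void emitVarint(uint32_t v, std::vector<uint8_t> &out) {
  while (v >= 0x80u) {
    out.push_back(static_cast<uint8_t>((v & 0x7Fu) | 0x80u));
    v >>= 7;
  }
  out.push_back(static_cast<uint8_t>(v));
}

// If the 32-bit OpConstant payload would need a 4 or 5 byte varint
// (any bit at or above bit 21 set), just memcpy the raw 4 bytes instead.
inline void emitConstantPayload(uint32_t value, std::vector<uint8_t> &out) {
  if (value >= (1u << 21)) {
    uint8_t raw[4];
    std::memcpy(raw, &value, sizeof(raw)); // raw little-endian word
    out.insert(out.end(), raw, raw + 4);
  } else {
    emitVarint(value, out); // 1 to 3 bytes
  }
}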

The Case of the Mistaken Delta Encoded Types

I still wasn’t anywhere near where I needed to be when compressed, and I was having trouble working out why I was still so far away. So I did some analysis of the number of bytes taken up by delta encoding both our <id>’s and our types. It turned out that our types were being delta encoded to 1 or 2 bytes, which seemed reasonable at first glance. But looking more closely, I realised that, since types are declared at the beginning of the SPIR-V shader modules, they mostly had low <id>’s assigned to them – in most cases the type <id>’s were less than 127, small enough to fit in a single varint byte. This meant that instead of delta encoding them, which was giving us roughly a 50/50 split between 1 and 2 byte encodings, just using varint encoding without the delta step got nearly 95% of types encoded in 1 byte.

1675.96 kilobytes 34.42%

Shaving a whopping 130 kilobytes off of our resultant YARI-V encoding!

I then noticed that SMOL-V had an option to strip the non-essential debug instructions from the SPIR-V shader modules. This involves removing instructions like OpName, OpMemberName, OpLine, etc. that aren’t required to be present for the SPIR-V to function correctly.

I added in my own option on encoding to handle stripping of the debug instructions, which resulted in:

1498.67 kilobytes 30.78%

Which is a further 176 kilobytes smaller than our non-stripped YARI-V encoding, and 130 kilobytes smaller than SMOL-V’s equivalent stripped encoding.

At this stage I thought I was golden – I’d cracked the puzzle, and surely my YARI-V encoding would beat SMOL-V once compressed too? Oh how I was wrong!

Results

Approach Size (kilobytes) Compression (%)
SPIR-V 4868.468750 100.000000%
SPIR-V + zstd20 590.573242 12.130575%
SMOL-V 1837.881836 37.750717%
SMOL-V + zstd20 386.879883 7.946644%
YARI-V 1675.956055 34.424706%
YARI-V + zstd20 390.577148 8.022587%
SMOL-V(stripped) 1629.115234 33.462580%
SMOL-V(stripped) + zstd20 348.073242 7.149542%
YARI-V(stripped) 1498.666016 30.783108%
YARI-V(stripped) + zstd20 364.057617 7.477867%

Getting down to brass tacks: SMOL-V, with stripping, and then fed through zstd at level 20, is 16 kilobytes smaller than the equivalent YARI-V, stripped and fed through zstd at level 20 – even though YARI-V is 130 kilobytes smaller than SMOL-V when comparing the two encodings directly.

My main suspicion is that my approach of creating new fake opcodes, and thus allowing me to avoid outputting the word count, is probably wrong. Only 9.74% of my opcodes required a word count in the end – but across the 286932 opcodes used in the input SPIR-V shaders this means that 27956 opcodes were using the more expensive approach to encoding our word count.

My other main observation is that SMOL-V seems to create a binary stream that zstd can more easily work out how to compress – more byte sequences must repeat in SMOL-V than in YARI-V. I think my approach of simply trying to squeeze the input SPIR-V into as concise a form as I could meant I lost sight of the bigger picture: YARI-V should have been more of a filtering step on the SPIR-V than a compression algorithm in its own right.

Future Work

One thing I’d like to look at is if I could remap the <id>’s in the SPIR-V (guarded by an option) such that we could increase the delta encoding success rate. At present our delta encoded IDs take up:

bytes percentage of opcodes
1 58.180900%
2 27.674441%
3 14.144659%
4 0.000000%
5 0.000000%

At present 58% of the times we delta encode we get a value that will fit within 1 byte of our zig-zagged varint encoding. 27% fits within 2 bytes, and then 14% in 3 bytes. I think the best place to start would be to try and decrease the number of times a 3 byte encoding was required, and try to map <id>’s for locality.

A great example of where this would be useful is with OpConstant’s. OpConstant’s are declared early in the SPIR-V shader module and are therefore generally given a low <id>. But they tend to be used in the body of the functions, which occurs much later on. If an OpConstant was used by an OpFMul, it would be awesome if we could make the <id> of the OpConstant close to the <id> of OpFMul to increase our chances of a 1 byte delta encoding.

Getting YARI-V

YARI-V is available on GitHub, licensed under the unlicense. I hope the code is useful to someone – even though I fell short of my aim, I very much enjoyed the journey of trying.

25 Sep

spirv-stats update – exposing more information

In a previous post I introduced a little command line tool spirv-stats I’ve been working on. Since doing the initial version, I’ve extended the information the tool will give you based on some queries I had on the SPIR-V binaries we were using – in the GitHub pull request here.

For the most commonly used opcodes, I’ve tried to break them down to understand a little more about their shape.

OpLoad & OpStore

OpLoad and OpStore have an optional additional memory access operand. So I wondered, given the SPIR-V shaders we have as input, how many of the OpLoad’s and OpStore’s actually have the optional memory access literal? It turns out none of them do!

OpDecorate & OpMemberDecorate

For OpDecorate the first thing I wanted to know was how many of the decorations used had any additional literals. It turns out that 70% of OpDecorate’s have no additional literal, and the remaining 30% have one additional literal. The next query I had was what kinds of decorations were mostly used in the SPIR-V shaders? The most used was the RelaxedPrecision decoration, accounting for 66% of the uses of OpDecorate, with the Location decoration next at 11%. I then extended these checks over to OpMemberDecorate, and it turns out that 90% of decorations on OpMemberDecorate have one literal! The reason is that a cool 84% of the decorations used on OpMemberDecorate are encoding the Offset of struct members.
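
The query itself boils down to looking at the word count and the decoration operand. A minimal sketch (not the actual spirv-stats source) for OpDecorate – OpMemberDecorate is the same idea with everything shifted along by one word:

#include <cstdint>
#include <map>

struct DecorateStats {
  std::map<uint32_t, uint32_t> literalCounts; // extra literals -> hits
  std::map<uint32_t, uint32_t> decorations;   // decoration enum -> hits
};

// words points at an OpDecorate instruction:
// [0] word count/opcode, [1] target <id>, [2] decoration, [3..] literals.
// Decoration 0 is RelaxedPrecision, 30 is Location, 35 is Offset.
inline void tallyDecorate(const uint32_t *words, DecorateStats &stats) {
  const uint32_t wordCount = words[0] >> 16;
  stats.literalCounts[wordCount - 3]++; // OpDecorate is 3 words + literals
  stats.decorations[words[2]]++;
}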

OpAccessChain

OpAccessChain can have an arbitrarily long set of IDs used to index into the pointer object. So I wondered how many of these were using a small number of indices? It turns out that 78% of the uses of OpAccessChain had only one index, 19% have two indices, and a mere 2% have three indices.

OpVariable

I wondered how many of the variables used in our SPIR-V shaders had initializers (an initial value). None of them do! Of all 19041 uses of OpVariable, not one had an initializer.

OpConstant

Of the constants used in the SPIR-V shaders, all of them use exactly one literal. This is unsurprising because int64/double types are not widely supported or used in shaders, but I wanted to be sure.

OpVectorShuffle

The first thing I wanted to know about OpVectorShuffle was how many literals were being used when shuffling the vectors – remember that the number of literals corresponds to the width of the output vector. It turns out that 31% of shuffles have two literals (a common case when extracting something like a vec2 of texture coordinates from a vec4 for an image sample), 45% of shuffles have three literals, and 23% have four literals. The next question was to do with the undef literal that can be used in a shuffle: 0xFFFFFFFFu (-1 in signed) can be used to signify that that element of the resulting vector is undefined. I wondered if the SPIR-V shaders we had were using this? It turns out none of them are (currently at least).

The next question I had was how many shuffles were using literals lower than 4, and lower than 8 – these two buckets correspond to shuffling a single vec4, or shuffling two vec4’s. 82% of the shuffles are using literals lower than 4, so these could either be shuffling two vec2’s together, or shuffling one vec4. That leads to the final question: how many OpVectorShuffle’s are using the same vector <id> for both vector operands? This pattern is used when you actually only want to shuffle elements from the one vector. Well, it turns out exactly 82% of shuffles were using the same vector for both operands!

OpCompositeExtract & OpCompositeConstruct

The last two opcodes that I have looked at so far are OpCompositeExtract and OpCompositeConstruct. For both I wondered what the common numbers of literals being used were. For extract, 97% were using exactly one literal, and 3% were using two literals. For construct, 17% used one literal, 41% used two literals, and 41% used three literals. Also, for extract, I wondered how many of the extracts were being used to pull a single element out of a vector, so I checked how many times the literal was zero to three. Roughly 25–26% each were accessing the zeroth, first or second element, and 20% the third.

Sample Output

Below is a sample output run over the shaders that smol-v uses for testing. The changes in the latest version from the previous are highlighted in red.

Totals: 286932 hits 4985312 bytes
                        OpLoad[ 61] =  49915 hits (17.40%) 798640 bytes (16.02%)
                                    0 hits with memory access  0.00%
                       OpStore[ 62] =  29233 hits (10.19%) 350796 bytes ( 7.04%)
                                    0 hits with memory access  0.00%
                    OpDecorate[ 71] =  23770 hits ( 8.28%) 312964 bytes ( 6.28%)
                                16839 hits with no literals   70.84%
                                 6931 hits with 1 literal     29.16%
                                    0 hits with 2+ literals    0.00%
                                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                15742 hits of decoration  0   66.23%
                                 1036 hits of decoration  2    4.36%
                                    1 hits of decoration  3    0.00%
                                  971 hits of decoration  6    4.08%
                                  206 hits of decoration 11    0.87%
                                    6 hits of decoration 14    0.03%
                                   17 hits of decoration 15    0.07%
                                   37 hits of decoration 18    0.16%
                                 2648 hits of decoration 30   11.14%
                                 1517 hits of decoration 33    6.38%
                                 1589 hits of decoration 34    6.68%
                 OpAccessChain[ 65] =  20116 hits ( 7.01%) 421496 bytes ( 8.45%)
                                    0 hits with 0 indices      0.00%
                                15768 hits with 1 index       78.39%
                                 3904 hits with 2 indices     19.41%
                                  442 hits with 3 indices      2.20%
                                    2 hits with 4 indices      0.01%
                                    0 hits with 5+ indices     0.00%
                    OpVariable[ 59] =  19041 hits ( 6.64%) 304656 bytes ( 6.11%)
                                    0 hits with initializer    0.00%
              OpMemberDecorate[ 72] =  14332 hits ( 4.99%) 280916 bytes ( 5.63%)
                                 1431 hits with no literals    9.98%
                                12901 hits with 1 literal     90.02%
                                    0 hits with 2+ literals    0.00%
                                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                  850 hits of decoration  0    5.93%
                                  579 hits of decoration  4    4.04%
                                  579 hits of decoration  7    4.04%
                                  180 hits of decoration 11    1.26%
                                    2 hits of decoration 24    0.01%
                                12142 hits of decoration 35   84.72%
                    OpConstant[ 43] =  10823 hits ( 3.77%) 173168 bytes ( 3.47%)
                                10823 hits have 1 literal    100.00%
                                    0 hits have 2+ literals    0.00%
                       OpLabel[248] =   9915 hits ( 3.46%)  79320 bytes ( 1.59%)
               OpVectorShuffle[ 79] =   9732 hits ( 3.39%) 308372 bytes ( 6.19%)
                                    0 hits with 0 literals     0.00%
                                    0 hits with 1 literal      0.00%
                                 3045 hits with 2 literals    31.29%
                                 4405 hits with 3 literals    45.26%
                                 2282 hits with 4 literals    23.45%
                                    0 hits with 5+ literals    0.00%
                                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                    0 hits with undef literal  0.00%
                                 7980 hits with literals < 4  82.00%
                                 1752 hits with literals < 8  18.00%
                                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                 7980 hits with same vector   82.00%
            OpCompositeExtract[ 81] =   9595 hits ( 3.34%) 193220 bytes ( 3.88%)
                                    0 hits with 0 literals     0.00%
                                 9265 hits with 1 literal     96.56%
                                  330 hits with 2 literals     3.44%
                                    0 hits with 3+ literals   0.00%
                                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                 2547 hits with literal =  0  26.55%
                                 2414 hits with literal =  1  25.16%
                                 2398 hits with literal =  2  24.99%
                                 1906 hits with literal =  3  19.86%
                                    0 hits with literal =  4   0.00%
                        OpName[  5] =   9233 hits ( 3.22%) 164092 bytes ( 3.29%)
                        OpFMul[133] =   8532 hits ( 2.97%) 170640 bytes ( 3.42%)
          OpCompositeConstruct[ 80] =   6678 hits ( 2.33%) 166680 bytes ( 3.34%)
                                    0 hits with 0 literals     0.00%
                                 1161 hits with 1 literal     17.39%
                                 2754 hits with 2 literals    41.24%
                                 2763 hits with 3 literals    41.37%
                                    0 hits with 4 literals     0.00%
                                    0 hits with 5+ literals    0.00%
                        OpFAdd[129] =   5922 hits ( 2.06%) 118440 bytes ( 2.38%)
                 OpTypePointer[ 32] =   5486 hits ( 1.91%)  87776 bytes ( 1.76%)
                     OpExtInst[ 12] =   5257 hits ( 1.83%) 145980 bytes ( 2.93%)
                      OpBranch[249] =   5229 hits ( 1.82%)  41832 bytes ( 0.84%)
           OpBranchConditional[250] =   3193 hits ( 1.11%)  51088 bytes ( 1.02%)
              OpSelectionMerge[247] =   3109 hits ( 1.08%)  37308 bytes ( 0.75%)
                        OpFSub[131] =   2668 hits ( 0.93%)  53360 bytes ( 1.07%)
                  OpMemberName[  6] =   2507 hits ( 0.87%)  78044 bytes ( 1.57%)
                OpFunctionCall[ 57] =   2198 hits ( 0.77%)  58520 bytes ( 1.17%)
           OpConstantComposite[ 44] =   2155 hits ( 0.75%)  50928 bytes ( 1.02%)
           OpFunctionParameter[ 55] =   2117 hits ( 0.74%)  25404 bytes ( 0.51%)
                         OpDot[148] =   1911 hits ( 0.67%)  38220 bytes ( 0.77%)
           OpVectorTimesScalar[142] =   1488 hits ( 0.52%)  29760 bytes ( 0.60%)
                    OpFunction[ 54] =   1398 hits ( 0.49%)  27960 bytes ( 0.56%)
                 OpFunctionEnd[ 56] =   1398 hits ( 0.49%)   5592 bytes ( 0.11%)
                OpTypeFunction[ 33] =   1175 hits ( 0.41%)  21076 bytes ( 0.42%)
                  OpTypeVector[ 23] =   1110 hits ( 0.39%)  17760 bytes ( 0.36%)
                  OpTypeStruct[ 30] =   1065 hits ( 0.37%)  58496 bytes ( 1.17%)
                   OpTypeArray[ 28] =   1038 hits ( 0.36%)  16608 bytes ( 0.33%)
                     OpFNegate[127] =   1038 hits ( 0.36%)  16608 bytes ( 0.33%)
      OpImageSampleImplicitLod[ 87] =    969 hits ( 0.34%)  19660 bytes ( 0.39%)
                        OpFDiv[136] =    961 hits ( 0.33%)  19220 bytes ( 0.39%)
                 OpReturnValue[254] =    928 hits ( 0.32%)   7424 bytes ( 0.15%)
                   OpFOrdEqual[180] =    722 hits ( 0.25%)  14440 bytes ( 0.29%)
                     OpTypeInt[ 21] =    661 hits ( 0.23%)  10576 bytes ( 0.21%)
           OpFOrdLessThanEqual[188] =    595 hits ( 0.21%)  11900 bytes ( 0.24%)
                OpFOrdLessThan[184] =    588 hits ( 0.20%)  11760 bytes ( 0.24%)
                      OpIEqual[170] =    586 hits ( 0.20%)  11720 bytes ( 0.24%)
                      OpReturn[253] =    525 hits ( 0.18%)   2100 bytes ( 0.04%)
                        OpIAdd[128] =    465 hits ( 0.16%)   9300 bytes ( 0.19%)
                   OpTypeImage[ 25] =    437 hits ( 0.15%)  15732 bytes ( 0.32%)
            OpTypeSampledImage[ 27] =    437 hits ( 0.15%)   5244 bytes ( 0.11%)
        OpFOrdGreaterThanEqual[190] =    412 hits ( 0.14%)   8240 bytes ( 0.17%)
             OpFOrdGreaterThan[186] =    391 hits ( 0.14%)   7820 bytes ( 0.16%)
      OpImageSampleExplicitLod[ 88] =    376 hits ( 0.13%)  11128 bytes ( 0.22%)
                  OpCapability[ 17] =    372 hits ( 0.13%)   2976 bytes ( 0.06%)
                 OpMemoryModel[ 14] =    341 hits ( 0.12%)   4092 bytes ( 0.08%)
                  OpEntryPoint[ 15] =    341 hits ( 0.12%)  17808 bytes ( 0.36%)
                    OpTypeVoid[ 19] =    341 hits ( 0.12%)   2728 bytes ( 0.05%)
               OpExtInstImport[ 11] =    341 hits ( 0.12%)   8184 bytes ( 0.16%)
                   OpTypeFloat[ 22] =    341 hits ( 0.12%)   4092 bytes ( 0.08%)
                  OpLogicalAnd[167] =    331 hits ( 0.12%)   6620 bytes ( 0.13%)
                         OpPhi[245] =    281 hits ( 0.10%)   7868 bytes ( 0.16%)
                  OpTypeMatrix[ 24] =    255 hits ( 0.09%)   4080 bytes ( 0.08%)
               OpExecutionMode[ 16] =    235 hits ( 0.08%)   2852 bytes ( 0.06%)
  OpImageSampleDrefExplicitLod[ 90] =    226 hits ( 0.08%)   7232 bytes ( 0.15%)
                    OpTypeBool[ 20] =    212 hits ( 0.07%)   1696 bytes ( 0.03%)
           OpVectorTimesMatrix[144] =    194 hits ( 0.07%)   3880 bytes ( 0.08%)
             OpSourceExtension[  4] =    167 hits ( 0.06%)   5732 bytes ( 0.11%)
                  OpLogicalNot[168] =    160 hits ( 0.06%)   2560 bytes ( 0.05%)
                      OpSource[  3] =    141 hits ( 0.05%)   1692 bytes ( 0.03%)
                        OpIMul[132] =    135 hits ( 0.05%)   2700 bytes ( 0.05%)
                 OpConvertSToF[111] =    116 hits ( 0.04%)   1856 bytes ( 0.04%)
                        OpFMod[141] =    114 hits ( 0.04%)   2280 bytes ( 0.05%)
                   OpLogicalOr[166] =     93 hits ( 0.03%)   1860 bytes ( 0.04%)
                   OpSLessThan[177] =     92 hits ( 0.03%)   1840 bytes ( 0.04%)
                   OpLoopMerge[246] =     84 hits ( 0.03%)   1344 bytes ( 0.03%)
                      OpSelect[169] =     68 hits ( 0.02%)   1632 bytes ( 0.03%)
                 OpConvertFToS[110] =     67 hits ( 0.02%)   1072 bytes ( 0.02%)
                     OpBitcast[124] =     66 hits ( 0.02%)   1056 bytes ( 0.02%)
           OpMatrixTimesVector[145] =     48 hits ( 0.02%)    960 bytes ( 0.02%)
            OpShiftLeftLogical[196] =     46 hits ( 0.02%)    920 bytes ( 0.02%)
                OpFOrdNotEqual[182] =     43 hits ( 0.01%)    860 bytes ( 0.02%)
                        OpKill[252] =     40 hits ( 0.01%)    160 bytes ( 0.00%)
           OpSGreaterThanEqual[175] =     39 hits ( 0.01%)    780 bytes ( 0.02%)
           OpMatrixTimesScalar[143] =     32 hits ( 0.01%)    640 bytes ( 0.01%)
              OpSLessThanEqual[179] =     21 hits ( 0.01%)    420 bytes ( 0.01%)
                        OpISub[130] =     21 hits ( 0.01%)    420 bytes ( 0.01%)
           OpMatrixTimesMatrix[146] =     15 hits ( 0.01%)    300 bytes ( 0.01%)
                        OpSDiv[135] =     12 hits ( 0.00%)    240 bytes ( 0.00%)
                      OpFwidth[209] =      8 hits ( 0.00%)    128 bytes ( 0.00%)
                 OpConvertUToF[112] =      6 hits ( 0.00%)     96 bytes ( 0.00%)
                   OpTranspose[ 84] =      6 hits ( 0.00%)     96 bytes ( 0.00%)
                  OpEmitVertex[218] =      5 hits ( 0.00%)     20 bytes ( 0.00%)
                   OpINotEqual[171] =      5 hits ( 0.00%)    100 bytes ( 0.00%)
               OpConstantFalse[ 42] =      5 hits ( 0.00%)     60 bytes ( 0.00%)
                OpConstantTrue[ 41] =      5 hits ( 0.00%)     60 bytes ( 0.00%)
                         OpAny[154] =      4 hits ( 0.00%)     64 bytes ( 0.00%)
              OpControlBarrier[224] =      4 hits ( 0.00%)     64 bytes ( 0.00%)
             OpLogicalNotEqual[165] =      3 hits ( 0.00%)     60 bytes ( 0.00%)
                     OpSNegate[126] =      3 hits ( 0.00%)     48 bytes ( 0.00%)
        OpVectorExtractDynamic[ 77] =      3 hits ( 0.00%)     60 bytes ( 0.00%)
                OpEndPrimitive[219] =      2 hits ( 0.00%)      8 bytes ( 0.00%)
                        OpDPdy[208] =      2 hits ( 0.00%)     32 bytes ( 0.00%)
                        OpDPdx[207] =      2 hits ( 0.00%)     32 bytes ( 0.00%)
                   OpULessThan[176] =      2 hits ( 0.00%)     40 bytes ( 0.00%)
                        OpUMod[137] =      2 hits ( 0.00%)     40 bytes ( 0.00%)
                OpSGreaterThan[173] =      1 hits ( 0.00%)     20 bytes ( 0.00%)
             OpCompositeInsert[ 82] =      1 hits ( 0.00%)     24 bytes ( 0.00%)
            OpTypeRuntimeArray[ 29] =      1 hits ( 0.00%)     12 bytes ( 0.00%)
                       OpUndef[  1] =      1 hits ( 0.00%)     12 bytes ( 0.00%)
21 Sep

spirv-stats – a tool to output statistics of your SPIR-V shader modules

I’ve just released a small, single C++ file tool called spirv-stats. It will take one or more SPIR-V input files, and calculate the composition of the SPIR-V shader modules like so:

Totals: 35 hits 564 bytes
               OpDecorate =      5 hits (14.29%) 80 bytes (14.18%)
            OpTypePointer =      4 hits (11.43%) 64 bytes (11.35%)
               OpVariable =      4 hits (11.43%) 64 bytes (11.35%)
                   OpLoad =      3 hits ( 8.57%) 48 bytes ( 8.51%)
             OpTypeVector =      2 hits ( 5.71%) 32 bytes ( 5.67%)
          OpExtInstImport =      1 hits ( 2.86%) 24 bytes ( 4.26%)
            OpMemoryModel =      1 hits ( 2.86%) 12 bytes ( 2.13%)
             OpEntryPoint =      1 hits ( 2.86%) 32 bytes ( 5.67%)
          OpExecutionMode =      1 hits ( 2.86%) 12 bytes ( 2.13%)
             OpCapability =      1 hits ( 2.86%)  8 bytes ( 1.42%)
               OpTypeVoid =      1 hits ( 2.86%)  8 bytes ( 1.42%)
              OpTypeFloat =      1 hits ( 2.86%) 12 bytes ( 2.13%)
              OpTypeImage =      1 hits ( 2.86%) 36 bytes ( 6.38%)
       OpTypeSampledImage =      1 hits ( 2.86%) 12 bytes ( 2.13%)
           OpTypeFunction =      1 hits ( 2.86%) 12 bytes ( 2.13%)
               OpFunction =      1 hits ( 2.86%) 20 bytes ( 3.55%)
            OpFunctionEnd =      1 hits ( 2.86%)  4 bytes ( 0.71%)
                  OpStore =      1 hits ( 2.86%) 12 bytes ( 2.13%)
 OpImageSampleImplicitLod =      1 hits ( 2.86%) 20 bytes ( 3.55%)
                   OpFMul =      1 hits ( 2.86%) 20 bytes ( 3.55%)
                  OpLabel =      1 hits ( 2.86%)  8 bytes ( 1.42%)
                 OpReturn =      1 hits ( 2.86%)  4 bytes ( 0.71%)

It first outputs the total number of hits in the SPIR-V shader module(s) – this is the total number of opcodes found within them. Then it outputs their total byte size, followed by a sorted per-opcode breakdown, with the most hit opcodes coming first.
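
For the curious, the walk a tool like this performs is straightforward; the sketch below isn’t the actual spirv-stats source, just the shape of it. A SPIR-V module is a stream of 32-bit words with a 5-word header, and each instruction’s first word packs the word count in its high 16 bits and the opcode in its low 16 bits.

#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

struct OpcodeStats {
  uint32_t hits = 0;
  uint32_t bytes = 0;
};

// Tally hits and bytes per opcode for one module's worth of words.
std::map<uint16_t, OpcodeStats> gatherStats(const std::vector<uint32_t> &words) {
  std::map<uint16_t, OpcodeStats> stats;
  // words[0..4] are the magic number, version, generator, bound and schema.
  for (size_t i = 5; i < words.size();) {
    const uint16_t opcode = static_cast<uint16_t>(words[i] & 0xFFFFu);
    const uint16_t wordCount = static_cast<uint16_t>(words[i] >> 16);
    if (wordCount == 0) break; // malformed module, bail out
    stats[opcode].hits += 1;
    stats[opcode].bytes += wordCount * 4u; // 4 bytes per word
    i += wordCount;
  }
  return stats;
}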

I decided to run this across the various folders of SPIR-V that @aras_p is using in his smol-v tool, with the following results.

dota2

Totals: 57090 hits 1099348 bytes
              OpMemberDecorate =  10907 hits (19.10%) 215824 bytes (19.63%)
                        OpLoad =   9140 hits (16.01%) 146240 bytes (13.30%)
                 OpAccessChain =   6034 hits (10.57%) 127080 bytes (11.56%)
                    OpVariable =   3273 hits ( 5.73%)  52368 bytes ( 4.76%)
                    OpDecorate =   2923 hits ( 5.12%)  44556 bytes ( 4.05%)
                       OpStore =   2701 hits ( 4.73%)  32412 bytes ( 2.95%)
                    OpConstant =   2403 hits ( 4.21%)  38448 bytes ( 3.50%)
                        OpFMul =   2398 hits ( 4.20%)  47960 bytes ( 4.36%)
          OpCompositeConstruct =   2053 hits ( 3.60%)  49788 bytes ( 4.53%)
                 OpTypePointer =   1952 hits ( 3.42%)  31232 bytes ( 2.84%)
                        OpFAdd =   1669 hits ( 2.92%)  33380 bytes ( 3.04%)
               OpVectorShuffle =   1371 hits ( 2.40%)  41800 bytes ( 3.80%)
                     OpExtInst =   1259 hits ( 2.21%)  35060 bytes ( 3.19%)
                        OpFSub =    925 hits ( 1.62%)  18500 bytes ( 1.68%)
                         OpDot =    894 hits ( 1.57%)  17880 bytes ( 1.63%)
           OpConstantComposite =    736 hits ( 1.29%)  18096 bytes ( 1.65%)
                  OpTypeStruct =    573 hits ( 1.00%)  44184 bytes ( 4.02%)
                       OpLabel =    542 hits ( 0.95%)   4336 bytes ( 0.39%)
                  OpTypeVector =    406 hits ( 0.71%)   6496 bytes ( 0.59%)
                   OpTypeArray =    403 hits ( 0.71%)   6448 bytes ( 0.59%)
      OpImageSampleImplicitLod =    316 hits ( 0.55%)   6320 bytes ( 0.57%)
      OpImageSampleExplicitLod =    300 hits ( 0.53%)   9000 bytes ( 0.82%)
            OpCompositeExtract =    257 hits ( 0.45%)   5140 bytes ( 0.47%)
                      OpBranch =    249 hits ( 0.44%)   1992 bytes ( 0.18%)
            OpTypeSampledImage =    225 hits ( 0.39%)   2700 bytes ( 0.25%)
                   OpTypeImage =    225 hits ( 0.39%)   8100 bytes ( 0.74%)
                        OpFDiv =    212 hits ( 0.37%)   4240 bytes ( 0.39%)
                     OpTypeInt =    210 hits ( 0.37%)   3360 bytes ( 0.31%)
  OpImageSampleDrefExplicitLod =    207 hits ( 0.36%)   6624 bytes ( 0.60%)
           OpBranchConditional =    155 hits ( 0.27%)   2480 bytes ( 0.23%)
              OpSelectionMerge =    153 hits ( 0.27%)   1836 bytes ( 0.17%)
                  OpCapability =    131 hits ( 0.23%)   1048 bytes ( 0.10%)
                OpFOrdLessThan =    131 hits ( 0.23%)   2620 bytes ( 0.24%)
                   OpTypeFloat =    110 hits ( 0.19%)   1320 bytes ( 0.12%)
                    OpTypeVoid =    110 hits ( 0.19%)    880 bytes ( 0.08%)
                 OpFunctionEnd =    110 hits ( 0.19%)    440 bytes ( 0.04%)
                  OpEntryPoint =    110 hits ( 0.19%)   6240 bytes ( 0.57%)
                 OpMemoryModel =    110 hits ( 0.19%)   1320 bytes ( 0.12%)
               OpExtInstImport =    110 hits ( 0.19%)   2640 bytes ( 0.24%)
                      OpReturn =    110 hits ( 0.19%)    440 bytes ( 0.04%)
                    OpFunction =    110 hits ( 0.19%)   2200 bytes ( 0.20%)
                OpTypeFunction =    110 hits ( 0.19%)   1320 bytes ( 0.12%)
                  OpTypeMatrix =     92 hits ( 0.16%)   1472 bytes ( 0.13%)
                        OpName =     87 hits ( 0.15%)   1368 bytes ( 0.12%)
                    OpTypeBool =     67 hits ( 0.12%)    536 bytes ( 0.05%)
             OpFOrdGreaterThan =     67 hits ( 0.12%)   1340 bytes ( 0.12%)
               OpExecutionMode =     65 hits ( 0.11%)    780 bytes ( 0.07%)
                      OpSelect =     56 hits ( 0.10%)   1344 bytes ( 0.12%)
                 OpConvertSToF =     48 hits ( 0.08%)    768 bytes ( 0.07%)
                     OpBitcast =     44 hits ( 0.08%)    704 bytes ( 0.06%)
            OpShiftLeftLogical =     44 hits ( 0.08%)    880 bytes ( 0.08%)
                        OpIAdd =     44 hits ( 0.08%)    880 bytes ( 0.08%)
                  OpMemberName =     34 hits ( 0.06%)    832 bytes ( 0.08%)
                        OpKill =     28 hits ( 0.05%)    112 bytes ( 0.01%)
                        OpFMod =     17 hits ( 0.03%)    340 bytes ( 0.03%)
                 OpConvertFToS =     16 hits ( 0.03%)    256 bytes ( 0.02%)
                   OpFOrdEqual =     11 hits ( 0.02%)    220 bytes ( 0.02%)
             OpSourceExtension =     10 hits ( 0.02%)    360 bytes ( 0.03%)
                      OpSource =     10 hits ( 0.02%)    120 bytes ( 0.01%)
           OpVectorTimesScalar =      8 hits ( 0.01%)    160 bytes ( 0.01%)
                     OpFNegate =      7 hits ( 0.01%)    112 bytes ( 0.01%)
        OpVectorExtractDynamic =      3 hits ( 0.01%)     60 bytes ( 0.01%)
                  OpLogicalNot =      2 hits ( 0.00%)     32 bytes ( 0.00%)
                 OpConvertUToF =      2 hits ( 0.00%)     32 bytes ( 0.00%)
                   OpLoopMerge =      2 hits ( 0.00%)     32 bytes ( 0.00%)
           OpFOrdLessThanEqual =      2 hits ( 0.00%)     40 bytes ( 0.00%)
                   OpLogicalOr =      1 hits ( 0.00%)     20 bytes ( 0.00%)

OpMemberDecorate dominates the dota2 shaders – nearly 20% of the module is decorating members of structs! Next we have OpLoad at 16% of the hits, but with 13% of the size of the files, followed by OpAccessChain at 10% of the hits and 12% of the size. Most of the shader module(s) are taken up with decorating struct members, and then loading and storing to various variables.

shadertoy

Totals: 75544 hits 1170540 bytes
                        OpLoad =  13486 hits (17.85%) 215776 bytes (18.43%)
                       OpStore =  10671 hits (14.13%) 128052 bytes (10.94%)
                       OpLabel =   7005 hits ( 9.27%)  56040 bytes ( 4.79%)
                        OpName =   5718 hits ( 7.57%)  93332 bytes ( 7.97%)
                    OpVariable =   5503 hits ( 7.28%)  88048 bytes ( 7.52%)
                      OpBranch =   4010 hits ( 5.31%)  32080 bytes ( 2.74%)
                    OpConstant =   3059 hits ( 4.05%)  48944 bytes ( 4.18%)
           OpBranchConditional =   2523 hits ( 3.34%)  40368 bytes ( 3.45%)
              OpSelectionMerge =   2469 hits ( 3.27%)  29628 bytes ( 2.53%)
                 OpAccessChain =   2171 hits ( 2.87%)  43696 bytes ( 3.73%)
                     OpExtInst =   1736 hits ( 2.30%)  47980 bytes ( 4.10%)
                        OpFMul =   1495 hits ( 1.98%)  29900 bytes ( 2.55%)
                OpFunctionCall =   1308 hits ( 1.73%)  38372 bytes ( 3.28%)
                        OpFAdd =   1230 hits ( 1.63%)  24600 bytes ( 2.10%)
                        OpFSub =   1027 hits ( 1.36%)  20540 bytes ( 1.75%)
           OpFunctionParameter =    960 hits ( 1.27%)  11520 bytes ( 0.98%)
           OpConstantComposite =    918 hits ( 1.22%)  20852 bytes ( 1.78%)
          OpCompositeConstruct =    869 hits ( 1.15%)  20120 bytes ( 1.72%)
                   OpFOrdEqual =    702 hits ( 0.93%)  14040 bytes ( 1.20%)
           OpVectorTimesScalar =    686 hits ( 0.91%)  13720 bytes ( 1.17%)
               OpVectorShuffle =    593 hits ( 0.78%)  17828 bytes ( 1.52%)
           OpFOrdLessThanEqual =    591 hits ( 0.78%)  11820 bytes ( 1.01%)
                      OpIEqual =    582 hits ( 0.77%)  11640 bytes ( 0.99%)
            OpCompositeExtract =    427 hits ( 0.57%)   8540 bytes ( 0.73%)
                    OpFunction =    415 hits ( 0.55%)   8300 bytes ( 0.71%)
                 OpFunctionEnd =    415 hits ( 0.55%)   1660 bytes ( 0.14%)
                        OpFDiv =    380 hits ( 0.50%)   7600 bytes ( 0.65%)
                 OpReturnValue =    342 hits ( 0.45%)   2736 bytes ( 0.23%)
                  OpLogicalAnd =    330 hits ( 0.44%)   6600 bytes ( 0.56%)
        OpFOrdGreaterThanEqual =    325 hits ( 0.43%)   6500 bytes ( 0.56%)
                OpFOrdLessThan =    289 hits ( 0.38%)   5780 bytes ( 0.49%)
                         OpPhi =    278 hits ( 0.37%)   7784 bytes ( 0.66%)
                OpTypeFunction =    268 hits ( 0.35%)   5808 bytes ( 0.50%)
             OpFOrdGreaterThan =    246 hits ( 0.33%)   4920 bytes ( 0.42%)
                        OpIAdd =    244 hits ( 0.32%)   4880 bytes ( 0.42%)
                 OpTypePointer =    244 hits ( 0.32%)   3904 bytes ( 0.33%)
              OpMemberDecorate =    180 hits ( 0.24%)   3600 bytes ( 0.31%)
                  OpMemberName =    180 hits ( 0.24%)   4320 bytes ( 0.37%)
                    OpDecorate =    162 hits ( 0.21%)   2520 bytes ( 0.22%)
                      OpReturn =    127 hits ( 0.17%)    508 bytes ( 0.04%)
                  OpLogicalNot =    122 hits ( 0.16%)   1952 bytes ( 0.17%)
                         OpDot =    103 hits ( 0.14%)   2060 bytes ( 0.18%)
                        OpFMod =     97 hits ( 0.13%)   1940 bytes ( 0.17%)
      OpImageSampleImplicitLod =     96 hits ( 0.13%)   2200 bytes ( 0.19%)
                   OpLogicalOr =     88 hits ( 0.12%)   1760 bytes ( 0.15%)
                     OpFNegate =     83 hits ( 0.11%)   1328 bytes ( 0.11%)
                   OpSLessThan =     74 hits ( 0.10%)   1480 bytes ( 0.13%)
                 OpConvertSToF =     67 hits ( 0.09%)   1072 bytes ( 0.09%)
                  OpTypeVector =     58 hits ( 0.08%)    928 bytes ( 0.08%)
                   OpLoopMerge =     54 hits ( 0.07%)    864 bytes ( 0.07%)
                 OpConvertFToS =     43 hits ( 0.06%)    688 bytes ( 0.06%)
           OpSGreaterThanEqual =     39 hits ( 0.05%)    780 bytes ( 0.07%)
                   OpTypeArray =     38 hits ( 0.05%)    608 bytes ( 0.05%)
                     OpTypeInt =     36 hits ( 0.05%)    576 bytes ( 0.05%)
           OpMatrixTimesVector =     32 hits ( 0.04%)    640 bytes ( 0.05%)
              OpSLessThanEqual =     21 hits ( 0.03%)    420 bytes ( 0.04%)
                        OpISub =     21 hits ( 0.03%)    420 bytes ( 0.04%)
            OpTypeSampledImage =     19 hits ( 0.03%)    228 bytes ( 0.02%)
                   OpTypeImage =     19 hits ( 0.03%)    684 bytes ( 0.06%)
                    OpTypeBool =     18 hits ( 0.02%)    144 bytes ( 0.01%)
                   OpTypeFloat =     18 hits ( 0.02%)    216 bytes ( 0.02%)
                  OpTypeStruct =     18 hits ( 0.02%)    864 bytes ( 0.07%)
                    OpTypeVoid =     18 hits ( 0.02%)    144 bytes ( 0.01%)
                      OpSource =     18 hits ( 0.02%)    216 bytes ( 0.02%)
               OpExtInstImport =     18 hits ( 0.02%)    432 bytes ( 0.04%)
                 OpMemoryModel =     18 hits ( 0.02%)    216 bytes ( 0.02%)
                  OpEntryPoint =     18 hits ( 0.02%)    504 bytes ( 0.04%)
               OpExecutionMode =     18 hits ( 0.02%)    216 bytes ( 0.02%)
                  OpCapability =     18 hits ( 0.02%)    144 bytes ( 0.01%)
                OpFOrdNotEqual =     16 hits ( 0.02%)    320 bytes ( 0.03%)
                  OpTypeMatrix =     13 hits ( 0.02%)    208 bytes ( 0.02%)
                        OpIMul =     13 hits ( 0.02%)    260 bytes ( 0.02%)
                        OpSDiv =     12 hits ( 0.02%)    240 bytes ( 0.02%)
                      OpSelect =      7 hits ( 0.01%)    168 bytes ( 0.01%)
                OpConstantTrue =      5 hits ( 0.01%)     60 bytes ( 0.01%)
               OpConstantFalse =      5 hits ( 0.01%)     60 bytes ( 0.01%)
                         OpAny =      4 hits ( 0.01%)     64 bytes ( 0.01%)
           OpVectorTimesMatrix =      3 hits ( 0.00%)     60 bytes ( 0.01%)
                        OpKill =      3 hits ( 0.00%)     12 bytes ( 0.00%)
                        OpDPdx =      2 hits ( 0.00%)     32 bytes ( 0.00%)
                        OpDPdy =      2 hits ( 0.00%)     32 bytes ( 0.00%)
                     OpSNegate =      2 hits ( 0.00%)     32 bytes ( 0.00%)
             OpLogicalNotEqual =      1 hits ( 0.00%)     20 bytes ( 0.00%)
                       OpUndef =      1 hits ( 0.00%)     12 bytes ( 0.00%)
                OpSGreaterThan =      1 hits ( 0.00%)     20 bytes ( 0.00%)

The shadertoy folder is dominated by loads and stores. Then curiously OpLabel – this indicates that there is a heavy amount of branching/looping occurring in the source shaders, as an OpLabel signifies a new basic block has been declared. OpBranch is the sixth most used opcode, which also backs up the view that these shaders make heavy use of branching/looping.

talos

Totals: 76515 hits 1369056 bytes
                        OpLoad =  14158 hits (18.50%) 226528 bytes (16.55%)
                       OpStore =   8912 hits (11.65%) 106944 bytes ( 7.81%)
            OpCompositeExtract =   8642 hits (11.29%) 174160 bytes (12.72%)
                    OpVariable =   7203 hits ( 9.41%) 115248 bytes ( 8.42%)
                 OpAccessChain =   6349 hits ( 8.30%) 135404 bytes ( 9.89%)
               OpVectorShuffle =   3885 hits ( 5.08%) 122764 bytes ( 8.97%)
                    OpConstant =   3577 hits ( 4.67%)  57232 bytes ( 4.18%)
          OpCompositeConstruct =   3158 hits ( 4.13%)  82616 bytes ( 6.03%)
                    OpDecorate =   1921 hits ( 2.51%)  30188 bytes ( 2.21%)
                       OpLabel =   1642 hits ( 2.15%)  13136 bytes ( 0.96%)
                        OpFMul =   1622 hits ( 2.12%)  32440 bytes ( 2.37%)
                 OpTypePointer =   1548 hits ( 2.02%)  24768 bytes ( 1.81%)
                     OpExtInst =   1346 hits ( 1.76%)  38388 bytes ( 2.80%)
           OpFunctionParameter =   1142 hits ( 1.49%)  13704 bytes ( 1.00%)
                        OpFAdd =   1046 hits ( 1.37%)  20920 bytes ( 1.53%)
                OpFunctionCall =    863 hits ( 1.13%)  19616 bytes ( 1.43%)
           OpVectorTimesScalar =    794 hits ( 1.04%)  15880 bytes ( 1.16%)
                    OpFunction =    743 hits ( 0.97%)  14860 bytes ( 1.09%)
                 OpFunctionEnd =    743 hits ( 0.97%)   2972 bytes ( 0.22%)
                        OpFSub =    716 hits ( 0.94%)  14320 bytes ( 1.05%)
                OpTypeFunction =    680 hits ( 0.89%)  12528 bytes ( 0.92%)
                      OpBranch =    589 hits ( 0.77%)   4712 bytes ( 0.34%)
                 OpReturnValue =    586 hits ( 0.77%)   4688 bytes ( 0.34%)
      OpImageSampleImplicitLod =    316 hits ( 0.41%)   6320 bytes ( 0.46%)
           OpBranchConditional =    305 hits ( 0.40%)   4880 bytes ( 0.36%)
                  OpTypeVector =    299 hits ( 0.39%)   4784 bytes ( 0.35%)
              OpSelectionMerge =    281 hits ( 0.37%)   3372 bytes ( 0.25%)
           OpConstantComposite =    269 hits ( 0.35%)   6432 bytes ( 0.47%)
                         OpDot =    228 hits ( 0.30%)   4560 bytes ( 0.33%)
                     OpTypeInt =    200 hits ( 0.26%)   3200 bytes ( 0.23%)
           OpVectorTimesMatrix =    191 hits ( 0.25%)   3820 bytes ( 0.28%)
                      OpReturn =    158 hits ( 0.21%)    632 bytes ( 0.05%)
                        OpIAdd =    154 hits ( 0.20%)   3080 bytes ( 0.22%)
                        OpFDiv =    152 hits ( 0.20%)   3040 bytes ( 0.22%)
                  OpTypeMatrix =    150 hits ( 0.20%)   2400 bytes ( 0.18%)
                        OpIMul =    117 hits ( 0.15%)   2340 bytes ( 0.17%)
                     OpFNegate =    117 hits ( 0.15%)   1872 bytes ( 0.14%)
                  OpTypeStruct =    107 hits ( 0.14%)   1340 bytes ( 0.10%)
            OpTypeSampledImage =    102 hits ( 0.13%)   1224 bytes ( 0.09%)
                   OpTypeImage =    102 hits ( 0.13%)   3672 bytes ( 0.27%)
                   OpTypeArray =    101 hits ( 0.13%)   1616 bytes ( 0.12%)
                  OpEntryPoint =    100 hits ( 0.13%)   5352 bytes ( 0.39%)
              OpMemberDecorate =    100 hits ( 0.13%)   2000 bytes ( 0.15%)
                  OpCapability =    100 hits ( 0.13%)    800 bytes ( 0.06%)
                    OpTypeVoid =    100 hits ( 0.13%)    800 bytes ( 0.06%)
               OpExtInstImport =    100 hits ( 0.13%)   2400 bytes ( 0.18%)
                 OpMemoryModel =    100 hits ( 0.13%)   1200 bytes ( 0.09%)
                   OpTypeFloat =    100 hits ( 0.13%)   1200 bytes ( 0.09%)
        OpFOrdGreaterThanEqual =     84 hits ( 0.11%)   1680 bytes ( 0.12%)
             OpFOrdGreaterThan =     78 hits ( 0.10%)   1560 bytes ( 0.11%)
                    OpTypeBool =     75 hits ( 0.10%)    600 bytes ( 0.04%)
                OpFOrdLessThan =     73 hits ( 0.10%)   1460 bytes ( 0.11%)
               OpExecutionMode =     63 hits ( 0.08%)    756 bytes ( 0.06%)
                  OpLogicalNot =     36 hits ( 0.05%)    576 bytes ( 0.04%)
      OpImageSampleExplicitLod =     36 hits ( 0.05%)   1008 bytes ( 0.07%)
           OpMatrixTimesScalar =     32 hits ( 0.04%)    640 bytes ( 0.05%)
                   OpLoopMerge =     24 hits ( 0.03%)    384 bytes ( 0.03%)
           OpMatrixTimesVector =     16 hits ( 0.02%)    320 bytes ( 0.02%)
           OpMatrixTimesMatrix =     15 hits ( 0.02%)    300 bytes ( 0.02%)
  OpImageSampleDrefExplicitLod =     14 hits ( 0.02%)    448 bytes ( 0.03%)
                   OpSLessThan =     14 hits ( 0.02%)    280 bytes ( 0.02%)
                 OpConvertFToS =      8 hits ( 0.01%)    128 bytes ( 0.01%)
                      OpFwidth =      8 hits ( 0.01%)    128 bytes ( 0.01%)
                   OpTranspose =      6 hits ( 0.01%)     96 bytes ( 0.01%)
                   OpLogicalOr =      4 hits ( 0.01%)     80 bytes ( 0.01%)
                OpFOrdNotEqual =      4 hits ( 0.01%)     80 bytes ( 0.01%)
                        OpKill =      4 hits ( 0.01%)     16 bytes ( 0.00%)
                         OpPhi =      3 hits ( 0.00%)     84 bytes ( 0.01%)
           OpFOrdLessThanEqual =      2 hits ( 0.00%)     40 bytes ( 0.00%)
             OpLogicalNotEqual =      2 hits ( 0.00%)     40 bytes ( 0.00%)

The talos folder is dominated once again by loads and stores. Next is OpCompositeExtract – which is extracting an element from a composite (aggregate, matrix or vector). I’d take a guess that there is a lot of vector math going on in these shaders, as the sixth most used opcode is OpVectorShuffle.

unity

Totals: 77783 hits 1346368 bytes
                    OpDecorate =  18764 hits (24.12%) 235700 bytes (17.51%)
                        OpLoad =  13131 hits (16.88%) 210096 bytes (15.60%)
                       OpStore =   6949 hits ( 8.93%)  83388 bytes ( 6.19%)
                 OpAccessChain =   5562 hits ( 7.15%) 115316 bytes ( 8.56%)
               OpVectorShuffle =   3883 hits ( 4.99%) 125980 bytes ( 9.36%)
                        OpName =   3428 hits ( 4.41%)  69392 bytes ( 5.15%)
              OpMemberDecorate =   3145 hits ( 4.04%)  59492 bytes ( 4.42%)
                    OpVariable =   3062 hits ( 3.94%)  48992 bytes ( 3.64%)
                        OpFMul =   3017 hits ( 3.88%)  60340 bytes ( 4.48%)
                  OpMemberName =   2293 hits ( 2.95%)  72892 bytes ( 5.41%)
                        OpFAdd =   1977 hits ( 2.54%)  39540 bytes ( 2.94%)
                    OpConstant =   1784 hits ( 2.29%)  28544 bytes ( 2.12%)
                 OpTypePointer =   1742 hits ( 2.24%)  27872 bytes ( 2.07%)
                     OpExtInst =    916 hits ( 1.18%)  24552 bytes ( 1.82%)
                     OpFNegate =    831 hits ( 1.07%)  13296 bytes ( 0.99%)
                       OpLabel =    726 hits ( 0.93%)   5808 bytes ( 0.43%)
                         OpDot =    686 hits ( 0.88%)  13720 bytes ( 1.02%)
          OpCompositeConstruct =    598 hits ( 0.77%)  14156 bytes ( 1.05%)
                   OpTypeArray =    496 hits ( 0.64%)   7936 bytes ( 0.59%)
                      OpBranch =    381 hits ( 0.49%)   3048 bytes ( 0.23%)
                  OpTypeStruct =    367 hits ( 0.47%)  12108 bytes ( 0.90%)
                  OpTypeVector =    347 hits ( 0.45%)   5552 bytes ( 0.41%)
            OpCompositeExtract =    269 hits ( 0.35%)   5380 bytes ( 0.40%)
      OpImageSampleImplicitLod =    241 hits ( 0.31%)   4820 bytes ( 0.36%)
           OpConstantComposite =    232 hits ( 0.30%)   5548 bytes ( 0.41%)
                        OpFDiv =    217 hits ( 0.28%)   4340 bytes ( 0.32%)
                     OpTypeInt =    215 hits ( 0.28%)   3440 bytes ( 0.26%)
           OpBranchConditional =    210 hits ( 0.27%)   3360 bytes ( 0.25%)
              OpSelectionMerge =    206 hits ( 0.26%)   2472 bytes ( 0.18%)
             OpSourceExtension =    157 hits ( 0.20%)   5372 bytes ( 0.40%)
                 OpFunctionEnd =    130 hits ( 0.17%)    520 bytes ( 0.04%)
                      OpReturn =    130 hits ( 0.17%)    520 bytes ( 0.04%)
                    OpFunction =    130 hits ( 0.17%)   2600 bytes ( 0.19%)
                  OpCapability =    123 hits ( 0.16%)    984 bytes ( 0.07%)
                OpTypeFunction =    117 hits ( 0.15%)   1420 bytes ( 0.11%)
                    OpTypeVoid =    113 hits ( 0.15%)    904 bytes ( 0.07%)
                      OpSource =    113 hits ( 0.15%)   1356 bytes ( 0.10%)
                  OpEntryPoint =    113 hits ( 0.15%)   5712 bytes ( 0.42%)
               OpExtInstImport =    113 hits ( 0.15%)   2712 bytes ( 0.20%)
                   OpTypeFloat =    113 hits ( 0.15%)   1356 bytes ( 0.10%)
                 OpMemoryModel =    113 hits ( 0.15%)   1356 bytes ( 0.10%)
                OpFOrdLessThan =     95 hits ( 0.12%)   1900 bytes ( 0.14%)
            OpTypeSampledImage =     91 hits ( 0.12%)   1092 bytes ( 0.08%)
                   OpTypeImage =     91 hits ( 0.12%)   3276 bytes ( 0.24%)
               OpExecutionMode =     89 hits ( 0.11%)   1100 bytes ( 0.08%)
                    OpTypeBool =     52 hits ( 0.07%)    416 bytes ( 0.03%)
      OpImageSampleExplicitLod =     40 hits ( 0.05%)   1120 bytes ( 0.08%)
                OpFunctionCall =     27 hits ( 0.03%)    532 bytes ( 0.04%)
                OpFOrdNotEqual =     23 hits ( 0.03%)    460 bytes ( 0.03%)
                        OpIAdd =     23 hits ( 0.03%)    460 bytes ( 0.03%)
                     OpBitcast =     22 hits ( 0.03%)    352 bytes ( 0.03%)
           OpFunctionParameter =     15 hits ( 0.02%)    180 bytes ( 0.01%)
                   OpFOrdEqual =      9 hits ( 0.01%)    180 bytes ( 0.01%)
                      OpSelect =      5 hits ( 0.01%)    120 bytes ( 0.01%)
                   OpINotEqual =      5 hits ( 0.01%)    100 bytes ( 0.01%)
                  OpEmitVertex =      5 hits ( 0.01%)     20 bytes ( 0.00%)
                        OpIMul =      5 hits ( 0.01%)    100 bytes ( 0.01%)
  OpImageSampleDrefExplicitLod =      5 hits ( 0.01%)    160 bytes ( 0.01%)
                        OpKill =      5 hits ( 0.01%)     20 bytes ( 0.00%)
                 OpConvertUToF =      4 hits ( 0.01%)     64 bytes ( 0.00%)
                   OpSLessThan =      4 hits ( 0.01%)     80 bytes ( 0.01%)
              OpControlBarrier =      4 hits ( 0.01%)     64 bytes ( 0.00%)
                   OpLoopMerge =      4 hits ( 0.01%)     64 bytes ( 0.00%)
                      OpIEqual =      4 hits ( 0.01%)     80 bytes ( 0.01%)
        OpFOrdGreaterThanEqual =      3 hits ( 0.00%)     60 bytes ( 0.00%)
                        OpUMod =      2 hits ( 0.00%)     40 bytes ( 0.00%)
                   OpULessThan =      2 hits ( 0.00%)     40 bytes ( 0.00%)
            OpShiftLeftLogical =      2 hits ( 0.00%)     40 bytes ( 0.00%)
                OpEndPrimitive =      2 hits ( 0.00%)      8 bytes ( 0.00%)
                 OpConvertSToF =      1 hits ( 0.00%)     16 bytes ( 0.00%)
                  OpLogicalAnd =      1 hits ( 0.00%)     20 bytes ( 0.00%)
                     OpSNegate =      1 hits ( 0.00%)     16 bytes ( 0.00%)
             OpCompositeInsert =      1 hits ( 0.00%)     24 bytes ( 0.00%)
            OpTypeRuntimeArray =      1 hits ( 0.00%)     12 bytes ( 0.00%)

And lastly the unity folder. These shader modules are dominated by OpDecorate. The next three most used opcodes are OpLoad, OpStore and OpAccessChain – so loading and storing to variables is taking up a sizeable amount of the shader modules.

All Together

If we look at all the folders above as one output from spirv-stats instead:

Totals: 286932 hits 4985312 bytes
                        OpLoad =  49915 hits (17.40%) 798640 bytes (16.02%)
                       OpStore =  29233 hits (10.19%) 350796 bytes ( 7.04%)
                    OpDecorate =  23770 hits ( 8.28%) 312964 bytes ( 6.28%)
                 OpAccessChain =  20116 hits ( 7.01%) 421496 bytes ( 8.45%)
                    OpVariable =  19041 hits ( 6.64%) 304656 bytes ( 6.11%)
              OpMemberDecorate =  14332 hits ( 4.99%) 280916 bytes ( 5.63%)
                    OpConstant =  10823 hits ( 3.77%) 173168 bytes ( 3.47%)
                       OpLabel =   9915 hits ( 3.46%)  79320 bytes ( 1.59%)
               OpVectorShuffle =   9732 hits ( 3.39%) 308372 bytes ( 6.19%)
            OpCompositeExtract =   9595 hits ( 3.34%) 193220 bytes ( 3.88%)
                        OpName =   9233 hits ( 3.22%) 164092 bytes ( 3.29%)
                        OpFMul =   8532 hits ( 2.97%) 170640 bytes ( 3.42%)
          OpCompositeConstruct =   6678 hits ( 2.33%) 166680 bytes ( 3.34%)
                        OpFAdd =   5922 hits ( 2.06%) 118440 bytes ( 2.38%)
                 OpTypePointer =   5486 hits ( 1.91%)  87776 bytes ( 1.76%)
                     OpExtInst =   5257 hits ( 1.83%) 145980 bytes ( 2.93%)
                      OpBranch =   5229 hits ( 1.82%)  41832 bytes ( 0.84%)
           OpBranchConditional =   3193 hits ( 1.11%)  51088 bytes ( 1.02%)
              OpSelectionMerge =   3109 hits ( 1.08%)  37308 bytes ( 0.75%)
                        OpFSub =   2668 hits ( 0.93%)  53360 bytes ( 1.07%)
                  OpMemberName =   2507 hits ( 0.87%)  78044 bytes ( 1.57%)
                OpFunctionCall =   2198 hits ( 0.77%)  58520 bytes ( 1.17%)
           OpConstantComposite =   2155 hits ( 0.75%)  50928 bytes ( 1.02%)
           OpFunctionParameter =   2117 hits ( 0.74%)  25404 bytes ( 0.51%)
                         OpDot =   1911 hits ( 0.67%)  38220 bytes ( 0.77%)
           OpVectorTimesScalar =   1488 hits ( 0.52%)  29760 bytes ( 0.60%)
                    OpFunction =   1398 hits ( 0.49%)  27960 bytes ( 0.56%)
                 OpFunctionEnd =   1398 hits ( 0.49%)   5592 bytes ( 0.11%)
                OpTypeFunction =   1175 hits ( 0.41%)  21076 bytes ( 0.42%)
                  OpTypeVector =   1110 hits ( 0.39%)  17760 bytes ( 0.36%)
                  OpTypeStruct =   1065 hits ( 0.37%)  58496 bytes ( 1.17%)
                     OpFNegate =   1038 hits ( 0.36%)  16608 bytes ( 0.33%)
                   OpTypeArray =   1038 hits ( 0.36%)  16608 bytes ( 0.33%)
      OpImageSampleImplicitLod =    969 hits ( 0.34%)  19660 bytes ( 0.39%)
                        OpFDiv =    961 hits ( 0.33%)  19220 bytes ( 0.39%)
                 OpReturnValue =    928 hits ( 0.32%)   7424 bytes ( 0.15%)
                   OpFOrdEqual =    722 hits ( 0.25%)  14440 bytes ( 0.29%)
                     OpTypeInt =    661 hits ( 0.23%)  10576 bytes ( 0.21%)
           OpFOrdLessThanEqual =    595 hits ( 0.21%)  11900 bytes ( 0.24%)
                OpFOrdLessThan =    588 hits ( 0.20%)  11760 bytes ( 0.24%)
                      OpIEqual =    586 hits ( 0.20%)  11720 bytes ( 0.24%)
                      OpReturn =    525 hits ( 0.18%)   2100 bytes ( 0.04%)
                        OpIAdd =    465 hits ( 0.16%)   9300 bytes ( 0.19%)
                   OpTypeImage =    437 hits ( 0.15%)  15732 bytes ( 0.32%)
            OpTypeSampledImage =    437 hits ( 0.15%)   5244 bytes ( 0.11%)
        OpFOrdGreaterThanEqual =    412 hits ( 0.14%)   8240 bytes ( 0.17%)
             OpFOrdGreaterThan =    391 hits ( 0.14%)   7820 bytes ( 0.16%)
      OpImageSampleExplicitLod =    376 hits ( 0.13%)  11128 bytes ( 0.22%)
                  OpCapability =    372 hits ( 0.13%)   2976 bytes ( 0.06%)
               OpExtInstImport =    341 hits ( 0.12%)   8184 bytes ( 0.16%)
                 OpMemoryModel =    341 hits ( 0.12%)   4092 bytes ( 0.08%)
                  OpEntryPoint =    341 hits ( 0.12%)  17808 bytes ( 0.36%)
                    OpTypeVoid =    341 hits ( 0.12%)   2728 bytes ( 0.05%)
                   OpTypeFloat =    341 hits ( 0.12%)   4092 bytes ( 0.08%)
                  OpLogicalAnd =    331 hits ( 0.12%)   6620 bytes ( 0.13%)
                         OpPhi =    281 hits ( 0.10%)   7868 bytes ( 0.16%)
                  OpTypeMatrix =    255 hits ( 0.09%)   4080 bytes ( 0.08%)
               OpExecutionMode =    235 hits ( 0.08%)   2852 bytes ( 0.06%)
  OpImageSampleDrefExplicitLod =    226 hits ( 0.08%)   7232 bytes ( 0.15%)
                    OpTypeBool =    212 hits ( 0.07%)   1696 bytes ( 0.03%)
           OpVectorTimesMatrix =    194 hits ( 0.07%)   3880 bytes ( 0.08%)
             OpSourceExtension =    167 hits ( 0.06%)   5732 bytes ( 0.11%)
                  OpLogicalNot =    160 hits ( 0.06%)   2560 bytes ( 0.05%)
                      OpSource =    141 hits ( 0.05%)   1692 bytes ( 0.03%)
                        OpIMul =    135 hits ( 0.05%)   2700 bytes ( 0.05%)
                 OpConvertSToF =    116 hits ( 0.04%)   1856 bytes ( 0.04%)
                        OpFMod =    114 hits ( 0.04%)   2280 bytes ( 0.05%)
                   OpLogicalOr =     93 hits ( 0.03%)   1860 bytes ( 0.04%)
                   OpSLessThan =     92 hits ( 0.03%)   1840 bytes ( 0.04%)
                   OpLoopMerge =     84 hits ( 0.03%)   1344 bytes ( 0.03%)
                      OpSelect =     68 hits ( 0.02%)   1632 bytes ( 0.03%)
                 OpConvertFToS =     67 hits ( 0.02%)   1072 bytes ( 0.02%)
                     OpBitcast =     66 hits ( 0.02%)   1056 bytes ( 0.02%)
           OpMatrixTimesVector =     48 hits ( 0.02%)    960 bytes ( 0.02%)
            OpShiftLeftLogical =     46 hits ( 0.02%)    920 bytes ( 0.02%)
                OpFOrdNotEqual =     43 hits ( 0.01%)    860 bytes ( 0.02%)
                        OpKill =     40 hits ( 0.01%)    160 bytes ( 0.00%)
           OpSGreaterThanEqual =     39 hits ( 0.01%)    780 bytes ( 0.02%)
           OpMatrixTimesScalar =     32 hits ( 0.01%)    640 bytes ( 0.01%)
                        OpISub =     21 hits ( 0.01%)    420 bytes ( 0.01%)
              OpSLessThanEqual =     21 hits ( 0.01%)    420 bytes ( 0.01%)
           OpMatrixTimesMatrix =     15 hits ( 0.01%)    300 bytes ( 0.01%)
                        OpSDiv =     12 hits ( 0.00%)    240 bytes ( 0.00%)
                      OpFwidth =      8 hits ( 0.00%)    128 bytes ( 0.00%)
                 OpConvertUToF =      6 hits ( 0.00%)     96 bytes ( 0.00%)
                   OpTranspose =      6 hits ( 0.00%)     96 bytes ( 0.00%)
                OpConstantTrue =      5 hits ( 0.00%)     60 bytes ( 0.00%)
                   OpINotEqual =      5 hits ( 0.00%)    100 bytes ( 0.00%)
                  OpEmitVertex =      5 hits ( 0.00%)     20 bytes ( 0.00%)
               OpConstantFalse =      5 hits ( 0.00%)     60 bytes ( 0.00%)
              OpControlBarrier =      4 hits ( 0.00%)     64 bytes ( 0.00%)
                         OpAny =      4 hits ( 0.00%)     64 bytes ( 0.00%)
                     OpSNegate =      3 hits ( 0.00%)     48 bytes ( 0.00%)
             OpLogicalNotEqual =      3 hits ( 0.00%)     60 bytes ( 0.00%)
        OpVectorExtractDynamic =      3 hits ( 0.00%)     60 bytes ( 0.00%)
                   OpULessThan =      2 hits ( 0.00%)     40 bytes ( 0.00%)
                        OpUMod =      2 hits ( 0.00%)     40 bytes ( 0.00%)
                        OpDPdx =      2 hits ( 0.00%)     32 bytes ( 0.00%)
                        OpDPdy =      2 hits ( 0.00%)     32 bytes ( 0.00%)
                OpEndPrimitive =      2 hits ( 0.00%)      8 bytes ( 0.00%)
             OpCompositeInsert =      1 hits ( 0.00%)     24 bytes ( 0.00%)
            OpTypeRuntimeArray =      1 hits ( 0.00%)     12 bytes ( 0.00%)
                       OpUndef =      1 hits ( 0.00%)     12 bytes ( 0.00%)
                OpSGreaterThan =      1 hits ( 0.00%)     20 bytes ( 0.00%)

We can see that loading and storing dominate our shader modules, accounting for 28% of the opcode hits and 25% of the binary size.
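
For reference, gathering these per-opcode numbers is straightforward because of how SPIR-V is laid out: after the five-word header, the first word of every instruction packs the instruction's word count in the high 16 bits and its opcode in the low 16 bits. Below is a minimal sketch of that tallying loop in C; it is not the actual spirv-stats implementation, and the spirv_tally name and the array-based accumulators are purely illustrative.

#include <stddef.h>
#include <stdint.h>

// Tally hits and bytes per opcode for a SPIR-V module already in memory.
// Minimal sketch only: no opcode names, and no validation beyond the
// magic number and instruction bounds.
void spirv_tally(const uint32_t *words, size_t wordCount,
                 uint32_t hits[65536], uint64_t bytes[65536]) {
  if (wordCount < 5 || 0x07230203u != words[0]) {
    return; // not a SPIR-V module
  }

  // Instructions begin after the 5-word header. The first word of each
  // instruction holds the word count (high 16 bits) and opcode (low 16 bits).
  for (size_t i = 5; i < wordCount;) {
    const uint32_t length = words[i] >> 16;
    const uint32_t opcode = words[i] & 0xFFFFu;

    if (0 == length || (i + length) > wordCount) {
      return; // malformed instruction
    }

    hits[opcode] += 1;
    bytes[opcode] += length * sizeof(uint32_t);

    i += length;
  }
}

Each hits entry corresponds to the first column in the table above, and the bytes column is simply the word count of every instruction with that opcode multiplied by four.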

Summary

The tool showed us some interesting divergent trends across the providers of the SPIR-V shader modules. Thanks to Valve, Shadertoy, Croteam and Unity for allowing @aras_p to use their SPIR-V shaders when he wrote his smol-v tool. Otherwise I wouldn't have had such interesting source material to run my tool against!

The spirv-stats tool is available from its GitHub repository. I hope it is useful to someone!