Moving The Machinery to Bindless

It’s been quite some time since I wrote anything about the low-level rendering architecture in The Machinery, but recently a lot has happened, and one thing in particular feels worth covering: I’m happy to say that we have finally reached a pretty nice solution to one of my biggest frustration points in our rendering APIs, the resource binding model, i.e. the way we expose access to GPU resources (such as buffers or textures) to the shaders.

I’ve blogged about this topic before.

The fact that I wrote a post titled “Efficient binding of shader resources” feels a bit ironic today since what I ended up with was quite far from being efficient. Sure, it wasn’t a total catastrophe, but I’ve always felt it should be possible to create something better. Unfortunately back then, in the early Vulkan 1.0 days, there wasn’t proper support for going full bindless in Vulkan.

Bindless is a somewhat overloaded term, so for the sake of clarity, in this post I’ll use it to describe a binding model that allows for a constant number of bind points per GPU resource, which can be looked up from any shader using simple array indirection.

Resource Handles

It’s important to realize that there’s not always a 1:1 relationship between resources and bind points. The reason is that a resource can have multiple views depending on how it will be accessed by the graphics pipeline. The most common view separation is between read and write access of a resource, but there’s also a need to bind views for sub-resource access, such as the individual 2D surfaces in a cubemap or mip chain.

Since all rendering is heavily multi-threaded in The Machinery, we carefully design our systems to avoid fine-grained access to centralized data stores, as that typically requires some kind of locking. Having any kind of lock in high-throughput multi-threaded code very easily leads to serious lock contention, crippling performance.

For the binding model, what that means is that we’d like to allocate the bind points once during resource creation, and have them available throughout the lifetime of the resource in a form that requires no centralized lookup before we can pass them to a shader. Luckily, a bind point is just an array index represented as a uint32_t that we can embed in the handle returned from the tm_renderer_resource_command_buffer_api responsible for all GPU resource creation:

// A `tm_renderer_handle_t` identifies a GPU resource allocated using one of the
// `create_*()` methods in the [[tm_renderer_resource_command_buffer_api]].
typedef struct tm_renderer_handle_t
{
    // Handle identifying a resource in the render backend.
    uint32_t resource;

    // Optional bindless SRV & UAV handles identifying the bindless array element
    // of the resource if the backend is set up to run in bindless mode.
    uint32_t bindless_srv;
    uint32_t bindless_uav;
} tm_renderer_handle_t;

// Interface for allocating and deallocating render resources, free-threaded.
struct tm_renderer_resource_command_buffer_api
{
    // Creates a buffer described by `tm_render_buffer_t` and returns a handle to
    // it.
    tm_renderer_handle_t (*create_buffer)(
        struct tm_renderer_resource_command_buffer_o *inst,
        const struct tm_renderer_buffer_desc_t *buffer,
        uint32_t device_affinity_mask);

    // and so on...
};
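To make the flow concrete, here’s a rough usage sketch. Note that the contents of the `tm_renderer_buffer_desc_t` and the affinity mask value are illustrative placeholders, not the actual API:

// Illustrative only: the `tm_renderer_buffer_desc_t` fields and the affinity
// mask value below are placeholders, not the engine's actual API.
const struct tm_renderer_buffer_desc_t desc = { 0 }; // fill in size, usage, etc.
const tm_renderer_handle_t h = api->create_buffer(inst, &desc, 0);

// The bindless indices travel with the handle. No centralized lookup is
// needed later: `h.bindless_srv` (or `h.bindless_uav` for write access) can
// be written straight into shader-visible memory and used to index the
// global descriptor arrays from any shader.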

I’ve opted to use the DirectX terminology SRV (ShaderResourceView) and UAV (UnorderedAccessView) as I think that is what most graphics programmers are used to.

Vulkan implementation

In the Vulkan backend, the implementation relies on running on a Vulkan 1.2 instance, where [VK_EXT_descriptor_indexing](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_EXT_descriptor_indexing.html) is now a core feature. As you will see, where we used to have a rather complicated descriptor set management system, we now have something really simple. Let’s take a look at the actual implementation.

Setup

When we boot the backend, we create a single VkDescriptorPool, from which we allocate a single global VkDescriptorSet using a single VkDescriptorSetLayout:

const VkDescriptorSetLayoutBinding bindless_layout[] = {
    {
        .binding = 0, 
        .descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER,
        .descriptorCount = device->bindless_manager.setup.storage_buffers,
        .stageFlags = VK_SHADER_STAGE_ALL
    },
    {
        .binding = 1, 
        .descriptorType = VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE,
        .descriptorCount = device->bindless_manager.setup.sampled_images,
        .stageFlags = VK_SHADER_STAGE_ALL
    },
    {
        .binding = 2, 
        .descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_IMAGE,
        .descriptorCount = device->bindless_manager.setup.storage_images,
        .stageFlags = VK_SHADER_STAGE_ALL
    },
    {
        .binding = 3, 
        .descriptorType = VK_DESCRIPTOR_TYPE_SAMPLER,
        .descriptorCount = device->bindless_manager.setup.samplers,
        .stageFlags = VK_SHADER_STAGE_ALL
    },
    {
        .binding = 4,
        .descriptorType = VK_DESCRIPTOR_TYPE_ACCELERATION_STRUCTURE_KHR,
        .descriptorCount =
            device->bindless_manager.setup.acceleration_structures,
        .stageFlags = VK_SHADER_STAGE_ALL
    },
};
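Creating the layout, pool, and set from this is mostly boilerplate. Here’s a minimal sketch of what that one-time setup can look like; the exact flags in our backend may differ, but a descriptor-indexing setup needs the update-after-bind and partially-bound bits so descriptors can be written while command buffers referencing the set are in flight. Error handling is omitted, and `vk_device` is assumed to be the backend’s VkDevice:

// Sketch of the one-time setup (error handling omitted).
const uint32_t n_bindings = sizeof(bindless_layout) / sizeof(bindless_layout[0]);

VkDescriptorBindingFlags binding_flags[5];
VkDescriptorPoolSize pool_sizes[5];
for (uint32_t i = 0; i != n_bindings; ++i) {
    binding_flags[i] = VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT |
        VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT;
    pool_sizes[i] = (VkDescriptorPoolSize){
        .type = bindless_layout[i].descriptorType,
        .descriptorCount = bindless_layout[i].descriptorCount };
}

const VkDescriptorSetLayoutBindingFlagsCreateInfo flags_info = {
    .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO,
    .bindingCount = n_bindings,
    .pBindingFlags = binding_flags,
};
const VkDescriptorSetLayoutCreateInfo layout_info = {
    .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
    .pNext = &flags_info,
    .flags = VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT,
    .bindingCount = n_bindings,
    .pBindings = bindless_layout,
};
VkDescriptorSetLayout set_layout;
vkCreateDescriptorSetLayout(vk_device, &layout_info, 0, &set_layout);

const VkDescriptorPoolCreateInfo pool_info = {
    .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO,
    .flags = VK_DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND_BIT,
    .maxSets = 1,
    .poolSizeCount = n_bindings,
    .pPoolSizes = pool_sizes,
};
VkDescriptorPool pool;
vkCreateDescriptorPool(vk_device, &pool_info, 0, &pool);

const VkDescriptorSetAllocateInfo alloc_info = {
    .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO,
    .descriptorPool = pool,
    .descriptorSetCount = 1,
    .pSetLayouts = &set_layout,
};
VkDescriptorSet global_set;
vkAllocateDescriptorSets(vk_device, &alloc_info, &global_set);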

The application can optionally specify the array sizes; if nothing has been supplied, we default to arbitrary but rather beefy array sizes:

static const tm_vulkan_bindless_setup_t default_bindless_setup = {
    .storage_buffers = 512 * 1024,
    .samplers = 4 * 1024,
    .sampled_images = 512 * 1024,
    .storage_images = 64 * 1024,
    .acceleration_structures = 32 * 1024
};

During setup, we make sure we don’t exceed any device limitations. If we do, we log a warning and adjust accordingly.
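The relevant limits for an update-after-bind setup live in VkPhysicalDeviceVulkan12Properties (or the equivalent descriptor indexing properties struct). A sketch of the clamping, where `physical_device` is the backend’s VkPhysicalDevice and `log_warning()` is a stand-in for the engine’s actual logging API:

// Sketch: clamp requested sizes against the device's update-after-bind limits.
VkPhysicalDeviceVulkan12Properties props12 = {
    .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_2_PROPERTIES,
};
VkPhysicalDeviceProperties2 props2 = {
    .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2,
    .pNext = &props12,
};
vkGetPhysicalDeviceProperties2(physical_device, &props2);

tm_vulkan_bindless_setup_t *setup = &device->bindless_manager.setup;
if (setup->storage_buffers > props12.maxDescriptorSetUpdateAfterBindStorageBuffers) {
    log_warning("bindless: clamping storage_buffers to device limit");
    setup->storage_buffers = props12.maxDescriptorSetUpdateAfterBindStorageBuffers;
}
if (setup->sampled_images > props12.maxDescriptorSetUpdateAfterBindSampledImages) {
    log_warning("bindless: clamping sampled_images to device limit");
    setup->sampled_images = props12.maxDescriptorSetUpdateAfterBindSampledImages;
}
// ... same pattern for storage_images, samplers, and (via
// VkPhysicalDeviceAccelerationStructurePropertiesKHR) acceleration structures.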

To gracefully handle “null resource” access, where the user tries to bind a resource that hasn’t been allocated, we have dedicated fallback null resources that are always bound to the first n elements of each array.
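Since a single VkWriteDescriptorSet can target a range of array elements, filling those fallback slots is cheap. A sketch for the sampled image array, assuming a dummy `null_image_view` created at boot and the `global_set` allocated during setup:

// Sketch: point the first `n` sampled-image slots at a dummy image view so
// that a "null" bindless index reads well-defined data.
VkDescriptorImageInfo infos[n];
for (uint32_t i = 0; i != n; ++i)
    infos[i] = (VkDescriptorImageInfo){
        .imageView = null_image_view,
        .imageLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
    };

const VkWriteDescriptorSet write = {
    .sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET,
    .dstSet = global_set,
    .dstBinding = 1, // the sampled image binding in the layout above
    .dstArrayElement = 0,
    .descriptorCount = n, // consumes elements [0, n) of the array
    .descriptorType = VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE,
    .pImageInfo = infos,
};
vkUpdateDescriptorSets(vk_device, 1, &write, 0, 0);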

Management

For each descriptor type, we maintain a free array holding previously freed indices, as well as a next counter pointing to the next unallocated index. During resource creation, we determine how many bind points the resource will need and scan for a consecutive range in the free array. If that fails, we allocate a consecutive range starting at the “next unallocated index” and bump it.
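In sketch form (locking elided, and assuming the free array is kept sorted so consecutive ranges are easy to find), the allocation path might look like this; the real implementation likely differs in detail:

#include <stdint.h>
#include <string.h>

// Sketch of a per-descriptor-type bind point allocator. `free_indices` is a
// sorted array of previously freed indices; `next` is the first index that
// has never been handed out.
typedef struct index_allocator_t {
    uint32_t *free_indices;
    uint32_t num_free;
    uint32_t next;
    uint32_t capacity; // total array size from the bindless setup
} index_allocator_t;

// Returns the first index of a consecutive `count` range, or UINT32_MAX if
// the descriptor array is exhausted.
static uint32_t allocate_bind_points(index_allocator_t *a, uint32_t count)
{
    // First, try to reuse a consecutive run from the (sorted) free array.
    for (uint32_t i = 0; i + count <= a->num_free; ++i) {
        if (a->free_indices[i + count - 1] == a->free_indices[i] + count - 1) {
            const uint32_t first = a->free_indices[i];
            memmove(a->free_indices + i, a->free_indices + i + count,
                (a->num_free - i - count) * sizeof(uint32_t));
            a->num_free -= count;
            return first;
        }
    }
    // Otherwise, bump-allocate from the never-used tail of the array.
    if (a->next + count > a->capacity)
        return UINT32_MAX;
    const uint32_t first = a->next;
    a->next += count;
    return first;
}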

Returning the bind points during resource destruction relies on the same mechanism we use for destroying the actual resource. For each vkQueueSubmit() we have an associated VkFence that we monitor for completion. Once we are certain that there are no more commands in-flight referencing the resource, we destroy it, release its memory, and return the bind points to the appropriate free arrays.
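A sketch of what that garbage collection step can look like; `destroy_resource_and_free_bind_points()` is a hypothetical helper that destroys the Vulkan objects and hands the indices back to the allocators described above:

// Sketch: per-submit fences drive deferred destruction. Each queued entry
// remembers the fence of the last vkQueueSubmit() that may reference it.
typedef struct delayed_destroy_t {
    VkFence fence;
    tm_renderer_handle_t handle;
} delayed_destroy_t;

static void garbage_collect(VkDevice vk_device, delayed_destroy_t *queue,
    uint32_t *num_queued)
{
    for (uint32_t i = 0; i != *num_queued;) {
        if (vkGetFenceStatus(vk_device, queue[i].fence) == VK_SUCCESS) {
            // Fence signaled: no in-flight commands reference the resource.
            destroy_resource_and_free_bind_points(queue[i].handle);
            queue[i] = queue[--*num_queued]; // swap-erase
        } else {
            ++i;
        }
    }
}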

Command list building

As we are now only dealing with a single VkDescriptorSet and a single VkDescriptorSetLayout, a lot of the performance-sensitive code around pipeline creation and descriptor set binding simply goes away. We end up with far fewer VkPipelines and only have to bind the global descriptor set once per VkPipelineBindPoint and command buffer.
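In other words, per command buffer the descriptor binding boils down to something like this (with `cmd`, `pipeline_layout`, and `global_set` coming from the setup above):

// Sketch: one bind per pipeline bind point and command buffer; after this,
// draws and dispatches reach individual resources through indices alone.
vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS,
    pipeline_layout, 0, 1, &global_set, 0, 0);
vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE,
    pipeline_layout, 0, 1, &global_set, 0, 0);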

We still have the old code path hanging around for cross-referencing and as a potential fallback if we need to bring up The Machinery on a Vulkan-capable platform without full support for [VK_EXT_descriptor_indexing](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_EXT_descriptor_indexing.html). But just looking at the sheer amount of code related to the old way of doing descriptor set management makes my whole body itch; I want to nuke it so badly! And to be honest, it’s very likely to rot as that code path isn’t getting exercised, so I will probably remove it soon. If worst comes to worst, I’ll bite the bullet and re-implement something similar in the future.

Not too surprisingly, the new code runs A LOT faster too. In one of my heavier test scenes, I measured a 6x speed improvement during command buffer building compared to the old model.

Shader System Integration

We have a Shader System in The Machinery that provides a high-level abstraction for authoring shaders, supporting concepts such as shader variation selection, multi-pass shaders, and viewer contexts.

When you author a shader that runs through the shader system, it also provides you with the concepts of a constant buffer and a resource binder: basically, mechanisms that allow the author to declare named constants and resources that the shader operates on.

The tm_shader_api exposes functions for creating, updating, and destroying instances of shader constant buffers and resource binders. On the inside, they map to two raw buffers (ByteAddressBuffers): one that holds all the instanced constant buffers and one that holds all the instanced resource binders. The content of a resource binder instance is just an array of indices, derived from tm_renderer_handle_t::bindless_srv or tm_renderer_handle_t::bindless_uav.

For all the shader stages except ray tracing stages, we then encode the identity of the constant buffer and resource binder instances into a few Push Constants. For the ray tracing stages, we use the exact same approach but expose the identity through the Shader Record (written to the Shader Binding Table).
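As a sketch, the push constant block could be as small as two offsets into the raw buffers described above; the struct layout and the variables here are illustrative, not the engine’s actual ones:

// Illustrative layout, not the engine's actual one: two offsets locating the
// instance data inside the constant buffer and resource binder raw buffers.
typedef struct shader_instance_ids_t {
    uint32_t constant_buffer_offset;
    uint32_t resource_binder_offset;
} shader_instance_ids_t;

const shader_instance_ids_t ids = {
    .constant_buffer_offset = cbuffer_instance_offset, // assumed, from tm_shader_api
    .resource_binder_offset = binder_instance_offset,  // assumed, from tm_shader_api
};
vkCmdPushConstants(cmd, pipeline_layout, VK_SHADER_STAGE_ALL, 0,
    sizeof(ids), &ids);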

The shader system automatically emits HLSL load/get functions for all the requested constants and resources, resolving the exact binding information by querying the low-level shader compiler of each render backend based on the resource type, usage flags, and desired access.

What this means in practice is that as long as users author their shaders through our shader system, it acts as an abstraction over how the low-level binding model in the backend is implemented, providing a platform- and graphics-API-independent interface for accessing resources and constants from both shader and C code.

Wrap up

After spending some time ironing out initial bumps, where we didn’t correctly respect all device limits and accidentally broke a NonUniformResourceIndex() (due to the additional array indirection), everything now appears to be well-behaved and stable. Performance is great, the code is way simpler, and everything mapped really well to the ray tracing pipeline.

by Tobias Persson