A modern rendering architecture

Over the last few weeks I’ve been deep into exploration mode, sketching out and prototyping the rendering API that we will provide in The Machinery. By “rendering API” I basically mean a cross-platform, graphics API agnostic library for preparing and issuing draw and compute work targeting one or many GPUs.

In this post I want to give you an idea of what I’m aiming for and provide you with a status update of where I’m currently at.

So what goes into a modern rendering architecture in 2017? I would argue it mainly boils down to four things:

Flexibility: We want an API that is flexible in the sense that it can accommodate many different usage scenarios. We don’t want to make any hard assumptions about how the user intends to use the renderer. Even if the most common use case ends up being rendering pretty triangle soups, we still want to make it attractive for the user to experiment with other things — e.g. more exotic rendering pipelines, or making GPU power accessible to tool pipelines.

Simplicity: By simplicity I don’t mean ease of use of the API; what matters more is keeping the abstraction on top of the underlying graphics APIs and hardware as thin and simple as possible. Obviously we want to make it fun and easy to use, but only if that can be done without sacrificing performance.

Massive parallelism: In The Machinery we are using a fiber-based job system as our core building block for achieving multi-core parallelism. Our goal is to drive the entire update loop using jobs instead of having “controller”- (or “coordinator”-) threads responsible for that. Needless to say, we want to be able to go as data-parallel as possible when feeding the rendering API, with as few code sections executing serially as possible.

Scheduling: As you will see, what we end up with is a command-list-based architecture, very similar to how modern “explicit” graphics APIs work. There is one distinct difference though — we also provide a mechanism for reasoning about scheduling of work between different command lists. Without this it becomes very painful for the user to feed the API efficiently, as they would have to build some kind of sorting mechanism on top of it, likely resulting in additional buffering of commands and unnecessary overhead.

On top of the four main points above we also want to make it easy to reason about and expose modern low-level rendering concepts such as asynchronous compute, explicit multi-GPU execution, etc.

Alright, so now that we have established some prerequisites, let’s take a quick look at what we currently have in place.

In The Machinery almost all systems are built as plugins on top of a very minimalistic foundation library. The rendering API is no exception. Currently I’m working on two plugins in parallel, the rendering API itself, simply called renderer, and a Vulkan backend called render-backend-vulkan.

While the Vulkan backend is what has taken most of my time (ramping up on the API, typing until my fingers bleed — yes, it is kind of verbose — and rebooting my machine after various driver crashes), it is not very interesting for the sake of this blog post. The Vulkan backend’s responsibility is simply to take the data the renderer produces and translate it into Vulkan API calls.

So instead we will take a high-level look at what we have in place for the renderer.

The renderer consists of three main APIs and a few helper structures used when feeding them:

tm_renderer_backend_i

This is the common interface for communicating with any rendering backend (such as render-backend-vulkan) and is the interface that will be exposed to other plugins that want to do rendering without having to know anything about the actual backing graphics API. Currently it exposes two categories of functions:

  1. Swap chain management — functions for creating, destroying, resizing and presenting swap chains.

    A swap chain can be created either as a “virtual” swap chain (i.e. not linked to an actual OS window) or as a regular swap chain (linked to/associated with an OS window). When creating a swap chain the user has to provide a device_affinity bit-mask to indicate which device/GPU should own it.

    The device_affinity mask is returned from the graphics API specific backend (tm_vulkan_backend_i — so far we only support one graphics API) when creating a device. Each created device gets a unique bit associated with it, allowing the user to OR together multiple affinity masks and explicitly reason about one or many devices in a multi-GPU setup (see the sketch after this list).

  2. Management of tm_renderer_resource_command_buffer_i and tm_renderer_command_buffer_i — functions for creating, submitting and destroying one or many resource command buffers and command buffers.

    We will discuss the meaning of resource command buffers and command buffers in the next section; for now I just wanted to clarify why the functions for creating and destroying the buffers are part of the backend interface. The reason has to do with lifetime management of the buffers. Both resource command buffers and command buffers are created from pools and are marked as free and returned to their respective pool after the backend is done consuming them. However, only the backend knows when the buffers have been fully consumed, hence it becomes natural that the functions for creating and destroying them (or rather flagging a buffer for recycling) are exposed from the backend interface.

    The create-functions for both resource command buffers and command buffers take a device_affinity mask as an argument; this allows the backend to optimize device memory management when creating and updating buffers targeting multiple devices.
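To make the device_affinity idea a bit more concrete, here is a minimal sketch of how the masks compose and what a swap chain creation call through the backend interface could look like. All struct layouts, function names and signatures below are placeholders for illustration, not the actual interface.

```c
#include <stdint.h>

// Hypothetical sketch only -- the real interface is not spelled out in this post.
typedef struct tm_renderer_backend_o tm_renderer_backend_o;
typedef struct tm_window_o tm_window_o;

typedef struct tm_renderer_backend_i {
    tm_renderer_backend_o *inst;

    // Swap chain management. `device_affinity` decides which device(s) own the
    // swap chain; a null window could mean a "virtual" swap chain.
    uint32_t (*create_swap_chain)(tm_renderer_backend_o *inst, tm_window_o *window,
        uint32_t device_affinity);
    void (*destroy_swap_chain)(tm_renderer_backend_o *inst, uint32_t swap_chain);
    void (*present_swap_chain)(tm_renderer_backend_o *inst, uint32_t swap_chain);
} tm_renderer_backend_i;

// Each device created through the graphics API specific backend returns a unique
// affinity bit, so masks can be OR:ed together to target several GPUs at once.
static uint32_t create_main_swap_chain(tm_renderer_backend_i *backend, tm_window_o *window,
    uint32_t gpu0_affinity, uint32_t gpu1_affinity)
{
    const uint32_t both_gpus = gpu0_affinity | gpu1_affinity;
    (void)both_gpus; // a mask like this could be used for work targeting both GPUs
    return backend->create_swap_chain(backend->inst, window, gpu0_affinity);
}
```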

tm_renderer_resource_command_buffer_i

This is the interface for populating a resource command buffer. Both resource command buffers and command buffers are “free-threaded” — i.e. any number of them can be in flight in parallel, and as long as the user guarantees that only one thread is poking at each of them at a time, everything is fine.

The resource command buffers are used for creating and destroying GPU resources. Currently we only have three different types of resources:

  1. tm_renderer_buffer_t — represents all kinds of linear memory buffers, such as raw buffers, vertex buffers, index buffers and constant buffers.
  2. tm_renderer_image_buffer_t — represents all kinds of image buffers, such as textures and render targets.
  3. tm_renderer_shader_t — represents a graphics API specific structure describing a shader.

(Note: It seems very likely that we will also end up exposing some kind of additional resource describing a “resource binding scheme” for shaders — I just haven’t really designed that part yet. The general design idea, though, is to keep the number of different resource types very low.)
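Just to give a feel for the split between the three types, here is roughly what the declarations might look like. Treat the fields below as placeholders rather than final layouts; only the type names come from the actual design.

```c
#include <stdint.h>

// Placeholder declarations, not final layouts.
typedef struct tm_renderer_buffer_t {
    uint32_t size;          // size in bytes of the linear memory buffer
    uint32_t usage_flags;   // raw / vertex / index / constant buffer usage
} tm_renderer_buffer_t;

typedef struct tm_renderer_image_buffer_t {
    uint32_t width, height, depth;
    uint32_t mip_levels;
    uint32_t format;        // pixel format
    uint32_t usage_flags;   // texture / render target usage
} tm_renderer_image_buffer_t;

typedef struct tm_renderer_shader_t {
    const void *blob;       // graphics API specific shader data (e.g. SPIR-V for Vulkan)
    uint64_t blob_size;
} tm_renderer_shader_t;
```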

When the user calls a create-function we allocate a unique handle identifying the resource and write a small command header to an array. Any additional data that needs to be associated with the command (e.g. the resource declaration) gets written to a raw memory buffer, and a void pointer in the command header points to this additional data. (Memory allocation for the raw memory buffers happens in blocks of 2 MB; these blocks are recycled into a free pool after the buffer has been fully consumed by the backend.)
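Under the hood, that boils down to something along these lines. The names are made up for the sake of the sketch; the mechanics are as described above.

```c
#include <stdint.h>

// Rough sketch of the recording internals: a small header per command, with any
// additional payload living in 2 MB blocks of raw memory.
enum { RAW_BLOCK_SIZE = 2 * 1024 * 1024 };

typedef struct command_header_t {
    uint32_t command;   // what to do, e.g. create buffer / create image / destroy
    uint32_t handle;    // unique handle identifying the resource
    void *data;         // points at the payload, e.g. the resource declaration
} command_header_t;

typedef struct raw_block_t {
    uint8_t data[RAW_BLOCK_SIZE]; // recycled into a free pool once the backend is done
    uint32_t used;                // bytes written so far into this block
} raw_block_t;
```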

If the user wants to populate a buffer with content when creating it, she can call a second category of create-functions called map_create. These return a void pointer that the user can fill with content before submitting the buffer to the backend.
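In practice the flow looks something like the sketch below. The exact signature is an assumption; what matters is the pattern: create, get a pointer back, fill it, submit.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical signature for illustration; only the create/map/fill pattern matters.
typedef struct tm_renderer_resource_command_buffer_o tm_renderer_resource_command_buffer_o;

typedef struct tm_renderer_resource_command_buffer_i {
    tm_renderer_resource_command_buffer_o *inst;

    // Allocates a handle for a new buffer and returns a pointer to memory the
    // caller fills with the initial contents before the buffer is submitted.
    void *(*map_create_buffer)(tm_renderer_resource_command_buffer_o *inst,
        uint32_t size, uint32_t device_affinity, uint32_t *handle);
} tm_renderer_resource_command_buffer_i;

static uint32_t create_vertex_buffer(tm_renderer_resource_command_buffer_i *res_buf,
    const void *vertices, uint32_t size, uint32_t device_affinity)
{
    uint32_t handle = 0;
    void *dst = res_buf->map_create_buffer(res_buf->inst, size, device_affinity, &handle);
    memcpy(dst, vertices, size); // fill out the contents before submitting to the backend
    return handle;               // handle identifying the GPU resource
}
```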

As updating and resizing of existing buffers need proper scheduling to behave as intended, those commands go through the command buffer instead.

tm_renderer_command_buffer_i

Resource command buffers and command buffers share most of their underlying architecture, with one major exception — command buffers support sorting of their commands. In other words, each command in a tm_renderer_command_buffer_i knows its absolute execution order in the backend relative to other commands (often spanning multiple command buffers). This is implemented by associating each command with a 64-bit sort key; before submitting a bunch of buffers to the backend, we merge their command arrays and sort them based on this key.

This allows the user to go wide when populating command buffers without having to care about the final execution ordering of the commands in the backend, as long as all the systems populating them know how to reason about some kind of absolute sort order.
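The submit step itself is conceptually very simple; something like the sketch below, where the struct and function names are assumptions and the sort is just illustrative.

```c
#include <stdint.h>
#include <stdlib.h>

// Conceptual sketch of the submit step: merge the command arrays from all submitted
// command buffers and sort them on the 64-bit key so the backend consumes them in
// one absolute order.
typedef struct sortable_command_t {
    uint64_t sort_key;   // absolute execution order, often spanning multiple buffers
    const void *command; // the encoded command the key belongs to
} sortable_command_t;

static int compare_sort_keys(const void *a, const void *b)
{
    const uint64_t ka = ((const sortable_command_t *)a)->sort_key;
    const uint64_t kb = ((const sortable_command_t *)b)->sort_key;
    return ka < kb ? -1 : (ka > kb ? 1 : 0);
}

static void sort_merged_commands(sortable_command_t *merged, uint32_t count)
{
    // A stable or radix sort might be preferable in practice; qsort keeps the sketch short.
    qsort(merged, count, sizeof(*merged), compare_sort_keys);
}
```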

You could argue that this is a small thing and that we could just as well let the user be responsible for populating command buffers in the right order, but in my experience providing a low-level concept for scheduling commands across multiple buffers helps a lot when it comes to keeping higher-level systems decoupled.

Command buffers are used for feeding the backend with any kind of work that does not relate to resource creation/destruction, such as issuing draw calls/compute work, making bigger state changes (e.g. render target binding), updating and resizing of resources and much more. I won’t go into any details regarding the actual commands just yet as most of them haven’t been implemented or even designed at this point.

Wrap up

There’s still a lot of work to be done before the renderer for The Machinery reaches some kind of feature completeness, but in general I’m happy with how it is shaping up. I’ve deliberately left out a bunch of details as I’m not certain they will hold up once we start experimenting with building various rendering pipelines on top of the renderer, so it’s likely I will revisit them in future posts when the code has matured a bit.

by Tobias Persson