Recording Statistics — An Exercise in Minimalism
I recently added a Statistics view to The Machinery. It can be used to display various real-time statistics from the engine:
In addition to drawing graphs, data can also be presented in tabular form, for more precise inspection:
Graphs are useful because they expose a lot of information in a way that can be quickly processed by the human visual cortex. Things like glitches, anomalies and bad patterns stand out in a way they don’t if you are just looking at numbers in a profiler:
- If you see occasional spikes in the frame rate but have a hard time pinpointing where they are coming from, you can plot the time spent in some of the major engine systems and see if one of them has spikes that coincide with the frame rate spikes. If it does, you have found the main source of the frame hitches.
- Plotting memory usage over time is a good way of finding leaks and other problematic patterns. Now, we already have memory leak detection in the engine, but it only detects memory that never gets freed. There are other kinds of “leaks” that can also cause problems. For example, you might have an array that just grows and grows, because elements keep getting added to it without being removed. This is not a true memory leak, because we are still keeping track of all allocations and the memory eventually gets freed when the array is destroyed — perhaps at the end of the level. But memory use still grows unbounded in this situation, causing the application to eventually run out of memory. This type of “leak” can happen even in garbage-collected languages.
In the statistics view, this shows up as a typical pattern of memory use that keeps rising while a level is being played and then drops sharply when the level is restarted. Finding the root cause can still be tricky — these kind of “leaks” are harder to fix than the “real” ones — but looking at the statistics can put you on the right track and let you narrow it down to a specific system.
- Another example — suppose you are noticing an unusually high number of draw calls in a certain part of a level. By keeping the draw call statistics on screen as you pan around with the camera, you can check if you notice any sudden jumps as certain objects enter and leave the view frustum. In this way, you can quickly pinpoint the problematic object.
In my case, I wanted to use the Statistics view to investigate a weird “jerkiness” in the camera that I was sometimes seeing. As I was panning the camera smoothly, it kept jumping back, almost as if time was going backward. Weirdly, I couldn’t capture the phenomenon with any screen capturing tools, only by filming the screen with my phone.
So I wanted to plot the frame rate and camera position over time to see if I could spot any patterns that could explain the behavior.
(Spoiler: I didn’t. Both the frame time and the camera position moved smoothly. The jerkiness seems to have something to do with running on a secondary monitor — it always runs smoothly on my main screen. This, together with the fact that I can’t screen capture the issue, seems to indicate that frames somehow get “reordered” as they are sent to the secondary monitor. Now, that sounds super weird to me and I’m not sure how that is even possible. Might be a topic for a future blog post if we can figure it out.)
But anyway, that was the impetus for creating this system. And even though it couldn’t resolve this particular issue, it’s still a very useful thing to have in the engine. In this post, I’ll talk a little about how it’s implemented.
Drawing the graphs is pretty straightforward, so let’s focus on the more interesting problem: how do we collect the data that we draw?
There are two main requirements on the data collection:
- First, it must be extensible. The Machinery is plugin-based and each plugin should be able to add its own counters to the system. In addition, there is a lot of game-specific data that is interesting to track too, such as the number of enemies, A* searches, etc.
- Second, the system must be fast enough that you can add statistics everywhere and not worry about leaving it in the code. Having the statistics always available lets you do exploratory investigations that you might not bother with if you have to explicitly insert and remove stats collection every time you want to check something. In particular, I find it very useful to have statistics counters for all profiling scopes, so we can examine them for hitches. This tends to be a lot of data, so the system needs to be efficient.
A straightforward API for data collection might look something like this:
```c
add_to_frame_counter("renderer/primitive-count", n);
```
Here we use a string to identify the data counter. (We need some human-readable identification so that we can browse for the counter in the UI.) The function call simply adds `n` to the current frame’s value for the counter.
Note that I’ve chosen to accumulate all the data for each counter over the frame. This is not a 100 % self-evident choice. Some sources might produce multiple data points per frame, or fewer than one data point per frame. For example, if you are looking at the size of inbound network packets, you might have a lot of frames where you get no packet at all and some frames where you get multiple packets. Instead of tracking the packet size per frame, we could treat each incoming packet as its own data point.
Here’s what the data might look like using the two different methods.
Creating a data point for each incoming packet:
| When (frame #) | 3.5 | 6.1 | 6.3 | 9.2 |
|---|---|---|---|---|
| Packet size | 782 | 1003 | 450 | 510 |
Accumulating the data per frame:
| Frame # | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Packet size | 0 | 0 | 0 | 782 | 0 | 0 | 1453 | 0 | 0 | 510 | 0 |
There might be some cases where it is useful to have the data disconnected from the frame rate as in the first table, but I think in most situations, the accumulated view in the second table is enough to tell you what is going on. It is also really valuable to have all the data sources in the same format — for instance, it means we can readily display them in the same table.
Also, if we accumulate the data per frame, we can automatically use the current frame time as the “clock” for all our data. If we disconnect the data points from the frames, we have to sample the clock each time we add a data point — adding more overhead.
For all these reasons, I like using frame-accumulated counters.
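As a small illustration, here is roughly what recording the network example above could look like with the frame-accumulated API (the packet handler and its fields are hypothetical, just to show the pattern):

```c
// Hypothetical packet handler. If several packets arrive during the same
// frame, the calls simply sum up in that frame's value for the counter
// (e.g. 1003 + 450 = 1453 in frame 6 of the table above).
static void on_packet_received(const packet_t *p)
{
    add_to_frame_counter("network/packet-size", (double)p->size);
}
```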
The implementation for our API might look something like this:
```c
// Number of frames of history kept for each counter (the value here is just
// an example).
enum { MAX_FRAMES = 4096 };

struct source_t {
    const char *name;
    double values[MAX_FRAMES];
};

struct stats_t {
    chash32_t name_to_idx;
    struct source_t *sources;
    uint64_t current_frame;
} *stats;   // Allocated when the statistics system is initialized.

void add_to_frame_counter(const char *name, double value)
{
    const uint64_t key = murmurhash(name);
    uint32_t idx = chash32_get(&stats->name_to_idx, key);
    if (!idx) {
        // First time we see this counter: register a new source. (Since 0
        // doubles as "not found" here, a real implementation would reserve
        // index 0 for a dummy source.)
        struct source_t source = {.name = name};
        carray_push(stats->sources, source);
        idx = carray_size(stats->sources) - 1;
        chash32_add(&stats->name_to_idx, key, idx);
    }
    struct source_t *source = stats->sources + idx;
    source->values[stats->current_frame % MAX_FRAMES] += value;
}
```
I use a `double` for the values so I can store both large integers and floats with good precision. You could use different types for different counters. This would buy you a little bit of performance at the price of added complexity. Again, I find it nice and simple to just have everything in the same format.
The system keeps track of the last `MAX_FRAMES` values for each counter, storing them in the `values` ring buffer, and uses a hash table to look up the counter based on its name.
Straightforward, but also pretty expensive. Every time we want to increment a counter, we have to hash the string and do a lookup in the hash table. Also, since `stats` is a global variable and we want to be able to record stats on multiple threads, we should probably put this whole thing in a critical section too, making things even slower.
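To make the cost concrete, a thread-safe version of this naive approach would have to look something like this (`stats_lock`, `mutex_lock()` and `mutex_unlock()` are hypothetical stand-ins for whatever mutex primitive the engine provides):

```c
void add_to_frame_counter(const char *name, double value)
{
    mutex_lock(&stats_lock);
    // ... string hash, hash table lookup and counter update as above ...
    mutex_unlock(&stats_lock);
}
```

Paying for a lock, a string hash and a hash table lookup on every single increment is a lot of overhead for something we want to be able to sprinkle all over the code.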
Let’s try to improve this.
The first thing we can do is cache the lookup. Looking up the same string will always result in the same value, so there is no reason for us to do the lookup every time — we can just store the result. To take advantage of this, we must make the caller keep track of the index. The API for the caller thus changes to something like:
```c
static uint32_t primitive_counter = 0;
if (!primitive_counter)
    primitive_counter = get_counter("renderer/primitive-count");

add_to_frame_counter(primitive_counter, n);
```
Since the result of the lookup is permanent, we can store it in a static variable. We could also store it in a struct or a global variable when the calling system is initialized, to avoid the `if (!primitive_counter)` test.
The implementation now looks something like this:
```c
uint32_t get_counter(const char *name)
{
    const uint64_t key = murmurhash(name);
    uint32_t idx = chash32_get(&stats->name_to_idx, key);
    if (idx)
        return idx;
    struct source_t source = {.name = name};
    carray_push(stats->sources, source);
    idx = carray_size(stats->sources) - 1;
    chash32_add(&stats->name_to_idx, key, idx);
    return idx;
}

void add_to_frame_counter(uint32_t idx, double value)
{
    struct source_t *source = stats->sources + idx;
    source->values[stats->current_frame % MAX_FRAMES] += value;
}
```
A big improvement – the hot code path `add_to_frame_counter()` is now significantly shorter.
One problem that still exists is that all these counters might be using a significant amount of memory. Each counter allocates `MAX_FRAMES * 8` bytes of memory for its values. If we have thousands of counters (which we can easily have with a counter for each profiler scope) and want to save a few thousand frames of history, so we can see how the data is trending, this adds up to multiple megabytes of data. For example, 2,000 counters with 2,000 frames of history each works out to 32 MB.
One way of getting around that is to only allocate the `values` array for the counters that the user is actively looking at in the Statistics view. This should be a small fraction of the total number. When we record, we check if the array has been allocated:
```c
void add_to_frame_counter(uint32_t idx, double value)
{
    // values is now a pointer (see the updated source_t below) that is only
    // non-NULL for counters the user is watching.
    struct source_t *source = stats->sources + idx;
    if (source->values)
        source->values[stats->current_frame % MAX_FRAMES] += value;
}
```
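The allocation itself would happen on the UI side, when the user starts watching a counter. A minimal sketch of what that could look like (the function name `watch_counter()` and the use of `calloc()` are my own; the engine would use its own allocators):

```c
// Hypothetical helper, called by the Statistics view when the user adds a
// counter to the display. For simplicity, the buffer is never freed again.
void watch_counter(uint32_t idx)
{
    struct source_t *source = stats->sources + idx;
    if (!source->values)
        source->values = calloc(MAX_FRAMES, sizeof(double));
}
```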
Overall this looks pretty decent, but can we do better? That if-statement is a possible branch misprediction. We also have the overhead of the function call itself. And we might need a critical section to protect against `stats->sources` growing and being reallocated by a different thread.
We could inline `add_to_frame_counter()`, but that would require us to put both `source_t` and `stats_t` in the header file, leaking a lot of implementation details. At Our Machinery, we like to keep the header files minimalistic and only expose public APIs, both to keep the compile times down and to reduce the mental load of anyone reading the code. So what else can we do?
Let’s take a step back and ask ourselves what is the minimum thing we need to perform the `add_to_frame_counter()` operation. We don’t need to know the index of this particular counter in the statistics system; all we really need is a place where we can write the counter’s new value. So why not have the `get_counter()` function return exactly that:
```c
struct source_t {
    const char *name;
    double *values;      // History ring buffer; only allocated for watched counters.
    double frame_value;  // Accumulator for the current frame's value.
};

double *get_counter(const char *name)
{
    // ...
    return &stats->sources[idx].frame_value;
}
```
Here, `get_counter()` just returns a pointer to an accumulator variable `frame_value`. We use that value to accumulate the counter over the frame and, at the end of the frame, we add it to the `values` array.
Here is how you would use it in the code:
```c
static double *primitive_counter;
if (!primitive_counter)
    primitive_counter = get_counter("renderer/primitive-count");

*primitive_counter += n;
```
Again, we could perform the initialization of `primitive_counter` at the initialization of this module, so we don’t have to check `if (!primitive_counter)` every frame.
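For example, a module that records several counters could cache all of its counter pointers in one place during initialization (the struct and the extra counter name are made up for illustration):

```c
// Counter pointers for this module, filled in once at load time.
static struct {
    double *primitive_count;
    double *draw_calls;
} counters;

// Called when the module (e.g. the renderer plugin) is loaded.
static void init_counters(void)
{
    counters.primitive_count = get_counter("renderer/primitive-count");
    counters.draw_calls = get_counter("renderer/draw-calls");
}
```

After that, recording in the hot path is just `*counters.primitive_count += n;`.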
Now there are no function calls, no branch statements, no critical sections, etc — the only thing we do is an assignment through a pointer.
Note that for this to work, the pointer returned by `get_counter()` must be permanent. I.e., once the `source_t` structure has been created, it cannot move in memory, since that would invalidate all the `get_counter()` pointers that have already been returned and cached.
There are several different ways to achieve this:
- You could use a fixed-sized array if you are OK with setting an upper limit on the number of sources.
- You could use a VM allocated array as described in a previous post.
- Or, you could allocate the sources using a sequence of fixed-sized blocks, instead of in a single big array (sketched below).
Since we never delete counters, we don’t have to keep track of a freelist (as you usually do when you have an array where you want fixed element pointers).
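Here is a rough sketch of the block-based option (the block size, names and use of `calloc()` are mine; a real implementation would also check the `MAX_BLOCKS` limit and use the engine’s allocators):

```c
enum { SOURCES_PER_BLOCK = 1024, MAX_BLOCKS = 64 };

// Blocks are allocated on demand and never moved or freed, so a pointer to
// any source_t (and thus to its frame_value) stays valid forever.
static struct source_t *source_blocks[MAX_BLOCKS];
static uint32_t num_sources;

static struct source_t *source_at(uint32_t idx)
{
    return &source_blocks[idx / SOURCES_PER_BLOCK][idx % SOURCES_PER_BLOCK];
}

static struct source_t *add_source(void)
{
    const uint32_t idx = num_sources++;
    if (idx % SOURCES_PER_BLOCK == 0)
        source_blocks[idx / SOURCES_PER_BLOCK] = calloc(SOURCES_PER_BLOCK, sizeof(struct source_t));
    return source_at(idx);
}
```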
We copy the current `frame_value` into the `values` array when we register a new frame in the statistics system (same place where we increase the `current_frame` counter):
```c
static void push_frame(void)
{
    const uint64_t i = stats->current_frame % MAX_FRAMES;
    ++stats->current_frame;
    for (struct source_t *s = stats->sources; s != carray_end(stats->sources); ++s) {
        // Only counters the user is watching have a history buffer.
        if (s->values)
            s->values[i] = s->frame_value;
        s->frame_value = 0;
    }
}
```
With this, we have a very simple and fast way of recording arbitrary statistics data for later rendering.