Virtual Memory Tricks
I’ve been planning for some time to make a post about things you can do with virtual memory, so when @jimsagevid tweeted me about virtual memory in response to my last post, I felt the time was right.
Virtual memory is funny. As programmers, we know that it’s there (on all modern CPUs and OSs), but we tend to forget about it, perhaps because it’s not directly exposed in our programming languages. Or, we just think of it as the thing that makes our software run really slow instead of crashing, when it runs out of physical RAM.
But, it turns out, if you actually make use of what virtual memory can do, you can achieve some pretty cool things.
Obscenely big array
To recap — in my last post I faced the problem of wanting a big lookup table from object IDs to object pointers. Furthermore, I wanted this table to stay fixed in memory, so that writers could atomically replace pointers in the table without disturbing readers. This means that using something like a std::vector is not possible, since the underlying data will be reallocated when the vector grows.
A fixed array works:

object_o *objects[MAX_OBJECTS];
But what size should we use for the array? Pick a big size, and we’re wasting memory. Pick a small size, and we might run out of objects. In the article, I used a hierarchical approach, but as @jimsagevid pointed out, we could use virtual memory instead, and avoid the complexity of a hierarchical table layout.
When we allocate memory through the virtual memory system, the address space for the pages is reserved immediately (no other code can allocate memory in that address range), but actual physical memory is not allocated until we touch the pages.
So for this situation, we can just pick a comfortably large number for our array and virtually allocate it. Say 1 billion objects, for example:
#define MAX_OBJECTS 1000000000ULL
object_o **objects = virtual_alloc(MAX_OBJECTS * sizeof(object_o *));
We’re using 8 GB of address space and virtual memory, but only as much physical memory as we actually need for our objects. A simple solution that only requires one line of code.
Note: I’m using virtual_alloc() here as my OS-agnostic virtual memory allocation call. On Windows you would actually call VirtualAlloc() and on Linux mmap().
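A minimal sketch of such a wrapper might look like this (error handling omitted; VirtualAlloc() returns NULL and mmap() returns MAP_FAILED on failure):

#include <stdint.h>

#if defined(_WIN32)
    #include <windows.h>

    void *virtual_alloc(uint64_t size)
    {
        // Reserve and commit in one call. See the note below on why you may
        // want to separate these two steps on Windows.
        return VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    }
#else
    #include <sys/mman.h>

    void *virtual_alloc(uint64_t size)
    {
        return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
#endif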
Another note: Windows separates virtual memory allocation into separate MEM_RESERVE and MEM_COMMIT steps. MEM_RESERVE reserves address space and MEM_COMMIT commits it. That does not mean that physical memory is allocated when you MEM_COMMIT; the physical memory is still not allocated until you actually touch the pages. But MEM_COMMIT charges the memory against the page file, and if you try to commit more memory than you have in your page file, MEM_COMMIT will fail. So on Windows, you probably don’t want to MEM_COMMIT the entire lookup table (because you only have a limited size page file). Instead you want to MEM_RESERVE the whole table in the beginning and then just MEM_COMMIT the range of the table that you actually use.
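For example, the lookup table from before could be set up that way on Windows. This is a sketch with hypothetical names, not code from the post:

#include <windows.h>

object_o **objects;
uint64_t objects_committed;     // bytes committed so far

void objects_init(void)
{
    // Reserve address space for the whole table. No physical memory or page
    // file space is charged yet.
    objects = (object_o **)VirtualAlloc(NULL, MAX_OBJECTS * sizeof(object_o *),
        MEM_RESERVE, PAGE_NOACCESS);
    objects_committed = 0;
}

void objects_ensure(uint64_t n)
{
    // Commit enough of the reservation to hold n pointers. VirtualAlloc rounds
    // the range to page boundaries and re-committing pages is harmless.
    uint64_t needed = n * sizeof(object_o *);
    if (needed > objects_committed) {
        VirtualAlloc((char *)objects + objects_committed, needed - objects_committed,
            MEM_COMMIT, PAGE_READWRITE);
        objects_committed = needed;
    }
}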
In contrast, Linux allows overcommitting — i.e., you can commit more memory than the size of your page file. For this reason, Linux doesn’t need separate reserve/commit operations, which makes virtual memory a lot simpler to use.
Is reserving 8 GB of virtual memory for the array a problem? Not really. There are two limits that concern us here. The first is address space. In a 64-bit application (which we assume), the address space is 2^64 bytes, a staggeringly large number with room for billions of gigabyte-sized arrays. The second limit is for virtual memory. The OS typically doesn’t let us allocate the whole possible address space as virtual memory. On 64-bit Windows, for example, we can only allocate 256 TB of virtual memory. Still, that’s room for 32 000 arrays of 8 GB each, so as long as we don’t go totally crazy, we’re fine.
In fact, you could probably use this technique for all global or semi-global arrays in your application without any problems.
Remember the old-school C way of writing games with static arrays for all objects:
uint32_t num_tanks;
tank_t tanks[MAX_TANKS];
uint32_t num_bullets;
bullet_t bullets[MAX_BULLETS];
...
If you write code like this, you can be sure that someone will complain that it’s not “future-proof”, since you have caps on the number of objects. That is kind of ridiculous, but to satisfy the complaint in a simpler and more fun way than using a std::vector, you can just get rid of the MAX_* defines and allocate 1 GB of virtual memory for each array:
#define GB 1000000000
uint32_t num_tanks;
tank_t *tanks = virtual_alloc(GB);
uint32_t num_bullets;
bullet_t *bullets = virtual_alloc(GB);
Application-wide unique IDs
A lot of game engine systems need unique IDs to identify objects. Usually the code looks something like this:
uint64_t allocate_id(system_t *sys)
{
    return sys->next_free_id++;
}
But virtual memory gives us another option. Instead of using integers as identifiers, we could use addresses reserved from the virtual memory system. This allows us to create IDs that are unique, not just in the current system, but throughout the entire application. And since we never actually use the memory pointed to by these addresses, we won’t consume any physical memory.
It might look something like this:
system_id_t *allocate_id(system_t *sys)
{
    // Grab a fresh page of address space when the current one is used up.
    if (!sys->id_block || sys->id_block_used == PAGE_SIZE) {
        sys->id_block = virtual_alloc(PAGE_SIZE);
        sys->id_block_used = 0;
    }

    // Hand out the next address in the page. The memory behind it is never
    // touched, so no physical memory is consumed.
    return (system_id_t *)(sys->id_block + sys->id_block_used++);
}
Note that by using a pointer to an opaque struct for the ID, we also get some type safety that we didn’t have with the uint64_t approach.
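For example (with hypothetical names), each system can expose its own opaque ID type, and mixing them up becomes a compile error:

// The structs are never defined, so these pointers can only be stored,
// compared and passed around, never dereferenced.
typedef struct sound_id_o sound_id_t;
typedef struct entity_id_o entity_id_t;

void stop_sound(sound_id_t *id);

// stop_sound(my_entity_id);   // compile error, unlike with plain uint64_t IDs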
Memory overwrite detection
One of the hardest bugs to fix is a random memory overwrite, where some piece of code is writing to memory that it shouldn’t touch. Then, later, when some other piece of code tries to use that memory, it will read that trash data and (most likely) crash. The problem here is that while the crash itself is easy to find, it is tricky to know where the bad write that corrupted the data occurred in the first place.
Luckily, virtual memory can help us here too.
To see why, first note that the term random memory overwrite is actually a misnomer. Just like regular space, address space is mostly empty. With a 64 bit address space and an application size of say 2 GB, the address space is 99.999999988 % empty. This means that if the memory overwrites were truly random, most likely they would hit this empty space and cause a page fault/access violation. That would give us a crash at the point of the bad write, instead of the innocent read, which would make the bug much easier to find and fix.
But of course, the writes are usually not truly random. Instead they typically fall in one of two categories:
- Writing to memory that has been freed.
- Writing beyond the allocated memory for an object.
In both cases, it is pretty likely that the write will actually hit some other object, rather than empty space. In the first case, the memory will probably have been recycled for something else. In the second case, the write will probably hit a neighboring object, or allocation block headers.
We can make these cases look more like the random case by replacing the standard system allocator with an end-of-page allocator. Such an allocator uses the virtual memory system to allocate each object on its own set of pages, and furthermore, it aligns the object so that the end of the object coincides with the end of the last page.
This accomplishes two things. First, when we free the pages, we free them in the virtual memory system. This means that attempting to write to the pages after they have been freed will result in an access violation. Note that this typically doesn’t happen with a regular memory allocator — it will save the freed memory in its freelists to be used for other allocations, instead of returning it to the OS.
Second, as the end of the block coincides with the end of the page, writing beyond the end of the block will also cause an access violation. In other words, with this approach, our typical bad memory writes will crash at the point of the write and allow us to easily diagnose the problem.
Since this allocator rounds up all allocations to the page size, it will waste a lot of memory for small allocations. It is probably not something that you want enabled all the time. What I typically do is work with the standard allocator. Then, if I suspect bad memory overwrites, I switch it out for the end-of-page allocator. Once I’ve fixed the problem, I switch back to the standard allocator. This of course requires you to have an architecture where you can easily change which allocator your system is using.
Writing an end-of-page allocator is not complicated at all. Here’s what malloc looks like:
void *eop_malloc(uint64_t size)
{
    // Round the request up to whole pages and get them from the virtual memory system.
    uint64_t pages = (size + PAGE_SIZE - 1) / PAGE_SIZE;
    char *base = virtual_alloc(pages * PAGE_SIZE);

    // Place the object so that its end coincides with the end of the last page.
    uint64_t offset = pages * PAGE_SIZE - size;
    return base + offset;
}
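The matching free is just as simple. A sketch, assuming a virtual_free(base, size) counterpart to virtual_alloc() (VirtualFree() with MEM_RELEASE on Windows, munmap() on Linux):

void eop_free(void *p, uint64_t size)
{
    uint64_t pages = (size + PAGE_SIZE - 1) / PAGE_SIZE;
    // The returned pointer always lives in the first page of the allocation,
    // so rounding it down to a page boundary recovers the base address.
    char *base = (char *)((uint64_t)p & ~(uint64_t)(PAGE_SIZE - 1));
    virtual_free(base, pages * PAGE_SIZE);
}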
Note: There are still bad writes that could go undetected with this approach. For example, after we’ve freed the pages, a new page allocation could happen in the same memory range. Also, it is possible that another set of pages could be allocated directly after ours in memory, in which case we wouldn’t detect overwrites beyond the last page.
Both these problems can be fixed. For the first issue, we can leave the pages reserved, but not committed. That way, the physical memory is freed and we will get page faults, but the addresses are reserved and cannot be used by other objects. For the second issue, we could reserve an extra page after our pages, but not commit that page. No other object could then claim those addresses and writing to them would still cause an access violation. (Note: This only works on Windows, where reserve and commit are separate operations.)
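On Windows, a hardened variant of the allocator might look like this (again a sketch, not the author’s code): reserve one extra guard page that is never committed, and decommit rather than release on free so the addresses stay reserved:

#include <windows.h>

void *eop_malloc_hardened(uint64_t size)
{
    uint64_t pages = (size + PAGE_SIZE - 1) / PAGE_SIZE;
    // Reserve an extra page after ours so nothing else can ever be placed there.
    char *base = (char *)VirtualAlloc(NULL, (pages + 1) * PAGE_SIZE, MEM_RESERVE, PAGE_NOACCESS);
    VirtualAlloc(base, pages * PAGE_SIZE, MEM_COMMIT, PAGE_READWRITE);
    return base + pages * PAGE_SIZE - size;
}

void eop_free_hardened(void *p, uint64_t size)
{
    uint64_t pages = (size + PAGE_SIZE - 1) / PAGE_SIZE;
    char *base = (char *)((uint64_t)p & ~(uint64_t)(PAGE_SIZE - 1));
    // Decommit frees the physical memory but keeps the address range reserved,
    // so later writes through stale pointers still fault.
    VirtualFree(base, pages * PAGE_SIZE, MEM_DECOMMIT);
}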
In practice though, I’ve never needed to take these extra precautions. For me, the basic end-of-page allocator has always been enough to find and fix the bugs.
Fragment free memory allocation
On old console hardware, memory fragmentation could give programmers nightmares. But even on modern machines, memory fragmentation can be a big problem when trying to implement efficient memory allocators.
A memory allocator works by requesting big chunks of memory from the system and chopping them up into smaller pieces on request of the user. Memory fragmentation occurs when only some of those objects are freed, leaving holes in the used memory block:
The problem here is that if the user requests a big chunk of memory, none of the “holes” may be big enough. This means we have to allocate it from the free memory block at the end — increasing the memory use of the application. In other words, we have a lot of wasted memory that we can’t really use. In the olden days, memory was a precious resource. Today, it’s less so, but we still don’t want to waste it.
Memory fragmentation is a huge topic, but instead of diving too deep into it, I’m just going to look at it from the virtual memory perspective, which is pretty simple. When you are allocating from virtual memory, fragmentation is a non-issue.
The reason is that when we are using virtual memory, each page in address space is individually mapped to a page in physical memory. So if we create “holes” in physical memory by allocating and freeing pages, it doesn’t matter, because those “holes” can later be used to allocate what to us seems like contiguous memory. Let me try to illustrate it with a picture:
Here, we have first made the red allocations and freed some of them, leaving holes in address space and physical memory. However, this doesn’t prevent us from making the big purple allocation. Each page in the purple allocation can be mapped to one of the hole pages we created earlier, without the need for allocating any more physical memory. So nothing is lost to fragmentation.
Note that we still have fragmentation in the address space. I.e., we can’t make a big contiguous allocation in the address space where the red allocations are. But there is no fragmentation in physical memory. Fragmentation in address space doesn’t really worry us, because as I said before, address space is typically 99.999999988 % empty anyway, so we have no problem finding contiguous address blocks. (On 32-bit systems, it’s a different story.)
The drawback compared to using an ordinary allocator is that we have to round up the allocations to the page size, which typically is 4 K on modern systems. This rounding up means we are not using all of the available memory, a problem which is somewhat confusingly referred to as internal fragmentation.
There are lots of ways of dealing with this. For fixed size objects we can use object pools — allocate a page of memory and divide it into as many objects as will fit on that page, as sketched below.
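A minimal page-backed pool might look something like this (a sketch, assuming the virtual_alloc() wrapper and PAGE_SIZE from before, and objects at least the size of a pointer):

typedef struct pool_t {
    void *free_list;    // free slots, linked through their own memory
} pool_t;

void *pool_alloc(pool_t *pool, uint64_t object_size)
{
    if (!pool->free_list) {
        // Grab a fresh page and chop it into as many slots as will fit.
        char *page = (char *)virtual_alloc(PAGE_SIZE);
        uint64_t n = PAGE_SIZE / object_size;
        for (uint64_t i = 0; i < n; ++i) {
            void **slot = (void **)(page + i * object_size);
            *slot = pool->free_list;
            pool->free_list = slot;
        }
    }
    void *p = pool->free_list;
    pool->free_list = *(void **)p;
    return p;
}

void pool_free(pool_t *pool, void *p)
{
    *(void **)p = pool->free_list;
    pool->free_list = p;
}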
For dynamically growing buffers, we can just make the buffer size match the page size. This is a simple but interesting technique that I haven’t seen mentioned a lot. Say that you have an array of objects that are 300 bytes each. The conventional setup is that as you need more entries you grow the array geometrically, say by doubling it in size. So you go from 16 to 32 to 64 to 128 elements. Geometric growth is important because it means you only pay an amortized constant cost for adding an element to the array.
However, 16 * 300 = 4800. If you allocate that from virtual memory, you have to round it up to 8 K, wasting almost an entire page. But we can easily fix that. Instead of focusing on the number of elements, we just grow the buffer by multiples of the page size: 4 K, 8 K, 16 K, 32 K, … and then put as many elements as will fit in there (13, 27, 54, 109, …). This is still geometric growth, so the cost is still amortized constant, but now the internal fragmentation is just 150 bytes on average, instead of 2 K.
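A sketch of what that page-sized growth might look like (hypothetical helper, assuming the virtual_alloc() wrapper and PAGE_SIZE from before; freeing the old allocation is left out):

typedef struct array_t {
    uint8_t *data;
    uint64_t bytes;     // current capacity in bytes (a multiple of PAGE_SIZE)
    uint64_t n;         // number of elements in use
} array_t;

void array_grow(array_t *a, uint64_t element_size)
{
    // Double the size in bytes, not in elements: 4 K, 8 K, 16 K, 32 K, ...
    uint64_t new_bytes = a->bytes ? a->bytes * 2 : PAGE_SIZE;
    uint8_t *new_data = (uint8_t *)virtual_alloc(new_bytes);
    memcpy(new_data, a->data, a->n * element_size);
    // The old allocation should be freed here with the matching virtual free call.
    a->data = new_data;
    a->bytes = new_bytes;
}

// Capacity in elements is new_bytes / element_size: 4096 / 300 = 13, 8192 / 300 = 27, ...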
I find it a bit surprising that standard memory allocators don’t take more advantage of virtual memory to avoid fragmentation. Most allocators I’ve looked at still work by obtaining large chunks from the OS and chopping them up, instead of working with page sized virtual memory allocations.
Is getting larger chunks from the OS more efficient? I’m heading into territory where my knowledge is pretty sketchy here, but I don’t think so. Using a larger page size can be more efficient, because it means a smaller page table and better use of the TLB cache. But, given a fixed page size, I don’t think it matters if you have one big or many small virtual memory allocations, because the address resolve is done page-by-page anyway. And even if you allocate large chunks, consecutive virtual pages are often not consecutive in physical memory anyway.
There is probably some memory overhead for the OS to keep track of a large number of individual memory allocations. Also we need system calls to allocate and release the pages which will consume a little time. Maybe that’s the reason. Or maybe it’s just that general allocators are written to run in a variety of environments — on 32 bit systems or systems with large page sizes — so they can’t take advantage of the fragment free allocations you can get with 64 bit address space and 4 K pages.
Gapless ring buffer
This is a trick I found out about from Fabian Giesen’s blog, but the idea seems to have been around for longer.
A ring buffer is a very useful data structure for any situation where data is produced and consumed at different rates. For example, when reading from a socket, you may receive data in chunks that do not match your packet size, so you need to buffer it until you have received a full packet.
A ring buffer stores the data in an array of fixed size that “wraps around”, i.e. once the array has been filled we start writing data to the beginning of the array again.
In addition to the array we also need read and write pointers to know where in the buffer to read and write data. I like to use uint64_t counters for this, tracking the total number of bytes written and read, something like this:
enum {BUFFER_SIZE = 8*1024};

struct ring_buffer_t {
    uint8_t data[BUFFER_SIZE];
    uint64_t read;
    uint64_t written;
};
Note that since the buffer wraps around, we cannot let the write pointer run too far ahead of the read pointer, or it will start to trash data that hasn’t been read yet. Basically, BUFFER_SIZE controls how far ahead the writer can get. If the buffer is full, the writer must stall and wait for the reader. If the buffer is empty, the reader must stall and wait for the writer. (The number of bytes available to the reader is written - read.)
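A couple of tiny helpers (hypothetical, not from the original post) make these conditions explicit:

static inline uint64_t can_read(const ring_buffer_t *rb)
{
    return rb->written - rb->read;                   // bytes ready for the reader
}

static inline uint64_t can_write(const ring_buffer_t *rb)
{
    return BUFFER_SIZE - (rb->written - rb->read);   // free space before the writer must stall
}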
A really annoying part of ring buffers is that you need special code to handle the wraparound. If it weren’t for this, writing to the buffer would be a simple memcpy, but instead we have to be careful that we do the right thing when the write straddles the wraparound point:
void write(ring_buffer_t *rb, uint8_t *p, uint64_t n)
{
    uint64_t offset = rb->written % BUFFER_SIZE;
    uint64_t space = BUFFER_SIZE - offset;
    uint64_t first_write = n < space ? n : space;

    // Copy the part that fits before the end of the buffer, then wrap around
    // and copy the rest to the beginning.
    memcpy(rb->data + offset, p, first_write);
    memcpy(rb->data, p + first_write, n - first_write);
    rb->written += n;
}
As annoying as this is, reading is even worse. If it weren’t for the wraparound, reading wouldn’t even require any copying at all — we could just use the data directly from the buffer — but with the wraparound, we need two memcpy() calls to move the data to a contiguous memory block for further processing.
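For completeness, a wraparound-aware read that copies into a caller-supplied scratch buffer might look like this (a sketch mirroring the write() above); the virtual memory trick below makes this copy unnecessary:

void read_wrapped(ring_buffer_t *rb, uint8_t *out, uint64_t n)
{
    uint64_t offset = rb->read % BUFFER_SIZE;
    uint64_t space = BUFFER_SIZE - offset;
    uint64_t first_read = n < space ? n : space;

    // Copy the part up to the end of the buffer, then the part that wrapped around.
    memcpy(out, rb->data + offset, first_read);
    memcpy(out + first_read, rb->data, n - first_read);
    rb->read += n;
}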
How can virtual memory help here? We could use the “huge array” technique and just reserve a huge array instead of the ring buffer, committing pages as the writer advances and decommitting them as the reader advances. With this we wouldn’t even need to set a fixed size for the array — it would just use as much memory as needed. Pretty ingenious. But note that you might need a truly huge array. To buffer a 1 Gbps network stream with an uptime of a year, you would need to reserve 4 PB (petabytes) of address space. Sadly, as we saw above, 64-bit Windows caps the virtual memory at 256 TB. Also, those commit and decommit calls aren’t free.
But we can fix this in another way, by taking advantage of the fact that multiple virtual pages can be set up to resolve to the same physical page. This is usually used to share memory between different processes, but we can also set it up within a single process. To use this for the ring buffer, we set up a set of pages immediately after the ring buffer that point to the same physical memory:
As you can see, with this setup the virtual memory system takes care of the wraparound for us. In the address space we can always find a contiguous copy of the memory in the ring buffer, even though the buffer is split up physically. The read and write functions become simply:
void write(ring_buffer_t *rb, uint8_t *p, uint64_t n)
{
    // The mirrored pages after the buffer take care of any wraparound.
    memcpy(rb->data + (rb->written % BUFFER_SIZE), p, n);
    rb->written += n;
}

uint8_t *read(ring_buffer_t *rb, uint64_t n)
{
    uint8_t *p = rb->data + (rb->read % BUFFER_SIZE);
    rb->read += n;
    return p;
}
A lot nicer, and we’re still using the same amount of (physical) memory.
Note that setting up this memory layout can be a bit tricky. On Windows you have to create a file mapping into virtual memory with CreateFileMapping() — yes, even though no files on disk are involved, we still need a “file mapping” because that’s how virtual memory is shared. Since we don’t need a file on disk, we use INVALID_HANDLE_VALUE for the file handle, which creates a mapping backed by the page file. Then we use MapViewOfFileEx() to map that file mapping into two consecutive memory regions. Unfortunately, there is no way of guaranteeing that the memory regions we pass are available. We can reserve them and then free them just before calling MapViewOfFileEx(), but that still leaves a window of time where, if we’re super unlucky, someone else could come and allocate something in that address range. We might have to retry the mapping a few times before it succeeds, but after that setup we can use the buffer without worrying about it.
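To make that description concrete, here is a rough sketch of the Windows setup. It is not the author’s exact code: error handling and the retry loop are omitted, and the buffer size must be a multiple of the allocation granularity (typically 64 K) for the back-to-back views to line up.

#include <windows.h>
#include <stdint.h>

uint8_t *create_mirrored_buffer(uint64_t size)
{
    // Page-file backed mapping of `size` bytes (no file on disk involved).
    HANDLE h = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
        (DWORD)(size >> 32), (DWORD)(size & 0xffffffff), NULL);

    // Find a free address range big enough for two views, then release it.
    // This leaves the small window mentioned above where another allocation
    // could steal the range, so real code retries on failure.
    uint8_t *base = (uint8_t *)VirtualAlloc(NULL, 2 * size, MEM_RESERVE, PAGE_NOACCESS);
    VirtualFree(base, 0, MEM_RELEASE);

    // Map the same physical memory twice, back to back.
    uint8_t *p1 = (uint8_t *)MapViewOfFileEx(h, FILE_MAP_ALL_ACCESS, 0, 0, size, base);
    uint8_t *p2 = (uint8_t *)MapViewOfFileEx(h, FILE_MAP_ALL_ACCESS, 0, 0, size, base + size);

    // If either view failed, unmap, pick a new candidate address and try again.
    return (p1 && p2) ? p1 : NULL;
}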
And that’s all from me
Do you know of any neat virtual memory tricks not mentioned here? Tweet them to me at @niklasfrykholm.