Dig Shallow Graves
Recently I’ve been trying to help our interns with API design and I figured I should write something about it.
If I have expertise in anything, it’s probably API design. This is kind of frustrating because it is not very tangible. Like if someone is good at optimizing microcode for x86 processors or something like that, you can point to it and say look, this is what she’s good at. But API design is so much fuzzier. What even is a good API?
Not that I can’t code, or optimize, or debug — I’m pretty good at that too! In fact, I don’t think you can be a good API designer if you don’t code a lot. An API isn’t some abstract, ivory tower thing. It’s a user interface for programmers. And how are you supposed to design that user interface if you never use it?
What is a good API?
So, to get back to the question: What is a good API?
A good API is a good user interface for programmers. What does this mean?
- It’s easy to understand how to use the API.
- The code that uses the API is simple and straightforward.
- The API helps you in writing code that is performant, correct, and bug free.
- The API is pleasant to work with. (Sparks joy!)
Terms like “easy”, “simple” and “pleasant” are of course subjective. What is easy depends on what you have encountered before. Since most programmers are familiar with for-loops, an API designed around them might be easier than one using other forms of iteration. But I don’t think everything about APIs is subjective. Bad APIs can feel unpleasant to work with in an almost visceral way, and they tend to share some common traits:
- Functions take lots of arguments — either directly or in structs — and behave very differently depending on what arguments you pass. Some of the arguments are only used in some of the call modes.
- You need multiple lines of setup before you can call the function that does the thing you want to do and multiple lines of cleanup afterward. It is not very clear exactly what setup and cleanup is needed and what “state” the system is in.
- The documentation is confusing and mentions lots of special cases.
Similarly, there are signs of a good API:
- Each function has a clear single purpose and few arguments. There are no confusing options that fundamentally change what the function does.
- Calling a function is a one-liner. If any setup is needed, it is clear what needs to be done and hard to do it wrong.
- The documentation tends to be short because it is obvious what the functions do and they don’t have a lot of special cases.
- Simple things are simple, complicated things are possible.
Another important property of an API is that it is easy to implement.
This may be a bit controversial. After all, an API is an interface — why should we care what the implementation looks like? Shouldn’t we just make the best possible interface regardless of how complicated it is to implement?
I think there are two counterarguments. First, you cannot view any single software system in complete isolation. Every line of code added somewhere increases the total complexity. It is not just the cost of implementation, it also needs to be maintained, upgraded, documented, refactored, learned, etc. The more complex something is, the harder all of this becomes.
Second, the purpose of an API is to abstract the underlying implementation. But the abstraction is almost never 100 % perfect. One way or another, some properties of the implementation leak through. This can be things like:
- Debugging. You call the API, but you get an error, a crash, or some other unexpected result. Now you need to go into the implementation to figure out why you got that result. (You could of course also avoid doing that and just try to debug the system as a black box, but that’s orders of magnitude harder.)
- Performance. You call the API, but you get worse performance than you expected or needed. Now you have to go into the actual implementation to try to understand why the calls are so expensive.
When the implementation of the API is simple, doing this kind of deep digging is easy. There is relatively little code in the implementation and it is easy to see where something goes wrong or where time is being spent. The concepts in the implementation match up with the concepts in the API, making the transition from interface to implementation straightforward.
If the implementation is complex, has multiple layers, and different concepts than the API, this is much harder.
Let’s look at some practical tips for API design.
Know your domain and your users
When you design an API you design a way of thinking and reasoning about a problem in a structured and precise way. To be able to do that you need to understand the problem domain. For example, you can’t design an API that handles colors unless you understand RGB color theory, gamma spaces, etc.
You must also understand your users. What are their needs? What are they trying to do with your API? What’s their background knowledge? You need to be able to put yourself in their shoes — empathize with them.
The best APIs are almost educational, serving as a bridge between the users’ ideas of what they want to do and your technical knowledge of how it can be achieved.
Sometimes you can achieve this by borrowing already established domain concepts and terminology. If you can, that’s great, because your users might already know them. But often you have to invent your own completely new concepts. This is the creative part of API design, which, at least for me, is a big part of the fun.
So understanding the users and the domain is paramount, but how do you acquire that understanding? That brings me to the next point.
Minimize planning
There’s a lot of programming advice out there that says that you should plan out and fully design your system before even thinking about writing any code. If you follow this school you work in distinct phases. First, you learn about the domain, then you create the system design, and finally, you implement it.
I guess this style might work for some people, but I find it hard to learn and plan in a vacuum. I prefer to approach the design as a dialogue. I learn a little about the domain, design a single feature, implement it, then go back and learn about the next thing I want to do. As my understanding of the domain and the challenges improves, I need to go back and revise my earlier decisions.
I think this is crucial. To know how an API design works, you have to get in there, try it out, see how it feels. As you get more experience in the space, your understanding improves and this helps you make better decisions.
Consider this: the very start of a project is when you have the least information and knowledge about the problem domain; every moment you spend working on it gives you more experience. So why should you make all the plans and decisions at the beginning, when you have the least information?
Dig shallow graves
Instead of planning everything out I pick one small task that the system needs to do and implement it. I try to pick the smallest possible task that I can think of that would still somehow be useful to the end-user. I find it valuable to implement “real features” like this instead of just building out low-level functionality because it tests how well the system integrates with everything else and gives a fairly good picture of how the API works in the “real world”.
After it is done, I take a step back and look at it to see if there are any obvious design issues that need to be fixed before moving on to the next task.
During this process, I fully expect that I will make mistakes and have to backtrack. So an important part is to make sure that if I dig myself into a hole with the design — which I’m likely to do at some point, regardless of how much planning I do — it is a pretty shallow hole that I can easily get myself out of. By taking small steps, one at a time, I never have to backtrack too far — typically the full circle of picking a feature, planning it, and implementing it is done in just a day or two.
Digging shallow graves also means trying to avoid design choices that are hard to back out of if they should prove problematic. Two situations where I’ve seen this happen are:
- Basing your implementation on a third-party library that does not cover all your use cases (can’t do feature X, doesn’t work on platform Y, etc.). If you need to change the technology your implementation is built on, you typically need a full rewrite.
- Not considering performance. A system designed and tested with only 10 elements often becomes unbearably slow when run with 10 000 elements. Sometimes this can be fixed with a few targeted optimizations, but often improving the performance by several orders of magnitude requires a complete redesign.
Complexity comes into play here too. The more complex your implementation is, the harder these necessary refactors and rewrites will be, and, consciously or unconsciously, you will start to resist them.
Let’s see how this works in a real-world example.
We are currently working on adding debugging support to our visual scripting system. This is a big and complex task with lots of intricate pieces — watch windows, breakpoints, single-stepping, etc. Planning all of this upfront would be daunting.
So instead we start with a minimally useful task: having nodes in the graph light up as they are being triggered. This gives the user a quick at-a-glance picture of how the graph is running.
Given this task, we design the simplest possible API we can think of:
struct tm_graph_debugging_api {
    // Returns the node's activity level. The activity level is 1 when the
    // node is first activated and then fades to 0 over time.
    float (*node_activity)(tm_tt_id_t node);

    // Marks the node as active in the current frame.
    void (*set_node_active)(tm_tt_id_t node);
};
The graph view in the editor calls node_activity() to check the activity level of a node and highlight the active nodes. The graph interpreter calls set_node_active() when a node is triggered to mark it as active.
This API is so ridiculously simple that implementing it is almost trivial. But by doing it, we’ve not only added a useful feature to the editor, we’ve also taken a small step towards having a fully-featured graph debugger. Once this is done, we can pick the next small task, such as displaying the values that are being transmitted on the graph wires.
This simple API might not hold up in the long run. In fact, we’ve already made changes to it. We decided to have it return the time when the node was last activated instead and have the caller interpret that as an activity level:
// Returns the time at which the `node` was last activated.
tm_clock_o (*node_active)(tm_tt_id_t node);
This change is a tradeoff. We’ve made the API more flexible (the caller can now control the fade-out time of highlights) at the cost of making it more complicated to use. (To get the highlight, the caller needs to subtract the returned value from the current time and convert that elapsed time into a highlight value.) These kinds of tradeoffs are inevitable and as the designer, you have to make a call on what is the right choice for your API.
You might think you can get away from the tradeoff by putting both functions in the API. But that is just another kind of tradeoff. This time the tradeoff is a more complex API with more functions that the user has to understand.
We fully expect to make additional changes. For example, we need some way of telling which graph is being debugged and that may change the API further. But I think it is always best to start as simple as possible. It’s always easier to add complexity than to try to take it away.
Test your API and fix your bugs
Related to this, make sure that you test your APIs, by actually using them in production code, and fix any “bugs” that you or others discover.
Bugs in implementations happen when your code has a special case that you didn’t really think about, so it computes the wrong value or crashes.
Bugs in APIs happen when the user of your API doesn’t understand how it’s supposed to be used, so they call it in the wrong way and it computes the wrong value or crashes.
I would venture that bugs in APIs are a bigger issue than bugs in implementations. They are harder to detect and fix so they tend to hang around longer. And a lot of the time, what looks like an implementation bug is just a bug in the interface below it (insufficient documentation, confusing behavior, etc).
A big problem is that programmers often don’t report bugs in APIs. Instead, they just figure out what they did wrong, fix the problem, and go about their day. If you want to catch “bugs” in your API you have to listen carefully to what problems the users are having and modify your API to fix or alleviate those problems.
Sometimes that user is you! I think as programmers we’re so trained to “solve problems and move on” that we sometimes miss the chance to fix an underlying issue. Even when the issue is in our own code and we’re the ones running into it!
Also, note that I say fix or alleviate. Another common trait of programmers is that we often think in binaries and absolutes. If we can’t do something that 100 % guaranteed fixes the issue in every possible case, we say that the problem is logically impossible to solve and do nothing.
I propose to instead think in terms of ergonomics and harm reduction. If we can do something that reduces the risk of misuse of our API by 95 %, we will have saved our users a ton of frustration and headache even though the possibility of misuse is still there. It might be something really simple, like improved documentation or a small change to the API arguments that nudges the user in the right direction.
As an example, I noticed multiple bugs where we were doing enter_critical_section() but forgetting to clean up with leave_critical_section(). If we were using C++, I would have used a stack object with a destructor to clean up the critical section, but since we are using C, I can’t really do that.
So instead I introduced the following macros:
#define TM_OS_ENTER_CRITICAL_SECTION(cs) \
    uint32_t TM_OS_LEAVE_CRITICAL_SECTION_is_missing; \
    tm_os_api->thread->enter_critical_section(cs);

#define TM_OS_LEAVE_CRITICAL_SECTION(cs) \
    (void)TM_OS_LEAVE_CRITICAL_SECTION_is_missing; \
    tm_os_api->thread->leave_critical_section(cs);
Now, if a scope has TM_OS_ENTER_CRITICAL_SECTION(cs) but not TM_OS_LEAVE_CRITICAL_SECTION(cs), the compiler will print a warning that the variable TM_OS_LEAVE_CRITICAL_SECTION_is_missing is unused.
This isn’t perfect; it’s still possible to forget to close the critical section with code like:

{
    TM_OS_ENTER_CRITICAL_SECTION(cs);
    if (bla)
        return;
    TM_OS_LEAVE_CRITICAL_SECTION(cs);
}
But we have still significantly reduced the risk of this happening.
We could add additional steps of mitigation to this too. For example, we could record the __FILE__ and __LINE__ whenever we enter a critical section and dump those values if there are any open critical sections when we reach the main loop.
I think the important part here is that we are not trying to make a perfect design of what a critical section API should look like. Instead, we’re looking at the issues our users (programmers) have with the API and what we can do to fix them. I.e., if none of our users ever forgot the leave() call, the macros above wouldn’t be necessary. They are a reaction to the actual problems users are having.
Postpone hard problems
Sometimes in API design, you will run into a Hard Problem. A hard problem is any problem that doesn’t have a simple and straightforward solution, leaving you unsure of what the right thing to do is.
Hard problems are dangerous in two ways. First, they can be very time-consuming, taking weeks, months, or years to solve. That might be fine if you have a salaried position at some research institute, but not if you are trying to build a complete game engine with a small team. Second, if you force yourself to implement a solution to a problem when you can’t see a clear and simple solution, there’s a chance that you will implement a muddy and complex solution that reflects your own lack of understanding of the problem. This bad solution will be a millstone around your neck going forward.
So when dealing with hard problems we want to be sure of two things. First, that we only solve the absolutely necessary ones. We only have so much time and energy to spend and we want to make sure we put it in the right place. Second, that we don’t go ahead and implement a complex and convoluted solution to a problem because we missed a simple and straightforward one.
So how do we do that? My favorite approach is to “do nothing”. I.e., instead of actually solving the problem I just put a sticker on it that says “Fix this later”. Surprisingly often, this Zen-like approach will actually solve the problem in one way or another:
- Sometimes, while I’m taking a walk or a shower, I’ll come up with a simple and straightforward solution.
- Sometimes, as I work on other parts of the system and my understanding of the problem space deepens, I realize that actually there is no problem. It only seemed like there was a problem because my understanding was incomplete.
- Sometimes, I never solve the hard problem and that’s OK. It turns out the system is still useful and provides value even if it doesn’t handle every single imaginable thing.
Postponing hard problems can feel irresponsible, but responsibility isn’t about solving every imaginable problem. It’s about using your time and resources where it matters. You also have to be careful about over-design. Over-design happens when you try to make your API cover every possible use case instead of focusing on the issues that the majority of users care about (there are some great examples in the newer C++ standard libraries).
Let’s look at a concrete example of postponing a hard problem to see how it might work in practice. Consider the Graph Debugging API again. We know that at some point we need a way for the user to specify what to debug. The engine can have multiple Simulation tabs running, and each of those tabs can have multiple enemy entities running around in them, each one running an instance of the enemy script. Now, what if the user wants to debug one specific enemy in one specific simulation to see why it’s behaving erratically? There needs to be some way to “connect” the graph view to a specific entity for debugging, but how should that work?
Obviously, there could be some drop-down menu with a search field where the user could navigate first to a simulation tab and then to a particular entity in that tab… by its entity ID? But that doesn’t seem very elegant. So instead of banging our heads against this issue or implementing something that doesn’t feel great, we just decided that… we’ll fix it later! Maybe we’ll stumble on a better solution along the way. If not, at least we will have a better idea of how we would want things to work after playing around with the system some more.
A small word of caution with regards to postponing hard problems. As mentioned earlier, we want to make sure that we “dig shallow graves”. Make sure you don’t postpone a hard problem that is an essential part of your system in such a way that when you do implement it, everything else needs to be rewritten.
Modifying and versioning APIs
Above, I’ve talked a lot about modifying, redesigning, and evolving APIs. I envision most of this happening during the development of the API, before it is actually released to the public.
But I think you should also have a mechanism for evolving your API after it has been released. As your API gets some actual use, you will undoubtedly discover more design flaws, things confusing the user, and other things you want to fix.
The mechanism can be as simple as deprecating old functions in the API and replacing them with new versions. There are lots of other options too, but that could be an entire article in itself. I just want to emphasize that you should have a path for evolving your API after the release. As always, the most important thing with code is that it is easy to change and you don’t want to be locked in forever by a bad API design choice.