How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?
-- Sir Arthur Conan Doyle (1859-1930), Sherlock Holmes in The Sign of Four, 1890
Warnings indicating non-determinism
I'm a real nut when it comes to warnings. I absolutely love them when I compile code. They help keep me honest and stop me from taking shortcuts through the codebase. Unused variables? Fine, those should really be taken out or marked as unused. Making a codebase compile cleanly when it previously was not built with warnings turned on is a pain. No, really. I mean a pain. It can be done, but there is of course a cost associated with it. Prepare to have a programmer spend at least a couple of weeks fixing (not suppressing) the warnings, since they can sometimes reveal some pretty darn bad decisions. Well, the returning reader probably already knows of my warning craze.
I once worked on a project where it was deemed that warnings were simply an annoyance. That might have been fine, except the team also practiced advanced C++ [1]. Too advanced in my opinion; member function pointers to virtual functions in multiply inherited classes have made me run for the woods ever since. I did spend a large amount of time on that project cleaning it up for the move to a new compiler, which of course handled things slightly differently and exposed tons of bugs (quite a few of them in the multiple inheritance cases; now that was fun).
One of the most important warnings is the "uninitialized variable" warning. This is the source of many an evening lost by programmers searching for seemingly innocent bits making their way into the data pipe. In debug builds the uninitialized variables are often initialized to some bogus pattern by either the compiler or the OS: usually you can have the compiler clear the backing stack memory for you, and the OS typically clears memory you allocate out of the heaps to well known values. This doesn't always happen though, and then "random" bits can appear in your variables. Worst of all is when you make logic choices on uninitialized variables, since that can change potentially huge parts of the code flow.
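To make that concrete, here is a contrived sketch (the names are made up) of the kind of bug the warning points at. Only one code path ever writes the variable, so the result, and any logic built on top of it, depends on whatever garbage happened to be in that stack slot:

#include <vector>

struct Object { bool isVisible; };

int countVisibleObjects(const std::vector<Object>& objects)
{
    int visible;                    // warning: 'visible' may be used uninitialized
    for (const Object& object : objects)
        if (object.isVisible)
            ++visible;              // increments whatever garbage was on the stack
    return visible;                 // random bits now feed any logic done on the count
}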
Deterministic programming
Determinism in the game itself can be used to reproduce bugs more easily, to gather performance data and even for gameplay itself. In modern games it does take a little bit of work, especially if you have not had continuous tests during development to catch errors like the game diverging. Recently I've had to do this as part of our graphics tests at work, where we run the game in a special test mode and make sure that recent changes have not broken any rendering features. While the system in its half working state has already caught a surprising number of bugs, the real challenge is to make it reliable. For a long while the natural inclination was to assume that the test system was incorrect in some way and that we had false alarms on our hands. This causes a bad feedback loop where the tests are assumed to be bogus and eventually people stop paying attention to them. It's a little bit like the broken window syndrome. So naturally, I spent a lot of time making the system 99% trustworthy: when it reports a failure, it should almost certainly be an actual breakage.
It turns out, though, that a lot of effort had to be put into making the tools deterministic. With a few cheats in the runtime, like making the loading code non-async and accepting spikes during the game frames, it's pretty easy to make the game itself deterministic. That is, with a given set of inputs the output is always the same. In our case the input is a level and the output is a screenshot (or multiple ones). Yet every now and then we saw some alpha blended rays of light get drawn in a different order. Which of course should never happen. Yet it did.
Of course this failure happened only every now and then on the test server itself. Trying to reproduce it locally revealed that, given the same input (our level files), the game indeed always produced the same images. So either there were cosmic rays hitting the test server, or just maybe the level files were not the same if we built the same level twice? A quick binary diff revealed that of course they were not! Since we have a pretty simple IFF loading format, it was pretty easy to write a small Python script that could binary diff the chunks themselves and pinpoint exactly what was wrong. Unfortunately there were a lot of differences, many more than I first expected. I set out to fix most of these, and here are some of the findings.
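The actual script was Python and our chunk layout differs in the details, but a minimal sketch of the idea, written here in C++ and assuming a simplified tag-plus-32-bit-size chunk header, could look like this:

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// One chunk: a four character tag, a 32 bit payload size, then the payload.
// This assumes a simplified little-endian, tag-plus-size layout; the real
// format differs in the details.
struct Chunk
{
    char tag[5];
    std::vector<char> payload;
};

static std::vector<Chunk> readChunks(const std::string& path)
{
    std::vector<Chunk> chunks;
    std::ifstream file(path, std::ios::binary);
    for (;;)
    {
        Chunk chunk = {};
        uint32_t size = 0;
        if (!file.read(chunk.tag, 4)) break;
        if (!file.read(reinterpret_cast<char*>(&size), 4)) break;
        chunk.payload.resize(size);
        if (size != 0 && !file.read(chunk.payload.data(), size)) break;
        chunks.push_back(std::move(chunk));
    }
    return chunks;
}

int main(int argc, char** argv)
{
    if (argc != 3) { std::printf("usage: chunkdiff fileA fileB\n"); return 1; }
    std::vector<Chunk> a = readChunks(argv[1]);
    std::vector<Chunk> b = readChunks(argv[2]);

    size_t count = a.size() < b.size() ? a.size() : b.size();
    for (size_t i = 0; i < count; ++i)
        if (std::strcmp(a[i].tag, b[i].tag) != 0 || a[i].payload != b[i].payload)
            std::printf("chunk %zu (%.4s) differs\n", i, a[i].tag);

    if (a.size() != b.size())
        std::printf("different chunk counts: %zu vs %zu\n", a.size(), b.size());
    return 0;
}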
Sources of non-determinism
Padding
Ten years ago alignment was not such a subtle issue: on proper systems, apart from x86, you either had your data aligned or the processor dumped core! Today we have more subtle alignment requirements; the ALTIVEC load instructions, for example, often just mask off the lower 4 bits of the address silently. So it doesn't crash, but you will most often not wind up with the values you expected in the vector (oh, did your light calculations go off? It's kind of dark here, hmm). On the PS3 there is a quirk where properly aligning your data can give you a 2x speed boost in DMA transfers. In short, alignment really is all around in the code, and often, in order to achieve the alignment we want, we pad out structures.
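A made-up example of where those padding bytes come from (the layout is hypothetical, but the compiler really does insert invisible bytes like this):

#include <cstdint>

// Hypothetical export structure. The compiler inserts three bytes of padding
// after 'flags' to keep 'position' on a 4 byte boundary; those bytes exist in
// memory (and on disk, if you write the struct raw) but are never assigned.
struct ExportedLight
{
    uint8_t flags;        // 1 byte
                          // 3 bytes of padding inserted here by the compiler
    float   position[4];  // x, y, z, w
};

static_assert(sizeof(ExportedLight) > sizeof(uint8_t) + sizeof(float) * 4,
              "the struct is bigger than the sum of its members");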
Couple that padding with writing data structures straight down to disk, for later reading in your engine, and you suddenly have random bits going down to disk! A better idea is to make sure that the padding bits are always initialized to some known value. You only need to do this in the tools, of course.
The same applies to, for example, vectors where the W component needs to be 1 in some cases: just ensure that the component is 1 in the tools and don't worry about it in the engine.
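A minimal sketch of the tool-side fix, reusing the hypothetical ExportedLight structure from above: clear the whole thing before filling it in, and force the don't-care W component to 1 while you're at it:

#include <cstdint>
#include <cstring>
#include <fstream>

struct SourceLight { float x, y, z; uint8_t flags; };   // hypothetical tool-side data

void writeLight(std::ofstream& out, const SourceLight& source)
{
    ExportedLight exported;
    std::memset(&exported, 0, sizeof(exported));   // padding bytes are now deterministic

    exported.flags       = source.flags;
    exported.position[0] = source.x;
    exported.position[1] = source.y;
    exported.position[2] = source.z;
    exported.position[3] = 1.0f;                   // W is always 1, decided here in the tool

    out.write(reinterpret_cast<const char*>(&exported), sizeof(exported));
}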
One thing that makes changes like this easy is if you've split your codebase to have separate structures for the tools and the runtime. I'm a firm believer in a clear separation between the two; apart from magic constants I usually don't try to share much of the data structures between them. Treating them as two separate platforms is usually a very good idea.
Really, the tools have different needs and often less information about the structures than the runtime. It's common to just write down metadata about, for example, the models before the actual data, enabling the runtime to be spoon fed all the necessary data. Really, the dumber the runtime is, the faster it often runs :) And this applies to load times on consoles as well; we're all chasing the zero load time experience (or I think you should be), so that the player actually spends more time playing the game and less time waiting for data to come in from the disc.
But that also means that you need to move the smarts into the tools. Often this means storing more data than is absolutely necessary to handle playback in game. Sometimes this is not a huge deal though; judge from case to case.
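A hypothetical illustration of what spoon feeding the runtime means: the tool writes a small header with all the counts and sizes up front, so loading becomes a couple of dumb reads with no parsing at all. The names and layout here are made up:

#include <cstdint>
#include <fstream>

// Written by the tool, read by the runtime. All the decisions (stride,
// packing, ordering) were made at build time; the numbers here just say
// how much to allocate and how much to read.
struct ModelHeader
{
    uint32_t vertexCount;
    uint32_t indexCount;
    uint32_t vertexStride;      // bytes per vertex, decided by the tool
    uint32_t payloadSize;       // vertices plus indices, so one allocation is enough
};

// Runtime side: read the header, make one allocation, read the rest.
char* loadModelPayload(std::ifstream& file, ModelHeader& header)
{
    file.read(reinterpret_cast<char*>(&header), sizeof(header));
    char* payload = new char[header.payloadSize];
    file.read(payload, header.payloadSize);
    return payload;
}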
Guard variables
These are values that are potentially unused and guarded by some other (correctly initialized) variable. Sometimes you see code like this:
if (bIsLightEnabled) { color += lightColor; }
If lightColor isn't initialized at all times, including when the guard variable bIsLightEnabled is set to false, this can send you chasing the proverbial red herring (I like pickled herrings, but not this kind). Here it's not a logical error to leave "lightColor" uninitialized some of the time, but it is bad practice: the logic might be changed by some other author, and, once again, it is a source of potential non-determinism.
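A hedged sketch of the fix on the tool side (the names are made up): fill in lightColor unconditionally, so the bytes that reach the disk are the same from build to build even when the light is disabled:

struct Color { float r, g, b; };
struct SourceLight { bool enabled; Color color; };
struct ExportedLightState { bool bIsLightEnabled; Color lightColor; };

// Tool-side export: lightColor is written even when the light is disabled.
// The value is never read in that case, but it is always deterministic, and a
// later logic change in the engine can't suddenly expose garbage.
void exportLight(const SourceLight& source, ExportedLightState& out)
{
    out.bIsLightEnabled = source.enabled;
    out.lightColor = source.enabled ? source.color : Color{0.0f, 0.0f, 0.0f};
}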
Dumb assed mistakes
Sometimes there are just no excuses :) The code simply didn't work, but appeared to: it never initializes the variable, and we use it in the engine anyway. This is not uncommon when the writing of the variable and the reading of it are far removed from each other, one in the tool and the other in the runtime. Here compiler warnings are the winning concept.
In closing
There is very little magic inside computer programs. Randomness is usually introduced explicitly, or through some fairly easy to spot mechanisms (reading the time, network influences, hardware random generators, relying on threading, std::rand!). Actually, there is a whole research area concerned with introducing randomness (as applied to cryptography, and those guys don't kid around; ::rand will not be accepted). Really simplified, much akin to approximating the proverbial cow with a sphere, a computer program can be seen as just a black box transform with no internal state. Given the same input, it will produce the same output. Of course, it might not always be easy to identify those inputs, nor to guarantee that they are the same.
When dealing with the test machine at work, we had to handle not only the runtime of the game, but also the tools and what happens from run to run, runs which could be hours or even days apart. Writing down something as simple as a 64 bit integer representing the current time could cause drift to creep into the runtime.
However, now that we've cleared that stage, the next level promises fewer errors in the renderer, fewer broken shaders and more confidence that the renderer works. Which is not such a bad thing, since the renderer really is the core of the game, one of the components without which it's very hard to work at all... how are you testing your renderer?
Footnotes
[1] As opposed to the devil's C++, straight from the red devil book itself.