Somewhat to my surprise, the STL matched the set of criteria for a good set of standard containers for C++ that I had developed over the years.
-- Bjarne Stroustrup

Introduction

STL has an unfairly bad reputation in games development circles in my opinion. People that got burnt on the old HP stl that shipped with gcc 2.7.x series was on many platform the norm and that particular implementation coupled with the poor compiler probably left a sour taste for many developers. I think STL as many other things has a place in C++ development, for tools certainly it's a productivity boost and as many of us run their tools on the win32 platform, most of the concerns that we have on consoles are not the dominant part of execution time.

But I think there is a place for STL on the consoles as well, while people might know about std::map and std::string as the most prominent actors for STL, just plain std::vector will often win the day with ease of use and fairly predictable behavior. Not to forget are the algorithms that are part of STL, they are usually much more efficient and robust than your average programmer will produce under time constraints. This article will handle one particular case where we need an associative container, i.e. a key/value pair which we can do lookup by the key with. In other words std::map. The problem with std::map is that it can be quite slow.

Just plain old map.

The std::map is a great little container. For making associative tables at runtime it's ideal in most cases. Underneath there is a AVL tree. Plugging in a custom allocator for fixed sized elements will yield pretty good results in allocation speed as well. For regular tools use, maps are pretty good alternatives when it comes to fast development time and flexibility at runtime.

Looking up elements in the map is basically going to be a tree traversal with a compare at each node, now the complexity of your key is crucial. A std::string will for example kill your performance. Iterating over a tree to just visit all nodes is pretty costly. Basically you need to do a recursive traversal, which can be made iteratively if you use some sort of stack structure. Let just leave it at it's a little bit more complicated than just adding something to a pointer.

I will go to such lengths to state that iterating over large std::maps will cause you a large performance hit. Compared to what? Well, just iterating over plain old C-style arrays or std::vectors with the same information.


The std::vector.

Typically in a game you have the following phases:

  1. Initialization
  2. Usage of static structures
  3. Destruction

Say for example I've got a completely generic system where I store all my game objects in a std::map< std::string, GameObject >. At the beginning I'll create all of them and then during runtime I'll just iterate over them to render them and update them. I really don't care about order at this point, I just want to make sure that I can call a function on all the items in the map. If I convert this to be a std::vector instead I'll actually significantly improve the time spent iterating over the elements each frame. Looking at the following listing that roughly shows the number of instructions needed to just loop over both containers we see that the vector case contains significantly less code. Actually the vector case can be made even simpler since I've just used the index operator directly for std::vector and for some reason my STL vendor decided to add something extra in there. Since I know what I'm doing (famous last words), I can simply take the first element address and index that one instead, thus avoiding STL completely.

 
004011D0  mov         edx,dword ptr [eax] 
004011D2  add         dword ptr [acc (405424h)],edx 
004011D8  add         eax,8 
004011DB  sub         ecx,1 
004011DE  jne         runVectorTest+130h (4011D0h) 		
        
Listing 1: Inner loop for iterating over a std::vector.
 
00401332  mov         esi,eax 
00401334  cmp         edi,ebx 
00401336  je          runMapTest+0D0h (401340h) 
00401338  lea         edx,[esp+68h] 
0040133C  cmp         edi,edx 
0040133E  je          runMapTest+0D6h (401346h) 
00401340  call        dword ptr [__imp___invalid_parameter_noinfo (4040CCh)] 
00401346  cmp         esi,dword ptr [esp+50h] 
0040134A  je          runMapTest+141h (4013B1h) 
0040134C  cmp         edi,ebx 
0040134E  jne         runMapTest+0E6h (401356h) 
00401350  call        dword ptr [__imp___invalid_parameter_noinfo (4040CCh)] 
00401356  cmp         esi,dword ptr [edi+4] 
00401359  jne         runMapTest+0F1h (401361h) 
0040135B  call        dword ptr [__imp___invalid_parameter_noinfo (4040CCh)] 
00401361  mov         eax,dword ptr [esi+10h] 
00401364  add         dword ptr [acc (405424h)],eax 
0040136A  cmp         byte ptr [esi+15h],bl 
0040136D  je          runMapTest+107h (401377h) 
0040136F  call        dword ptr [__imp___invalid_parameter_noinfo (4040CCh)] 
00401375  jmp         runMapTest+0C4h (401334h) 
00401377  mov         eax,dword ptr [esi+8] 
0040137A  cmp         byte ptr [eax+15h],bl 
0040137D  jne         runMapTest+123h (401393h) 
0040137F  mov         esi,eax 
00401381  mov         eax,dword ptr [esi] 
00401383  cmp         byte ptr [eax+15h],bl 
00401386  jne         runMapTest+0C4h (401334h) 
00401388  mov         esi,eax 
0040138A  mov         eax,dword ptr [esi] 
0040138C  cmp         byte ptr [eax+15h],bl 
0040138F  je          runMapTest+118h (401388h) 
00401391  jmp         runMapTest+0C4h (401334h) 
00401393  mov         eax,dword ptr [esi+4] 
00401396  cmp         byte ptr [eax+15h],bl 
00401399  jne         runMapTest+0C2h (401332h) 
0040139B  jmp         runMapTest+130h (4013A0h) 
0040139D  lea         ecx,[ecx] 
004013A0  cmp         esi,dword ptr [eax+8] 
004013A3  jne         runMapTest+0C2h (401332h) 
004013A5  mov         esi,eax 
004013A7  mov         eax,dword ptr [eax+4] 
004013AA  cmp         byte ptr [eax+15h],bl 
004013AD  je          runMapTest+130h (4013A0h) 
		
Listing 2: Inner loop for iterating over a std::map.

But wait a minute! The attentive reader will complain that, hey the map is an associative container and the vector is just random access container. The trick here lies in using the algorithms, specifically the cryptically named std::equal_range that does a binary search in a sorted range for the value you give it. This allows us to emulate the behavior of std::map, but control the allocations to be like the std::vector, nice and linear. Taking a look at my little test program where I've very quickly tested the different approaches we see the following results:

 
Vector insert              1708.56 ms
Map insert                   55.37 ms
Vector find                  15.23 ms
Map find                     29.49 ms
Vector iterator               0.09 ms
Map iterator                  1.39 ms
Listing 3: Timeoutput from the test program

Take these numbers with a grain of salt, looking at my code it's pretty obvious that it is lacking in measurement since it's not really measuring user land time, but rather just the total elapsed time (which includes other processes, which can pollute the cache as well as take time). But it is a rough indication of trends and coupled with knowledge of the assembly and what the containers do, we can venture to explain these numbers.

The insertion of the vectorelements is taking up a huge amount of time! Why? Well, since we have a scattered insertion pattern, the chance of us doing the equivalent of a push_front is pretty high. push_front on a big vector is really expensive since it needs to copy all the elements up one slot in order to make room for the new element in the front. It is a pretty good guess to make that we're spending almost all the time here. On the other hand, the map insert even with the extra calls to the memory allocation system is much faster. Considering that it's never copying data, but just traversing a tree and doing one allocation it's not surprising.

Looking at a search operation for the both of them we see that even though they have the same theoretical complexity, the vector wins. Why is this? My guess is that it's simply have slightly better memory access pattern since it closes in on the search area the deeper the search goes, whereas the map basically always will have a cache miss when it pulls in a new node in the tree.

Last, the visiting of all nodes in the container is not so complicated. The vector will beat the crap out of the map every time. Again, back to the cache, and the algorithms, the vector case is basically just an add to a register whereas the map case is a stack structure and reading of memory locations (and each one is almost guaranteed to be a cache miss).

In closing

Going back to the typical usage of a structures in games, if you can do your sorting at tool time and just load up the ready to go array of data into memory at load time, you don't even need the std::vector container, but can rather just malloc memory directly and go. Both lookups and iterations will be as fast or faster than the std::map. And all because the little function std::equal_range...

Resources:

Comments