Somewhat to my surprise, the STL matched the set of criteria for a good set of standard containers for C++ that I had developed over the years.
-- Bjarne Stroustrup
Introduction
STL has an unfairly bad reputation in game development circles, in my opinion. Many people got burnt on the old HP STL that shipped with the gcc 2.7.x series, which was the norm on many platforms, and that particular implementation coupled with a poor compiler probably left a sour taste for many developers. I think STL, like many other things, has a place in C++ development. For tools it is certainly a productivity boost, and since many of us run our tools on the win32 platform, most of the concerns we have on consoles are not the dominant part of execution time there.
But I think there is a place for STL on the consoles as well. While people might know std::map and std::string as the most prominent actors of STL, plain std::vector will often win the day with its ease of use and fairly predictable behavior. Not to be forgotten are the algorithms that are part of STL; they are usually much more efficient and robust than what your average programmer will produce under time constraints. This article handles one particular case where we need an associative container, i.e. key/value pairs that we can look up by key. In other words, std::map. The problem with std::map is that it can be quite slow.
Just plain old map.
The std::map is a great little container. For building associative tables at runtime it is ideal in most cases. Underneath there is a self-balancing binary search tree (typically a red-black tree). Plugging in a custom allocator for fixed-size elements will yield pretty good results in allocation speed as well. For regular tools use, maps are a pretty good alternative when it comes to fast development time and flexibility at runtime.
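To make the discussion concrete, here is roughly the kind of map usage the rest of the article has in mind. GameObject and the keys are of course made up for the example:

#include <cstdio>
#include <map>
#include <string>

struct GameObject { int health; };   // stand-in for a real game object

int main()
{
    std::map<std::string, GameObject> objects;

    // Every insertion creates one tree node and one allocation.
    GameObject player = { 100 };
    GameObject monster = { 25 };
    objects["player"] = player;
    objects["monster"] = monster;

    // Lookup walks the tree, doing a key compare at every node it visits.
    std::map<std::string, GameObject>::iterator it = objects.find("player");
    if (it != objects.end())
        std::printf("player health: %d\n", it->second.health);
    return 0;
}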
Looking up elements in the map is basically a tree traversal with a compare at each node, so the complexity of your key is crucial. A std::string key will, for example, kill your performance. Iterating over a tree just to visit all nodes is also pretty costly. Basically you need to do a recursive traversal, which can be made iterative if you use some sort of stack structure. Let's just leave it at this: it's a little bit more complicated than adding something to a pointer.
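To show what the stack structure remark means in practice, here is a minimal sketch of visiting every node of a binary tree without recursion. The Node layout is invented for the example and is not what any particular STL implementation actually uses:

#include <cstdio>
#include <stack>

struct Node
{
    int   value;
    Node* left;
    Node* right;
};

// In-order traversal without recursion: walk down the left spine pushing
// nodes onto an explicit stack, then pop, visit, and descend into the right
// subtree. Every step is a pointer chase, i.e. a likely cache miss on a
// large tree.
void visitAll(Node* root)
{
    std::stack<Node*> pending;
    Node* node = root;
    while (node != 0 || !pending.empty())
    {
        while (node != 0)
        {
            pending.push(node);
            node = node->left;
        }
        node = pending.top();
        pending.pop();
        std::printf("%d\n", node->value);
        node = node->right;
    }
}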
I will go as far as to state that iterating over large std::maps will cost you a large amount of performance. Compared to what? Well, compared to just iterating over plain old C-style arrays or std::vectors holding the same information.
The std::vector.
Typically in a game you have the following phases:
- Initialization
- Usage of static structures
- Destruction
Say for example I've got a completely generic system where I store all my game objects in a std::map<std::string, GameObject>. At startup I create all of them, and during runtime I just iterate over them to render and update them. I really don't care about order at this point; I just want to make sure that I can call a function on all the items in the map. If I convert this to a std::vector instead, I actually significantly reduce the time spent iterating over the elements each frame. The listings below roughly show the number of instructions needed just to loop over both containers, and the vector case contains significantly less code. Actually, the vector case can be made even simpler, since I've used the index operator directly on the std::vector and for some reason my STL vendor decided to add something extra in there. Since I know what I'm doing (famous last words), I can simply take the address of the first element and index off that instead, thus avoiding STL completely.
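For reference, here is a sketch of what the two test loops could look like. acc is the global accumulator visible in the disassembly below; the element type and which member gets summed are assumptions on my part:

#include <cstddef>
#include <map>
#include <utility>
#include <vector>

int acc = 0;   // global accumulator, matching the acc symbol in the listings

// Linear walk over contiguous memory; the compiler boils this down to a
// load, an add and a pointer increment per element.
void runVectorTest(const std::vector< std::pair<int, int> >& v)
{
    for (std::size_t i = 0; i < v.size(); ++i)
        acc += v[i].second;
}

// The same visit over a std::map walks the tree through iterator
// increments, which is what produces the much longer listing below.
void runMapTest(const std::map<int, int>& m)
{
    for (std::map<int, int>::const_iterator it = m.begin(); it != m.end(); ++it)
        acc += it->second;
}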
Vector iteration (runVectorTest):

004011D0 mov edx,dword ptr [eax]
004011D2 add dword ptr [acc (405424h)],edx
004011D8 add eax,8
004011DB sub ecx,1
004011DE jne runVectorTest+130h (4011D0h)
Map iteration (runMapTest):

00401332 mov esi,eax
00401334 cmp edi,ebx
00401336 je runMapTest+0D0h (401340h)
00401338 lea edx,[esp+68h]
0040133C cmp edi,edx
0040133E je runMapTest+0D6h (401346h)
00401340 call dword ptr [__imp___invalid_parameter_noinfo (4040CCh)]
00401346 cmp esi,dword ptr [esp+50h]
0040134A je runMapTest+141h (4013B1h)
0040134C cmp edi,ebx
0040134E jne runMapTest+0E6h (401356h)
00401350 call dword ptr [__imp___invalid_parameter_noinfo (4040CCh)]
00401356 cmp esi,dword ptr [edi+4]
00401359 jne runMapTest+0F1h (401361h)
0040135B call dword ptr [__imp___invalid_parameter_noinfo (4040CCh)]
00401361 mov eax,dword ptr [esi+10h]
00401364 add dword ptr [acc (405424h)],eax
0040136A cmp byte ptr [esi+15h],bl
0040136D je runMapTest+107h (401377h)
0040136F call dword ptr [__imp___invalid_parameter_noinfo (4040CCh)]
00401375 jmp runMapTest+0C4h (401334h)
00401377 mov eax,dword ptr [esi+8]
0040137A cmp byte ptr [eax+15h],bl
0040137D jne runMapTest+123h (401393h)
0040137F mov esi,eax
00401381 mov eax,dword ptr [esi]
00401383 cmp byte ptr [eax+15h],bl
00401386 jne runMapTest+0C4h (401334h)
00401388 mov esi,eax
0040138A mov eax,dword ptr [esi]
0040138C cmp byte ptr [eax+15h],bl
0040138F je runMapTest+118h (401388h)
00401391 jmp runMapTest+0C4h (401334h)
00401393 mov eax,dword ptr [esi+4]
00401396 cmp byte ptr [eax+15h],bl
00401399 jne runMapTest+0C2h (401332h)
0040139B jmp runMapTest+130h (4013A0h)
0040139D lea ecx,[ecx]
004013A0 cmp esi,dword ptr [eax+8]
004013A3 jne runMapTest+0C2h (401332h)
004013A5 mov esi,eax
004013A7 mov eax,dword ptr [eax+4]
004013AA cmp byte ptr [eax+15h],bl
004013AD je runMapTest+130h (4013A0h)
But wait a minute! The attentive reader will complain: hey, the map is an associative container and the vector is just a random access container. The trick lies in using the algorithms, specifically the cryptically named std::equal_range, which does a binary search in a sorted range for the value you give it. This allows us to emulate the behavior of std::map while keeping the allocations like those of std::vector: nice and linear. Taking a look at my little test program, where I've very quickly tested the different approaches, we see the following results:
Vector insert    1708.56 ms
Map insert         55.37 ms
Vector find        15.23 ms
Map find           29.49 ms
Vector iterator     0.09 ms
Map iterator        1.39 ms
Take these numbers with a grain of salt. Looking at my code it's pretty obvious that the measurement is lacking, since it's not really measuring user time but rather total elapsed time (which includes other processes, which can pollute the cache as well as take time). But it is a rough indication of the trends, and coupled with knowledge of the assembly and of what the containers do, we can venture to explain these numbers.
The insertion of the vector elements is taking up a huge amount of time! Why? Well, since we have a scattered insertion pattern, the chance of doing the equivalent of a push_front is pretty high. push_front on a big vector is really expensive, since it needs to copy all the existing elements up one slot in order to make room for the new element at the front. It is a pretty good guess that we're spending almost all the time there. On the other hand, the map insert, even with the extra calls into the memory allocation system, is much faster. Considering that it never copies data, but just traverses the tree and does one allocation per element, that's not surprising.
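To see why, this is roughly what inserting into a vector that is kept sorted by key looks like (the Entry type and the key compare are assumptions for the sketch):

#include <algorithm>
#include <utility>
#include <vector>

typedef std::pair<int, int> Entry;   // key/value pair, sorted by key

// Compare by key only, so the range stays sorted for the binary searches.
static bool lessByKey(const Entry& a, const Entry& b)
{
    return a.first < b.first;
}

void sortedInsert(std::vector<Entry>& table, const Entry& e)
{
    std::vector<Entry>::iterator pos =
        std::lower_bound(table.begin(), table.end(), e, lessByKey);
    // Every element from pos to the end gets copied one slot up; for a
    // random key that is on average half the container per insert.
    table.insert(pos, e);
}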
Looking at a search operation on both of them, we see that even though they have the same theoretical complexity, the vector wins. Why is this? My guess is that it simply has a slightly better memory access pattern, since the binary search closes in on a smaller and smaller area the deeper it goes, whereas the map will basically always take a cache miss when it pulls in a new node of the tree.
Last, visiting all nodes in the container is not so complicated: the vector will beat the crap out of the map every time. Again, it comes back to the cache and the algorithms. The vector case is basically just an add to a register, whereas the map case involves a stack-like walk of the tree and a series of memory reads, each of which is almost guaranteed to be a cache miss.
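Pulling the pieces together, the vector-as-map lookup described above boils down to std::equal_range over a range kept sorted by key. A minimal sketch of the pattern (again assuming a simple key/value pair, not my exact test code):

#include <algorithm>
#include <utility>
#include <vector>

typedef std::pair<int, int> Entry;

static bool lessByKey(const Entry& a, const Entry& b)
{
    return a.first < b.first;
}

// Emulates std::map::find on a vector that is kept sorted by key.
// std::equal_range does a binary search and returns the sub-range of
// elements whose key compares equal to the one we are looking for.
const Entry* findByKey(const std::vector<Entry>& table, int key)
{
    Entry probe(key, 0);
    std::pair<std::vector<Entry>::const_iterator,
              std::vector<Entry>::const_iterator> range =
        std::equal_range(table.begin(), table.end(), probe, lessByKey);
    return range.first != range.second ? &*range.first : 0;
}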
In closing
Going back to the typical usage of structures in games: if you can do your sorting at tool time and just load the ready-to-go array of data into memory at load time, you don't even need the std::vector container, but can rather just malloc the memory directly and go. Both lookups and iteration will be as fast as or faster than the std::map. And all because of the little function std::equal_range...
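Since std::equal_range only needs a pair of random access iterators, plain pointers into a preloaded, pre-sorted block work just as well. A sketch, assuming the same kind of key/value layout as before:

#include <algorithm>
#include <cstddef>
#include <utility>

struct Entry { int key; int value; };

static bool lessByKey(const Entry& a, const Entry& b)
{
    return a.key < b.key;
}

// entries points into a block that was sorted by key at tool time and
// loaded as-is; no container needed, since plain pointers are perfectly
// valid random access iterators for the STL algorithms.
const Entry* findByKey(const Entry* entries, std::size_t count, int key)
{
    Entry probe = { key, 0 };
    std::pair<const Entry*, const Entry*> range =
        std::equal_range(entries, entries + count, probe, lessByKey);
    return range.first != range.second ? range.first : 0;
}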
Resources:
- map_vs_vector.cpp
- SGI's STL documentation
- Effective STL by Scott Meyers