These days there’s this well-seated OOP craze in the CS world. OOP is far from new, but in the relatively recent past (10-20 years?) it has gone from “Oh, that” to rock stardom. Everyone loves to “abstract away from implementation details”. The idea sounds great on paper, but sometimes it’ll bite ya in the ass.
The standard C++ library has a bunch of built in data containers to make your life easier. The most common one is probably the vector. Vectors store their contents in a contiguous block of memory. They can be iterated through very efficiently and make your cache much happier. Vectors also come with a bunch of operations. Some of my favorites are the push_back and pop functions.
How do these functions work internally? Ahh, that’s just a tiny implementation detail! Or IS it?
Let’s say we have the following program:
#include <iostream>
#include <vector>
int main(int argc, char* argv[])
{
std::vector<char> vowels;
std::cout << "Setting up our vector of vicious vowels!" << std::endl;
vowels.push_back('a');
vowels.push_back('e');
vowels.push_back('i');
vowels.push_back('o');
char* favorite_vowel = &vowels[2]; // Our favorite vowel is i!
std::cout << "Favorite vowel: " << *favorite_vowel << std::endl;
std::cout << "WAIT! We forgot to include 'u'. Let's stick it on the end of the array!" << std::endl;
vowels.push_back('u');
std::cout << "Okay, but my favorite vowel is still: " << *favorite_vowel << std::endl;
std::cin.get();
return 0;
}
Your output may be different than mine, but when I compile and run this program, I get the following:
Setting up our vector of vicious vowels!
Favorite vowel: i
WAIT! We forgot to include 'u'. Let's stick it on the end of the array!
Okay, but my favorite vowel is still: 3
Wait a second… my favorite vowel isn’t 3. In fact, 3 isn’t a vowel–it isn’t even a letter! And what’s more, there is no 3 in our entire program. So where the hell is our 3 coming from?
The Problem
So to figure out what’s happening here, consider the following, similar program:
#include <iostream>
#include <iomanip>
#include <vector>
int main(int argc, char* argv[])
{
std::vector<int> numbers;
for(int i=0;i<10;i++)
{
numbers.push_back(i);
std::cout << "Our vector starts at: " << std::hex << (int)&numbers[0] << std::endl;
}
std::cin.get();
return 0;
}
The program doesn’t really do anything useful–it just allocates a vector and pushes the numbers 0-9 onto it. After this push_back operation, we print the address of the first element of our vector (where our vector begins). Again, your results will probably be different, but I get:
Our vector starts at: 3333c8
Our vector starts at: 3335f0
Our vector starts at: 333410
Our vector starts at: 3335f0
Our vector starts at: 333410
Our vector starts at: 333410
Our vector starts at: 3335f0
Our vector starts at: 3335f0
Our vector starts at: 3335f0
Our vector starts at: 333410
So our vector is actually moving around in memory! This explains why our pointer “went bad” in the first example. After we pushed the ‘u’ onto the vector, the vector decided to move somewhere else, so our pointer dangled. When we dereferenced it, we got a 3. The 3 doesn’t mean anything. We could have gotten a 4, or a Z, or an ascii smiley face, or a whole string of random garbage, or nothing, or anything! Instead of saying “the pointer went bad”, we say “the pointer was invalidated”, and that because it no longer pointers to allocated space (probably), we say that the pointer is “dangling”.
When you look at the memory addresses that we get in the second example, you’ll notice that there really isn’t any discernible pattern. Our vector moves after the first, second, third, fourth, sixth and ninth loops–sometimes to a completely new spot, sometimes to a previously allocated spot. So what does this mean, and why is it happening?
The Solution
To figure this one out, we need to understand the implementation details of the vector container.
Vectors satisfy everyone’s bloodthirsty lust for a ‘dynamic array’. Arrays by definition are contained in a contiguous block of memory. For this reason, vectors also store their contents within a contiguous block of memory. This means that if we have a vector of chars, and the first element of the vector is stored at memory location 1000, the *next* element will for sure be stored at memory location 1001 (remember, chars take up 1 byte). There are a lot of advantages to doing things this way–iteration through the array is very fast, iterating through the array doesn’t kick your cache’s ass, you can directly access members of the array with no additional overhead beyond maybe an addition, etc. There are also disadvantages–poor insertion and removal times for items in the middle of the array, for instance. But the biggest disadvantage is that you just can’t make an array bigger. Period.
Vectors carry an array internally. When you try to add more stuff than this array can hold, the vector has to do *something*. This is one of those implementation details that we shouldn’t have to worry too much about, but it’s actually really important for us to know what happens in this case. The vector’s solution is to simply allocate a BIGGER array, copy all of the original array’s contents into the new one, and then delete the old array. When this happens, you get a brand new array, possibly at a different spot in memory, making any pointers to the old array invalid.
The second thing to realize is that this reallocation can be very, very expensive. For our ‘array of numbers’, it isn’t too bad–but imagine a bigger array with objects that are really expensive to create/destroy. For that situation, reallocation is a real big deal.
The compiler has to make a call. When the vector’s array is about to overflow, and the vector needs to allocate a new array, it can allocate a new array with a single extra block to store the new data. If it does this, though, then the very next time an element is added, it’ll have to do the whole reallocation again! Since we’ve said that reallocating can be expensive, why not just allocate another 10 data-type-sized blocks of memory? Why not 100, or 1000, or ONE MILLION BLOCKS OF MEMORY (feel free to raise your pinky to your mouth). I don’t know what the official ruling on the topic is–and it may be implementation defined–but the compiler will try to decide what a good amount is. What this means is that you cannot reliably predict “when” reallocation is going to happen, so you don’t know when your pointers are going to be invalidated. What you assume, then, is that certain operations “may” cause invalidation, and assume that your pointers are probably bad after these operations happen. push_back() is one of those operations. Others splicing a list, inserting into the middle of a vector, etc.
One More Thing
C++ vectors allow the user to specify how much internal space is allocated for a vector upfront, but will default to some value on their own. For example, let’s say you wanted to store “Day” objects in a “week” vector. There are 7 days in most weeks, so you could specify that a vector have a capacity of 7 so it would never need to ‘reallocate’ space as you added more objects, unless you decided to add an 8th object (but you never would).
The way you reserve memory for use in a vector is with the reserve() function. This will change the vector’s capacity() independently of its size(). The vector’s size() measures how many elements are in the vector’s array. The vector’s capacity() reports the amount of space that has been set aside for the vector with reserve(). For instance, if you add this line after the vector’s declaration in the second program:
numbers.reserve(10);
You get output along the lines of:
Our vector starts at: 3333c8
Our vector starts at: 3333c8
Our vector starts at: 3333c8
Our vector starts at: 3333c8
Our vector starts at: 3333c8
Our vector starts at: 3333c8
Our vector starts at: 3333c8
Our vector starts at: 3333c8
Our vector starts at: 3333c8
Our vector starts at: 3333c8
No reallocation, no invalidation! Better performance *and* no unexpected behavior!
And all you had to do was learn about the implementation details that you’re not supposed to know about.
This is true for other containers as well. Pointers to items in a list won’t invalidate when you add new items because there’s no reason to reallocate the entire linked list–that just isn’t how they work.
Iterators
So to finish off, we mention iterators because it’s in the post title.
Loosely speaking, iterators are just pointers. LOOSELY SPEAKING. What this means is that the same rules apply. Don’t perform a push_back operation on a vector while iterating through it, for instance, as the iterator may become invalid, leading to confusing, likely disasterous (but not necessarily so!) results.
As a general rule, iterators should be created, used, and then discarded. Even though they *can* be used in the same ways as pointers, they generally shouldn’t. They should be used for iterating. For instance, iterating through our numbers vector might look like this:
typedef std::vector<int>::iterator const_int_vec_iter;
for(const_int_vec_iter iter = numbers.begin(); iter != numbers.end(); iter++)
std::cout << *iter << std::endl;
We declare, use, and discard our iterator all in those two lines. We don’t use any invalidating functions, and we could easily replace our data container with another–let’s say a list–and this loop would still work just fine. In fact, if we remove the first cout statement in the program (which indexes into our vector… can’t do that with a list), and add our list header, then entire program recompiles and works with a list.
So using iterators is great, just don’t invalidate them!
Post a Comment