This stuff is fairly basic, and exists as a base for future posts mostly.
If you’re an old, early-twenties fogey such as myself, you can remember back to a time when it cost a few grand to buy a computer which, instead of using a hard disk drive, took floppy disks the size of your head. Still, even in that ancient age, people knew that “computers talk in ones and zeros”. That’s right. If you don’t know the first thing about computers, you still know they speak in binary.
These ones and zeros work really well for integral values, but computers will choke and gag on fractional values more often than you realize. And I’m not just talking about ultra tiny fractions representing the width of a grain of sand in miles… even for simple stuff. Consider the following classic example:
#include <iostream>
int main(int argc, char *argv[])
{
float one_tenth = 0.1f;
float total = 0.0f;
for(int i=0;i<1000;i++)
total += one_tenth;
std::cout << total;
std::cin.get();
}
The program is pretty straightforward. Take a variable (set to 0) and add 0.1 to it a thousand times. If you don’t feel like whipping out a calculator and adding 0.1 over and over again, then trust me–the answer will be 100. So we hit compile, kick back, and get ready to bask in the glow of the 100 that’s about to be printed to our screen:
99.99
Ahh… enjoy that beautiful 100 glow. Wait, 99.99? What? I mean it’s just adding 0.1 a bunch, so there wouldn’t be any rounding errors, right? RIGHT?!?
The price is WRONG!
I’ll assume that if you’re reading this, you’re capable of at least converting integers from base 2 to base 10 and back. If not, here is a random link I pulled off of Google. Also, remember that “decimal points” in binary are “binary points”, and that “decimal points” in any given number base can be called “radix points”. Radix is a pretty cool word… kinda makes me think of a ninja throwing star or something. Sweet.
Also, assume 32-bit word size, blah blah.
One thing I’ll assume you’re NOT familiar with is how to represent a non-integral number, like maybe 30.3125, let’s say. This isn’t a big deal, so we’ll just get to it.
Binary Fractions
We know how to convert the 30 part to binary, so we’ll do that and just write those numbers down and stick a radix point after it:
0001 1110 . ????
That’s 16 + 8 + 4 + 2 = 30. Easy! The question marks are what we need to figure out.
The stuff that comes after the radix point essentially follows the same pattern as the stuff that came before, except in reverse, and as a reciprocal, and you don’t repeat the 1. Okay, maybe that doesn’t sound very simple, but it looks like this:
0 0 1 0 . 0 1 0 0
8 4 2 1 1/2 1/4 1/8 1/16
which would be 2.25, for example.
So for our original number, 30.3125, we throw away the 30 since we figured that out already. This gives us:
0.3125
Okay, great. But how the hell do we figure out what fractions (out of the ones we can choose from, 1/(2^n)) add up to .3125?
The trick is to multiply the number by 2, use the ‘integer part’ of the fraction as our binary number for a given place, and repeat this process until we run out of numbers or notice a cycle. Again, this sounds way harder than it actually is:
0.3125 * 2 = 0.625
0.625 * 2 = 1.25
0.25 * 2 = 0.5
0.5 * 2 = 1.0
Done!!
Taking the integer part of the products gives us 0101. Viola! The part of our binary number after the radix point is 0101, or 1/4 + 1/16.
Thus, our complete number is 0001 1100 . 0101 = 30.3125. BAM!
Repeating Binary Representations
So what does this all have to do with our screwy program that adds up 0.01 a bunch of times?
Well while we think about that, why don’t we convert this 0.01 number to binary to test out or new converting skills!
ALL right so:
0.1 * 2 = 0.2
0.2 * 2 = 0.4
0.4 * 2 = 0.8
0.8 * 2 = 1.6
0.6 * 2 = 1.2
0.2 * 2 = 0.4
0.4 * 2 = 0.8
0.8 * 2 = 1.6
0.6 * 2 = 1.2
0.2 * 2 = 0.4
0.4 * 2 = 0.8
0.8 * 2 = 1.6
0.6 * 2 = 1.2
...
…starting to notice anything? Yep, that’s right… we’re stuck in a loop. The number we’re getting looks like this:
0 . 0001 1001 1001 1001 ...
That’s right. 0.1, one of the simplest, easy to work with base 10 fractions actually cannot be fully represented in binary, much like 1/3 in our decimal system. Since a computer can only store so many bits of the number, error will accumulate over time and our answer will end up off the mark.
I won’t get into it, but the sad fact of the matter is that most decimal to binary conversions end up this way.
Note: This strictly-mathematical way is not how computers store floating point numbers internally. Fixed point numbers would look like this, but modern day personal computers generally work with a IEEE floating point specification (754). Future posts will go into that. This actually isn’t a big deal for the particular problem we’re dealing with here.
Epsilon Clamps
In order for our program to exhibit proper behavior, we have to do something. But what?
Well, there’s really not too much that we can do about the rounding error–so we’ll assume that when we operate using fractions, we won’t always get the exact answer we expect. What we need to do, then, is figure out when a value is “close enough”.
In our program, for instance, we expect the answer to be 100. We can’t strictly test for 100 (if(total == 100.0f)) because there’s a good chance that we’ll never actually get a total of exactly 100. So instead, we check to see if the total is “sufficiently close” to 100.
But what’s “sufficiently close”? In this case, our answer was off by a tenth or so, so we can set or EPSILON, or the amount of variance that we’ll allow, to 0.2. If our total ends up between 99.8 and 100.2, then we’re set!
Sadly, most epsilons are picked this way–as some “arbitrarily small value”. It’s really too bad, too, because it can be a mess. If we change the loop in the program presented to iterate 50,000 times, we expect our answer to be 5,000. Factoring in our epsilon, we’ll accept anything from 4999.8 to 5000.2. Our computed answer is actually:
5002.53
Which is definitely is NOT within our epsilon.
So how do we fix this problem? Tune in some arbitrarily long time later for the next episode of AS THE EPSILON EPSES… Epsilates? Epsifilutes? Meh… just check back later.
James | 06-Apr-08 at 4:16 pm | Permalink
We’re just starting this in my system architecture class. I guess I’m pretty spoiled working with higher level languages like ActionScript and Java. Getting used to switching between bases like 10, 2, 8, and 16 sounds like it will take a lot of time and patience. Unless I have a calculator handy that does it for me. Hahah!
sahan | 18-Apr-08 at 11:59 pm | Permalink
i m also facing the same problem. i want to multiply two floating point numbers 0.4*0.4 Instead of giving the answer 0.16 it gives 0.1600000011 which creates a problm in our programme…
Douglas Thompson | 24-Apr-08 at 8:59 pm | Permalink
James - The calc program bundled with Windows converts to/from hex/decimal/binary. I use that a lot. Also, the hex editor I use (for when I deal with reading/generating binary files) does this automatically in the status bar. The actual math isn’t too bad, but I don’t find myself doing it too much outside of the academic environment. Still doesn’t hurt to know it in a pinch, though
sahan - Yeah… it’s rough. You can always just clamp it if precision isn’t an issue and you can be sure that your epsilon is always going to be big enough, but not “too big”. Most people just fudge it this way and it usually works “well enough”… but there’s gotta be a better solution out there.
Rezmason | 09-May-08 at 10:36 pm | Permalink
For situations requiring perfect accuracy, we can create a Fraction class, with a _numerator:uint and a _denominator:uint, with getters for the numerator:uint, denominator:uint and approximateValue:Number. The numerator(value:uint) and denominator(value:uint) setters would call a private function reduce() that would reduce the fraction. A small pile of overloaded operators would be handy, but ultimately unnecessary.
The Fraction class could be subclassed to a ComplexNumber representation without much trouble. These classes would give the most accurate floating point estimates of their numerical values. Taking this to the extreme, however, and allowing the introduction of irrational constants would require a means of representing things like “π^2+3*e÷4″ as an associated cluster of numerical objects, whose reduce() methods would be monstrous, at which point any programmer would say, “Forget it, let’s use floating point.”