
Nonintuitive result of the assignment of a double precision number to an int variable in C


... why I get two different numbers ...

Aside from the usual floating-point issues, the values of b and c are arrived at by different computation paths. c is calculated by first saving the quotient in double a.

double a = (Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;

C allows intermediate floating-point math to be computed using wider types. Check the value of FLT_EVAL_METHOD from <float.h>.

Except for assignment and cast (which remove all extra range and precision), ...

-1 indeterminable;

0 evaluate all operations and constants just to the range and precision of the type;

1 evaluate operations and constants of type float and double to the range and precision of the double type, evaluate long double operations and constants to the range and precision of the long double type;

2 evaluate all operations and constants to the range and precision of the long double type.

C11dr §5.2.4.2.2 9

OP reported 2
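
As a quick check on your own system, you can simply print the macro; this is just a small sketch, and the value it reports is implementation defined:

#include <float.h>
#include <stdio.h>

int main(void) {
    /* -1 indeterminable, 0 nominal types, 1 float/double evaluated as double,
       2 everything evaluated as long double (the OP's case) */
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}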

By saving the quotient in double a = (Vmax-Vmin)/step;, precision is forced to double whereas int b = (Vmax-Vmin)/step; could compute as long double.

This subtle difference results from (Vmax-Vmin)/step (perhaps computed as long double) being saved as a double versus remaining a long double: one value is 15 (or just above), the other just under 15. int truncation amplifies this difference to 15 and 14.

On another compiler, the results may both have been the same due to FLT_EVAL_METHOD < 2 or other floating-point characteristics.


Conversion to int from a floating-point number truncates, which is unforgiving for values near a whole number. It is often better to round() or lround(). The best solution is situation dependent.
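
A minimal sketch of that advice (assuming the goal is the nearest whole number of steps; lround() from <math.h> rounds to nearest instead of truncating):

#include <math.h>
#include <stdio.h>

int main(void) {
    double Vmax = 2.9, Vmin = 1.4, step = 0.1;
    int  b = (Vmax - Vmin) / step;          /* truncation: may yield 14 */
    long n = lround((Vmax - Vmin) / step);  /* rounding: yields 15 */
    printf("truncated = %d, rounded = %ld\n", b, n);
    return 0;
}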


This is indeed an interesting question; here is what happens precisely on your hardware. This answer gives the exact calculations with the precision of IEEE double-precision floats, i.e. a 52-bit mantissa plus one implicit bit. For details on the representation, see the Wikipedia article.

Ok, so you first define some variables:

double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;

The respective values in binary will be

Vmax = 10.111001100110011001100110011001100110011001100110011
Vmin =  1.0110011001100110011001100110011001100110011001100110
step =   .00011001100110011001100110011001100110011001100110011010

If you count the bits, you will see that I have given the first bit that is set plus 52 bits to the right. This is exactly the precision at which your computer stores a double. Note that the value of step has been rounded up.
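
If you want to check these representations on your own machine, a small sketch (not part of the original question) is to print each value with the %a hex-float conversion, which shows the stored bits exactly:

#include <stdio.h>

int main(void) {
    double Vmax = 2.9;
    double Vmin = 1.4;
    double step = 0.1;
    printf("Vmax = %a = %.20g\n", Vmax, Vmax);
    printf("Vmin = %a = %.20g\n", Vmin, Vmin);
    printf("step = %a = %.20g\n", step, step);  /* a hair above 0.1, i.e. rounded up */
    return 0;
}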

Now you do some math on these numbers. The first operation, the subtraction, results in the precise result:

 10.111001100110011001100110011001100110011001100110011
- 1.0110011001100110011001100110011001100110011001100110
--------------------------------------------------------
  1.1000000000000000000000000000000000000000000000000000

Then you divide by step, which has been rounded up by your compiler:

   1.1000000000000000000000000000000000000000000000000000
 /  .00011001100110011001100110011001100110011001100110011010
--------------------------------------------------------
1110.1111111111111111111111111111111111111111111111111100001111111111111

Due to the rounding of step, the result is a tad below 15. Unlike before, I have not rounded immediately, because that is precisely where the interesting stuff happens: Your CPU can indeed store floating point numbers of greater precision than a double, so rounding does not take place immediately.

So, when you convert the result of (Vmax-Vmin)/step directly to an int, your CPU simply cuts off the bits after the fractional point (this is how the implicit double -> int conversion is defined by the language standards):

               1110.1111111111111111111111111111111111111111111111111100001111111111111
cutoff to int: 1110
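
That truncation toward zero is easy to see in isolation (a tiny sketch, unrelated to the question's variables):

#include <stdio.h>

int main(void) {
    printf("%d\n", (int)14.999999);   /* prints 14: the fraction is discarded */
    printf("%d\n", (int)-14.999999);  /* prints -14: truncation is toward zero */
    return 0;
}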

However, if you first store the result in a variable of type double, rounding takes place:

               1110.1111111111111111111111111111111111111111111111111100001111111111111
rounded:       1111.0000000000000000000000000000000000000000000000000
cutoff to int: 1111

And this is precisely the result you got.
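
Here is the whole walk-through condensed into one program. This is a sketch, and the printed values depend on the platform: on hardware that keeps intermediates wider than double (FLT_EVAL_METHOD == 2, as in the question) it prints b = 14 and c = 15, while on targets that evaluate strictly in double both come out as 15:

#include <stdio.h>

int main(void) {
    double Vmax = 2.9;
    double Vmin = 1.4;
    double step = 0.1;

    double a = (Vmax - Vmin) / step; /* quotient rounded to double, landing on exactly 15.0 */
    int b = (Vmax - Vmin) / step;    /* a wider intermediate just under 15 truncates to 14 */
    int c = a;                       /* 15.0 truncates to 15 */

    printf("a = %.20g\n", a);
    printf("b = %d\n", b);
    printf("c = %d\n", c);
    return 0;
}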


The "simple" answer is that those seemingly-simple numbers 2.9, 1.4, and 0.1 are all represented internally as binary floating point, and in binary, the number 1/10 is represented as the infinitely-repeating binary fraction 0.00011001100110011...[2] . (This is analogous to the way 1/3 in decimal ends up being 0.333333333... .) Converted back to decimal, those original numbers end up being things like 2.8999999999, 1.3999999999, and 0.0999999999. And when you do additional math on them, those .0999999999's tend to proliferate.

And then the additional problem is that the path by which you compute something -- whether you store it in intermediate variables of a particular type, or compute it "all at once", meaning that the processor might use internal registers with greater precision than type double -- can end up making a significant difference.

The bottom line is that when you convert a double back to an int, you almost always want to round, not truncate. What happened here was that (in effect) one computation path gave you 15.0000000001 which truncated down to 15, while the other gave you 14.999999999 which truncated all the way down to 14.

See also question 14.4a in the C FAQ list.