@cindex IEEE floating point
This chapter describes functions for examining the representation of
floating point numbers and controlling the floating point environment of
your program.  The functions described in this chapter are declared in
the header file @file{gsl_ieee_utils.h}.
@menu
* Representation of floating point numbers::
* Setting up your IEEE environment::
* IEEE References and Further Reading::
@end menu
@node Representation of floating point numbers
@section Representation of floating point numbers
@cindex IEEE format for floating point numbers
@cindex bias, IEEE format
@cindex exponent, IEEE format
@cindex sign bit, IEEE format
@cindex mantissa, IEEE format
The IEEE Standard for Binary Floating-Point Arithmetic defines binary
formats for single and double precision numbers.  Each number is composed
of three parts: a @dfn{sign bit} (@math{s}), an @dfn{exponent}
(@math{E}) and a @dfn{fraction} (@math{f}).  The numerical value of the
combination @math{(s,E,f)} is given by the following formula,
@tex
$$
(-1)^s (1 \cdot fffff\dots) 2^E
$$
@end tex
@ifinfo

@example
(-1)^s (1.fffff...) 2^E
@end example

@end ifinfo
@cindex normalized form, IEEE format
@cindex denormalized form, IEEE format
The sign bit is either zero or one.  The exponent ranges from a minimum value
@c{$E_{min}$}
@math{E_min} to a maximum value
@c{$E_{max}$}
@math{E_max} depending on the precision.  The exponent is converted to an
unsigned number @math{e}, known as the @dfn{biased exponent}, for storage by
adding a @dfn{bias} parameter,
@c{$e = E + \hbox{\it bias}$}
@math{e = E + bias}.
The sequence @math{fffff...} represents the digits of the binary
fraction @math{f}.  The binary digits are stored in @dfn{normalized
form}, by adjusting the exponent to give a leading digit of @math{1}.
Since the leading digit is always 1 for normalized numbers it is
assumed implicitly and does not have to be stored.  Numbers smaller than
@c{$2^{E_{min}}$}
@math{2^(E_min)}
are stored in @dfn{denormalized form} with a leading zero,
@tex
$$
(-1)^s (0 \cdot fffff\dots) 2^{E_{min}}
$$
@end tex
@ifinfo

@example
(-1)^s (0.fffff...) 2^(E_min)
@end example

@end ifinfo
@cindex zero, IEEE format
@cindex infinity, IEEE format
This allows gradual underflow down to
@c{$2^{E_{min}-p}$}
@math{2^(E_min - p)} for @math{p} bits of precision.
A zero is encoded with the special exponent of
@c{$E_{min}-1$}
@math{E_min - 1} and infinities with the exponent of
@c{$E_{max}+1$}
@math{E_max + 1}.
@cindex single precision, IEEE format
The format for single precision numbers uses 32 bits divided in the
following way,

@example
seeeeeeeefffffffffffffffffffffff

s = sign bit, 1 bit
e = exponent, 8 bits  (E_min=-126, E_max=127, bias=127)
f = fraction, 23 bits
@end example
@cindex double precision, IEEE format
The format for double precision numbers uses 64 bits divided in the
following way,

@example
seeeeeeeeeeeffffffffffffffffffffffffffffffffffffffffffffffffffff

s = sign bit, 1 bit
e = exponent, 11 bits  (E_min=-1022, E_max=1023, bias=1023)
f = fraction, 52 bits
@end example
It is often useful to be able to investigate the behavior of a
calculation at the bit-level and the library provides functions for
printing the IEEE representations in a human-readable form.

@comment float vs double vs long double
@comment (how many digits are available for each)
@deftypefun void gsl_ieee_fprintf_float (FILE * @var{stream}, const float * @var{x})
@deftypefunx void gsl_ieee_fprintf_double (FILE * @var{stream}, const double * @var{x})
These functions output a formatted version of the IEEE floating-point
number pointed to by @var{x} to the stream @var{stream}.  A pointer is
used to pass the number indirectly, to avoid any undesired promotion
from @code{float} to @code{double}.  The output takes one of the
following forms,

@table @code
@item NaN
the Not-a-Number symbol

@item Inf, -Inf
positive or negative infinity

@item 1.fffff...*2^E, -1.fffff...*2^E
a normalized floating point number

@item 0.fffff...*2^E, -0.fffff...*2^E
a denormalized floating point number

@item 0, -0
positive or negative zero

@comment @item [non-standard IEEE float], [non-standard IEEE double]
@comment an unrecognized encoding
@end table

The output can be used directly in GNU Emacs Calc mode by preceding it
with @code{2#} to indicate binary.
@end deftypefun
@deftypefun void gsl_ieee_printf_float (const float * @var{x})
@deftypefunx void gsl_ieee_printf_double (const double * @var{x})
These functions output a formatted version of the IEEE floating-point
number pointed to by @var{x} to the stream @code{stdout}.
@end deftypefun
The following program demonstrates the use of the functions by printing
the single and double precision representations of the fraction
@math{1/3}.  For comparison the representation of the value promoted from
single to double precision is also printed.

@verbatiminclude examples/ieee.c
The binary representation of @math{1/3} is @math{0.01010101...}.  The
output below shows that the IEEE format normalizes this fraction to give
a leading digit of 1,

@example
 f= 1.01010101010101010101011*2^-2
fd= 1.0101010101010101010101100000000000000000000000000000*2^-2
 d= 1.0101010101010101010101010101010101010101010101010101*2^-2
@end example
The output also shows that a single-precision number is promoted to
double-precision by adding zeros in the binary representation.
@comment importance of using 1.234L in long double calculations
@comment int main (void)
@comment   long double x = 1.0, y = 1.0;
@comment   x = x + 0.2;
@comment   y = y + 0.2L;
@comment   printf(" d %.20Lf\n",x);
@comment   printf("ld %.20Lf\n",y);
@comment  d 1.20000000000000001110
@comment ld 1.20000000000000000004
@comment @end example
@node Setting up your IEEE environment
@section Setting up your IEEE environment
@cindex IEEE exceptions
@cindex precision, IEEE arithmetic
@cindex rounding mode
@cindex arithmetic exceptions
@cindex exceptions, IEEE arithmetic
@cindex division by zero, IEEE exceptions
@cindex underflow, IEEE exceptions
@cindex overflow, IEEE exceptions
The IEEE standard defines several @dfn{modes} for controlling the
behavior of floating point operations.  These modes specify the important
properties of computer arithmetic: the direction used for rounding (e.g.@:
whether numbers should be rounded up, down or to the nearest number),
the rounding precision and how the program should handle arithmetic
exceptions, such as division by zero.
Many of these features can now be controlled via standard functions such
as @code{fpsetround}, which should be used whenever they are available.
Unfortunately in the past there has been no universal API for
controlling their behavior---each system has had its own low-level way
of accessing them.  To help you write portable programs GSL allows you
to specify modes in a platform-independent way using the environment
variable @code{GSL_IEEE_MODE}.  The library then takes care of all the
necessary machine-specific initializations for you when you call the
function @code{gsl_ieee_env_setup}.
@deftypefun void gsl_ieee_env_setup ()
This function reads the environment variable @code{GSL_IEEE_MODE} and
attempts to set up the corresponding specified IEEE modes.  The
environment variable should be a list of keywords, separated by
commas, like this,

@example
@code{GSL_IEEE_MODE} = "@var{keyword},@var{keyword},..."
@end example
where @var{keyword} is one of the following mode-names,

@table @code
@item single-precision
sets the computation precision to single precision

@item double-precision
sets the computation precision to double precision

@item extended-precision
sets the computation precision to extended precision

@item round-to-nearest
selects round-to-nearest rounding

@item round-down
selects rounding towards minus infinity

@item round-up
selects rounding towards plus infinity

@item round-to-zero
selects rounding towards zero

@item mask-all
masks all of the exceptions below

@item mask-invalid
masks the invalid-operation exception

@item mask-denormalized
masks the denormalized-operand exception

@item mask-division-by-zero
masks the division-by-zero exception

@item mask-overflow
masks the overflow exception

@item mask-underflow
masks the underflow exception

@item trap-inexact
traps the inexact exception

@item trap-common
traps the common exceptions (invalid, denormalized, division-by-zero,
overflow, underflow)
@end table
If @code{GSL_IEEE_MODE} is empty or undefined then the function returns
immediately and no attempt is made to change the system's IEEE
mode.  When the modes from @code{GSL_IEEE_MODE} are turned on the
function prints a short message showing the new settings to remind you
that the results of the program will be affected.

If the requested modes are not supported by the platform being used then
the function calls the error handler and returns an error code of
@code{GSL_EUNSUP}.
@end deftypefun
When options are specified using this method, the resulting mode is
based on a default setting of the highest available precision (double
precision or extended precision, depending on the platform) in
round-to-nearest mode, with all exceptions enabled apart from the
@sc{inexact} exception.  The @sc{inexact} exception is generated
whenever rounding occurs, so it must generally be disabled in typical
scientific calculations.  All other floating-point exceptions are
enabled by default, including underflows and the use of denormalized
numbers, for safety.  They can be disabled with the individual
@code{mask-} settings or together using @code{mask-all}.
The following combination of modes is convenient for many purposes,

@example
GSL_IEEE_MODE="double-precision,"\
              "mask-underflow,"\
              "mask-denormalized"
@end example

This choice ignores any errors relating to small numbers (either
denormalized, or underflowing to zero) but traps overflows, division by
zero and invalid operations.
Note that on the x86 series of processors this function sets both the
original x87 mode and the newer @sc{mxcsr} mode, which controls SSE
floating-point operations.  The SSE floating-point units do not have a
precision-control bit, and always work in double-precision.  The
single-precision and extended-precision keywords have no effect in
this case.
To demonstrate the effects of different rounding modes consider the
following program which computes @math{e}, the base of natural
logarithms, by summing a rapidly-decreasing series,
@tex
$$
e = 1 + {1 \over 1!} + {1 \over 2!} + {1 \over 3!} + {1 \over 4!} + \dots
$$
@end tex
@ifinfo

@example
e = 1 + 1/1! + 1/2! + 1/3! + 1/4! + ...
@end example

@end ifinfo
@verbatiminclude examples/ieeeround.c
Here are the results of running the program in @code{round-to-nearest}
mode.  This is the IEEE default so it isn't really necessary to specify
it,

@example
$ GSL_IEEE_MODE="round-to-nearest" ./a.out
i= 1 sum=1.000000000000000000 error=-1.71828
i= 2 sum=2.000000000000000000 error=-0.718282
...
i=18 sum=2.718281828459045535 error=4.44089e-16
i=19 sum=2.718281828459045535 error=4.44089e-16
@end example
After nineteen terms the sum converges to within @c{$4 \times 10^{-16}$}
@math{4 \times 10^-16} of the correct value.
If we now change the rounding mode to
@code{round-down} the final result is less accurate,
@example
$ GSL_IEEE_MODE="round-down" ./a.out
i= 1 sum=1.000000000000000000 error=-1.71828
...
i=19 sum=2.718281828459041094 error=-3.9968e-15
@end example
The final result is about
@c{$4 \times 10^{-15}$}
@math{4 \times 10^-15}
below the correct value, an order of magnitude worse than the result
obtained in the @code{round-to-nearest} mode.
If we change the rounding mode to @code{round-up} then the final result
is higher than the correct value (when we add each term to the sum the
final result is always rounded up, which increases the sum by at least
one tick until the added term underflows to zero).  To avoid this
problem we would need to use a safer convergence criterion, such as
@code{while (fabs(sum - oldsum) > epsilon)}, with a suitably chosen
value of @math{epsilon}.
Finally we can see the effect of computing the sum using
single-precision rounding, in the default @code{round-to-nearest}
mode.  In this case the program thinks it is still using double precision
numbers but the CPU rounds the result of each floating point operation
to single-precision accuracy.  This simulates the effect of writing the
program using single-precision @code{float} variables instead of
@code{double} variables.  The iteration stops after about half the number
of iterations and the final result is much less accurate,
@example
$ GSL_IEEE_MODE="single-precision" ./a.out
...
i=12 sum=2.718281984329223633 error=1.5587e-07
@end example
The error is
@c{$O(10^{-7})$}
@math{O(10^-7)}, which corresponds to single
precision accuracy (about 1 part in @math{10^7}).  Continuing the
iterations further does not decrease the error because all the
subsequent results are rounded to the same value.
@node IEEE References and Further Reading
@section References and Further Reading

The reference for the IEEE standard is,

@itemize @w{}
@item
ANSI/IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic.
@end itemize
A more pedagogical introduction to the standard can be found in the
following paper,

@itemize @w{}
@item
David Goldberg: What Every Computer Scientist Should Know About
Floating-Point Arithmetic.  @cite{ACM Computing Surveys}, Vol.@: 23, No.@: 1
(March 1991), pages 5--48.

Corrigendum: @cite{ACM Computing Surveys}, Vol.@: 23, No.@: 3 (September
1991), page 413.  See also the sections by B. A. Wichmann and Charles
B. Dunham in Surveyor's Forum: ``What Every Computer Scientist Should
Know About Floating-Point Arithmetic''.  @cite{ACM Computing Surveys},
Vol.@: 24, No.@: 3 (September 1992), page 319.
@end itemize
A detailed textbook on IEEE arithmetic and its practical use is
available from SIAM Press,

@itemize @w{}
@item
Michael L. Overton, @cite{Numerical Computing with IEEE Floating Point Arithmetic},
SIAM Press, ISBN 0898715717.
@end itemize
@comment to turn on math exception handling use __setfpucw, see
@comment /usr/include/i386/fpu_control.h

@comment #include <math.h>
@comment #include <stdio.h>
@comment #include <fpu_control.h>
@comment double f (double x);

@comment   double a = 0;
@comment   double y, z;
@comment   __setfpucw(0x1372);

@comment mention extended vs double precision on Pentium, and how to get around
@comment it (-ffloat-store, or selective use of volatile)

@comment On the alpha the option -mieee is needed with gcc
@comment In Digital's compiler the equivalent is -ieee, -ieee-with-no-inexact
@comment and -ieee-with-inexact, or -fpe1 or -fpe2