@cindex IEEE floating point
This chapter describes functions for examining the representation of
floating point numbers and controlling the floating point environment of
your program.  The functions described in this chapter are declared in
the header file @file{gsl_ieee_utils.h}.
@menu
* Representation of floating point numbers::
* Setting up your IEEE environment::
* IEEE References and Further Reading::
@end menu
@node Representation of floating point numbers
@section Representation of floating point numbers
@cindex IEEE format for floating point numbers
@cindex bias, IEEE format
@cindex exponent, IEEE format
@cindex sign bit, IEEE format
@cindex mantissa, IEEE format
The IEEE Standard for Binary Floating-Point Arithmetic defines binary
formats for single and double precision numbers.  Each number is composed
of three parts: a @dfn{sign bit} (@math{s}), an @dfn{exponent}
(@math{E}) and a @dfn{fraction} (@math{f}).  The numerical value of the
combination @math{(s,E,f)} is given by the following formula,
@tex
$$
(-1)^s (1 \cdot fffff\dots) 2^E
$$
@end tex
@ifinfo

@example
(-1)^s (1.fffff...) 2^E
@end example

@end ifinfo
@cindex normalized form, IEEE format
@cindex denormalized form, IEEE format
The sign bit is either zero or one.  The exponent ranges from a minimum value
@c{$E_{min}$}
@math{E_min} to a maximum value
@c{$E_{max}$}
@math{E_max} depending on the precision.  The exponent is converted to an
unsigned number @math{e}, known as the @dfn{biased exponent}, for storage by
adding a @dfn{bias} parameter,
@c{$e = E + \hbox{\it bias}$}
@math{e = E + bias}.
The sequence @math{fffff...} represents the digits of the binary
fraction @math{f}.  The binary digits are stored in @dfn{normalized
form}, by adjusting the exponent to give a leading digit of @math{1}.
Since the leading digit is always 1 for normalized numbers it is
assumed implicitly and does not have to be stored.  Numbers smaller than
@c{$2^{E_{min}}$}
@math{2^(E_min)}
are stored in @dfn{denormalized form} with a leading zero,
@tex
$$
(-1)^s (0 \cdot fffff\dots) 2^{E_{min}}
$$
@end tex
@ifinfo

@example
(-1)^s (0.fffff...) 2^(E_min)
@end example

@end ifinfo
@cindex zero, IEEE format
@cindex infinity, IEEE format
This allows gradual underflow down to
@c{$2^{E_{min}-p}$}
@math{2^(E_min - p)} for @math{p} bits of precision.
A zero is encoded with the special exponent of
@c{$E_{min}-1$}
@math{E_min - 1} and infinities with the exponent of
@c{$E_{max}+1$}
@math{E_max + 1}.
@cindex single precision, IEEE format
The format for single precision numbers uses 32 bits divided in the
following way,

@example
seeeeeeeefffffffffffffffffffffff

s = sign bit, 1 bit
e = exponent, 8 bits  (E_min=-126, E_max=127, bias=127)
f = fraction, 23 bits
@end example
@cindex double precision, IEEE format
The format for double precision numbers uses 64 bits divided in the
following way,

@example
seeeeeeeeeeeffffffffffffffffffffffffffffffffffffffffffffffffffff

s = sign bit, 1 bit
e = exponent, 11 bits  (E_min=-1022, E_max=1023, bias=1023)
f = fraction, 52 bits
@end example
It is often useful to be able to investigate the behavior of a
calculation at the bit-level and the library provides functions for
printing the IEEE representations in a human-readable form.

@comment float vs double vs long double
@comment (how many digits are available for each)
@deftypefun void gsl_ieee_fprintf_float (FILE * @var{stream}, const float * @var{x})
@deftypefunx void gsl_ieee_fprintf_double (FILE * @var{stream}, const double * @var{x})
These functions output a formatted version of the IEEE floating-point
number pointed to by @var{x} to the stream @var{stream}.  A pointer is
used to pass the number indirectly, to avoid any undesired promotion
from @code{float} to @code{double}.  The output takes one of the
following forms,

@table @code
@item NaN
the Not-a-Number symbol

@item Inf, -Inf
positive or negative infinity

@item 1.fffff...*2^E, -1.fffff...*2^E
a normalized floating point number

@item 0.fffff...*2^E, -0.fffff...*2^E
a denormalized floating point number

@item 0, -0
positive or negative zero

@comment @item [non-standard IEEE float], [non-standard IEEE double]
@comment an unrecognized encoding
@end table

The output can be used directly in GNU Emacs Calc mode by preceding it
with @code{2#} to indicate binary.
@end deftypefun
@deftypefun void gsl_ieee_printf_float (const float * @var{x})
@deftypefunx void gsl_ieee_printf_double (const double * @var{x})
These functions output a formatted version of the IEEE floating-point
number pointed to by @var{x} to the stream @code{stdout}.
@end deftypefun
The following program demonstrates the use of the functions by printing
the single and double precision representations of the fraction
@math{1/3}.  For comparison the representation of the value promoted from
single to double precision is also printed.

@verbatiminclude examples/ieee.c
The binary representation of @math{1/3} is @math{0.01010101...}.  The
output below shows that the IEEE format normalizes this fraction to give
a leading digit of 1,

@example
 f= 1.01010101010101010101011*2^-2
fd= 1.0101010101010101010101100000000000000000000000000000*2^-2
 d= 1.0101010101010101010101010101010101010101010101010101*2^-2
@end example
The output also shows that a single-precision number is promoted to
double-precision by adding zeros in the binary representation.
@comment importance of using 1.234L in long double calculations
@comment int main (void)
@comment   long double x = 1.0, y = 1.0;
@comment   x = x + 0.2;
@comment   y = y + 0.2L;
@comment   printf(" d %.20Lf\n",x);
@comment   printf("ld %.20Lf\n",y);
@comment  d 1.20000000000000001110
@comment ld 1.20000000000000000004
@comment @end example
@node Setting up your IEEE environment
@section Setting up your IEEE environment
@cindex IEEE exceptions
@cindex precision, IEEE arithmetic
@cindex rounding mode
@cindex arithmetic exceptions
@cindex exceptions, IEEE arithmetic
@cindex division by zero, IEEE exceptions
@cindex underflow, IEEE exceptions
@cindex overflow, IEEE exceptions
The IEEE standard defines several @dfn{modes} for controlling the
behavior of floating point operations.  These modes specify the important
properties of computer arithmetic: the direction used for rounding (e.g.@:
whether numbers should be rounded up, down or to the nearest number),
the rounding precision and how the program should handle arithmetic
exceptions, such as division by zero.
Many of these features can now be controlled via standard functions such
as @code{fpsetround}, which should be used whenever they are available.
Unfortunately in the past there has been no universal API for
controlling their behavior---each system has had its own low-level way
of accessing them.  To help you write portable programs GSL allows you
to specify modes in a platform-independent way using the environment
variable @code{GSL_IEEE_MODE}.  The library then takes care of all the
necessary machine-specific initializations for you when you call the
function @code{gsl_ieee_env_setup}.
@deftypefun void gsl_ieee_env_setup ()
This function reads the environment variable @code{GSL_IEEE_MODE} and
attempts to set up the corresponding specified IEEE modes.  The
environment variable should be a list of keywords, separated by
commas, like this,

@example
@code{GSL_IEEE_MODE} = "@var{keyword},@var{keyword},..."
@end example
where @var{keyword} is one of the following mode-names,

@table @code
@item single-precision
sets the computation precision to single precision

@item double-precision
sets the computation precision to double precision

@item extended-precision
sets the computation precision to extended precision

@item round-to-nearest
selects round-to-nearest rounding

@item round-down
selects rounding towards minus infinity

@item round-up
selects rounding towards plus infinity

@item round-to-zero
selects rounding towards zero

@item mask-all
masks all of the exceptions below

@item mask-invalid
masks the invalid-operation exception

@item mask-denormalized
masks the denormalized-operand exception

@item mask-division-by-zero
masks the division-by-zero exception

@item mask-overflow
masks the overflow exception

@item mask-underflow
masks the underflow exception

@item trap-inexact
traps the inexact exception

@item trap-common
traps the common exceptions (invalid, denormalized, division-by-zero,
overflow, underflow)
@end table
If @code{GSL_IEEE_MODE} is empty or undefined then the function returns
immediately and no attempt is made to change the system's IEEE
mode.  When the modes from @code{GSL_IEEE_MODE} are turned on the
function prints a short message showing the new settings to remind you
that the results of the program will be affected.

If the requested modes are not supported by the platform being used then
the function calls the error handler and returns an error code of
@code{GSL_EUNSUP}.
@end deftypefun
When options are specified using this method, the resulting mode is
based on a default setting of the highest available precision (double
precision or extended precision, depending on the platform) in
round-to-nearest mode, with all exceptions enabled apart from the
@sc{inexact} exception.  The @sc{inexact} exception is generated
whenever rounding occurs, so it must generally be disabled in typical
scientific calculations.  All other floating-point exceptions are
enabled by default, including underflows and the use of denormalized
numbers, for safety.  They can be disabled with the individual
@code{mask-} settings or together using @code{mask-all}.
The following combination of modes is convenient for many purposes,

@example
GSL_IEEE_MODE="double-precision,"\
              "mask-underflow,"\
              "mask-denormalized"
@end example

This choice ignores any errors relating to small numbers (either
denormalized, or underflowing to zero) but traps overflows, division by
zero and invalid operations.
Note that on the x86 series of processors this function sets both the
original x87 mode and the newer @sc{mxcsr} mode, which controls SSE
floating-point operations.  The SSE floating-point units do not have a
precision-control bit, and always work in double-precision.  The
single-precision and extended-precision keywords have no effect in
this case.
To demonstrate the effects of different rounding modes consider the
following program which computes @math{e}, the base of natural
logarithms, by summing a rapidly-decreasing series,
@tex
$$
e = 1 + {1 \over 1!} + {1 \over 2!} + {1 \over 3!} + {1 \over 4!} + \dots
$$
@end tex
@ifinfo

@example
e = 1 + 1/1! + 1/2! + 1/3! + 1/4! + ...
@end example

@end ifinfo
@verbatiminclude examples/ieeeround.c
Here are the results of running the program in @code{round-to-nearest}
mode.  This is the IEEE default so it isn't really necessary to specify
it,

@example
$ GSL_IEEE_MODE="round-to-nearest" ./a.out
i= 1 sum=1.000000000000000000 error=-1.71828
i= 2 sum=2.000000000000000000 error=-0.718282
...
i=18 sum=2.718281828459045535 error=4.44089e-16
i=19 sum=2.718281828459045535 error=4.44089e-16
@end example
After nineteen terms the sum converges to within @c{$4 \times 10^{-16}$}
@math{4 \times 10^-16} of the correct value.
If we now change the rounding mode to
@code{round-down} the final result is less accurate,
@example
$ GSL_IEEE_MODE="round-down" ./a.out
i= 1 sum=1.000000000000000000 error=-1.71828
...
i=19 sum=2.718281828459041094 error=-3.9968e-15
@end example
The final result is about
@c{$4 \times 10^{-15}$}
@math{4 \times 10^-15}
below the correct value, an order of magnitude worse than the result
obtained in the @code{round-to-nearest} mode.
If we change the rounding mode to @code{round-up} then the final result
is higher than the correct value (when we add each term to the sum the
final result is always rounded up, which increases the sum by at least
one tick until the added term underflows to zero).  To avoid this
problem we would need to use a safer convergence criterion, such as
@code{while (fabs(sum - oldsum) > epsilon)}, with a suitably chosen
value of @math{epsilon}.
Finally we can see the effect of computing the sum using
single-precision rounding, in the default @code{round-to-nearest}
mode.  In this case the program thinks it is still using double precision
numbers but the CPU rounds the result of each floating point operation
to single-precision accuracy.  This simulates the effect of writing the
program using single-precision @code{float} variables instead of
@code{double} variables.  The iteration stops after about half the number
of iterations and the final result is much less accurate,
@example
$ GSL_IEEE_MODE="single-precision" ./a.out
...
i=12 sum=2.718281984329223633 error=1.5587e-07
@end example
The error is
@c{$O(10^{-7})$}
@math{O(10^-7)}, which corresponds to single
precision accuracy (about 1 part in @math{10^7}).  Continuing the
iterations further does not decrease the error because all the
subsequent results are rounded to the same value.
@node IEEE References and Further Reading
@section References and Further Reading

The reference for the IEEE standard is,

@itemize @w{}
@item
ANSI/IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic.
@end itemize
A more pedagogical introduction to the standard can be found in the
following paper,

@itemize @w{}
@item
David Goldberg: What Every Computer Scientist Should Know About
Floating-Point Arithmetic.  @cite{ACM Computing Surveys}, Vol.@: 23, No.@: 1
(March 1991), pages 5--48.

Corrigendum: @cite{ACM Computing Surveys}, Vol.@: 23, No.@: 3 (September
1991), page 413.  See also the sections by B. A. Wichmann and Charles
B. Dunham in Surveyor's Forum: ``What Every Computer Scientist Should
Know About Floating-Point Arithmetic''.  @cite{ACM Computing Surveys},
Vol.@: 24, No.@: 3 (September 1992), page 319.
@end itemize
A detailed textbook on IEEE arithmetic and its practical use is
available from SIAM Press,

@itemize @w{}
@item
Michael L. Overton, @cite{Numerical Computing with IEEE Floating Point Arithmetic},
SIAM Press, ISBN 0898715717.
@end itemize
@comment to turn on math exception handling use __setfpucw, see
@comment /usr/include/i386/fpu_control.h

@comment #include <math.h>
@comment #include <stdio.h>
@comment #include <fpu_control.h>
@comment double f (double x);

@comment   double a = 0;
@comment   double y, z;
@comment   __setfpucw(0x1372);

@comment mention extended vs double precision on Pentium, and how to get around
@comment it (-ffloat-store, or selective use of volatile)

@comment On the alpha the option -mieee is needed with gcc
@comment In Digital's compiler the equivalent is -ieee, -ieee-with-no-inexact
@comment and -ieee-with-inexact, or -fpe1 or -fpe2