htswanalysis/MACS/lib/gsl/gsl-1.11/doc/fftalgorithms.tex

   1 \documentclass[fleqn,12pt]{article}
   2 %
   3 \setlength{\oddsidemargin}{-0.25in}
   4 \setlength{\textwidth}{7.0in}
   5 \setlength{\topmargin}{-0.25in}
   6 \setlength{\textheight}{9.5in}
   7 %
   8 \usepackage{algorithmic}
   9 \newenvironment{algorithm}{\begin{quote} %\vspace{1em}
  10 \begin{algorithmic}\samepage}{\end{algorithmic} %\vspace{1em}
  11 \end{quote}}
  12 \newcommand{\Real}{\mathop{\mathrm{Re}}}
  13 \newcommand{\Imag}{\mathop{\mathrm{Im}}}
  14
  15 \begin{document}
  16 \title{FFT Algorithms}
  17 \author{Brian Gough, {\tt bjg@network-theory.co.uk}}
  18 \date{May 1997}
  19 \maketitle
  20
  21 \section{Introduction}
  22
  23 Fast Fourier Transforms (FFTs) are efficient algorithms for
  24 calculating the discrete fourier transform (DFT),
  25 %
  26 \begin{eqnarray}
  27 h_a &=& \mathrm{DFT}(g_b) \\
  28     &=& \sum_{b=0}^{N-1} g_b \exp(-2\pi i a b /N) \qquad 0 \leq a \leq N-1\\
  29     &=& \sum_{b=0}^{N-1} g_b W_N^{ab} \qquad W_N= \exp(-2\pi i/N)
  30 \end{eqnarray}
  31 %
  32 The DFT usually arises as an approximation to the continuous fourier
  33 transform when functions are sampled at discrete intervals in space or
  34 time. The naive evaluation of the discrete fourier transform is a
  35 matrix-vector multiplication ${\mathbf W}\vec{g}$, and would take
  36 $O(N^2)$ operations for $N$ data-points. The general principle of the
  37 Fast Fourier Transform algorithms is to use a divide-and-conquer
  38 strategy to factorize the matrix $W$ into smaller sub-matrices,
  39 typically reducing the operation count to $O(N \sum f_i)$ if $N$ can
  40 be factorized into smaller integers, $N=f_1 f_2 \dots f_n$.
  41
  42 This chapter explains the algorithms used in the GSL FFT routines
  43 and provides some information on how to extend them. To learn more about
  44 the FFT you should read the review article {\em Fast Fourier
  45 Transforms: A Tutorial Review and A State of the Art} by Duhamel and
  46 Vetterli~\cite{duhamel90}. There are several introductory books on the
  47 FFT with example programs, such as {\em The Fast Fourier Transform} by
  48 Brigham~\cite{brigham74} and {\em DFT/FFT and Convolution Algorithms}
  49 by Burrus and Parks~\cite{burrus84}. In 1979 the IEEE published a
  50 compendium of carefully-reviewed Fortran FFT programs in {\em Programs
  51 for Digital Signal Processing}~\cite{committee79} which is a useful
  52 reference for implementations of many different FFT algorithms. If you
  53 are interested in using DSPs then the {\em Handbook of Real-Time Fast
  54 Fourier Transforms}~\cite{smith95} provides detailed information on
  55 the algorithms and hardware needed to design, build and test DSP
  56 applications. Many FFT algorithms rely on results from number theory.
  57 These results are covered in the books {\em Fast transforms:
  58 algorithms, analyses, applications}, by Elliott and
  59 Rao~\cite{elliott82}, {\em Fast Algorithms for Digital Signal
  60 Processing} by Blahut~\cite{blahut} and {\em Number Theory in Digital
  61 Signal Processing} by McClellan and Rader~\cite{mcclellan79}. There is
  62 also an annotated bibliography of papers on the FFT and related topics
  63 by Burrus~\cite{burrus-note}.
  64
  65 \section{Families of FFT algorithms}
  66 %
  67 There are two main families of FFT algorithms: the Cooley-Tukey
  68 algorithm and the Prime Factor algorithm. These differ in the way they
  69 map the full FFT into smaller sub-transforms. Of the Cooley-Tukey
  70 algorithms there are two types of routine in common use: mixed-radix
  71 (general-$N$) algorithms and radix-2 (power of 2) algorithms. Each
  72 type of algorithm can be further classified by additional characteristics,
  73 such as whether it operates in-place or uses additional scratch space,
  74 whether its output is in a sorted or scrambled order, and whether it
  75 uses decimation-in-time or -frequency iterations.
  76
  77 Mixed-radix algorithms work by factorizing the data vector into
  78 shorter lengths. These can then be transformed by small-$N$ FFTs.
  79 Typical programs include FFTs for small prime factors, such as 2, 3,
  80 5, \dots which are highly optimized. The small-$N$ FFT modules act as
  81 building blocks and can be multiplied together to make longer
  82 transforms. By combining a reasonable set of modules it is possible to
  83 compute FFTs of many different lengths. If the small-$N$ modules are
  84 supplemented by an $O(N^2)$ general-$N$ module then an FFT of any
  85 length can be computed, in principle. Of course, any lengths which
  86 contain large prime factors would perform only as $O(N^2)$.
  87
  88 Radix-2 algorithms, or ``power of two'' algorithms, are simplified
  89 versions of the mixed-radix algorithm. They are restricted to lengths
  90 which are a power of two. The basic radix-2 FFT module only involves
  91 addition and subtraction, so the algorithms are very simple. Radix-2
  92 algorithms have been the subject of much research into optimizing the
  93 FFT. Many of the most efficient radix-2 routines are based on the
  94 ``split-radix'' algorithm. This is actually a hybrid which combines
  95 the best parts of both radix-2 and radix-4 (``power of 4'')
  96 algorithms~\cite{sorenson86,duhamel86}.
  97
  98 The prime factor algorithm (PFA) is an alternative form of general-$N$
  99 algorithm based on a different way of recombining small-$N$ FFT
 100 modules~\cite{temperton83pfa,temperton85}. It has a very simple indexing
 101 scheme which makes it attractive. However it only works in the case
 102 where all factors are mutually prime. This requirement makes it more
 103 suitable as a specialized algorithm for given lengths.
 104
 105
 106 \begin{subsection}{FFTs of prime lengths}
 107 %
 108   Large prime lengths cannot be handled efficiently by any of these
 109 algorithms. However it may still possible to compute a DFT, by using
 110 results from number theory. Rader showed that it is possible to
 111 convert a length-$p$ FFT (where $p$ is prime) into a convolution of
 112 length-($p-1$).  There is a simple identity between the convolution of
 113 length $N$ and the FFT of the same length, so if $p-1$ is easily
 114 factorizable this allows the convolution to be computed efficiently
 115 via the FFT. The idea is presented in the original paper by
 116 Rader~\cite{raderprimes} (also reprinted in~\cite{mcclellan79}), but
 117 for more details see the theoretical books mentioned earlier.
 118 \end{subsection}
 119
 120 %\subsection{In-place algorithms vs algorithms using scratch space}
 121 %%
 122 %For simple algorithms, such as the Radix-2 algorithms, the FFT can be
 123 %performed in-place, without additional memory. For this to be possible
 124 %the index calculations must be simple enough that the calculation of
 125 %the result to be stored in index $k$ can be carried out at the same
 126 %time as the data in $k$ is used, so that no temporary storage is
 127 %needed.
 128
 129 %The mixed-radix algorithm is too complicated for this. It requires an
 130 %area of scratch space sufficient to hold intermediate copies of the
 131 %input data.
 132
 133 %\subsection{Self-sorting algorithms vs scrambling algorithms}
 134 %%
 135 %A self-sorting algorithm At each iteration of the FFT internal permutations are included
 136 %which leave the final iteration in its natural order.
 137
 138
 139 %The mixed-radix algorithm can be made self-sorting. The additional
 140 %scratch space helps here, although it is in fact possible to design
 141 %self-sorting algorithms which do not require scratch
 142 %space.
 143
 144
 145 %The in-place radix-2 algorithm is not self-sorting. The data is
 146 %permuted, and transformed into bit-reversed order. Note that
 147 %bit-reversal only refers to the order of the array, not the values for
 148 %each index which are of course not changed. More precisely, the data
 149 %stored in location $[b_n \dots b_2 b_1 b_0]$ is moved to location
 150 %$[b_0 b_1 b_2 \dots b_n]$, where $b_0 \dots b_n$ are the raw bits of a
 151 %given index. Placing the data in bit reversed order is easily done,
 152 %using right shifts to extract the least significant bits of the index
 153 %and left-shifts to feed them into a register for the bit-reversed
 154 %location. Simple radix-2 FFT usually recompute the bit reverse
 155 %ordering in the naive way for every call. For repeated calls it might
 156 %be worthwhile to precompute the permutation and cache it. There are
 157 %also more sophisticated algorithms which reduce the number of
 158 %operations needed to perform bit-reversal~\cite{bit-reversal}.
 159
 160
 161 %\begin{subsection}{Decimation-in-time (DIT) vs Decimation-in-Frequency (DIF)}
 162
 163 %\end{subsection}
 164
 165
 166 \subsection{Optimization}
 167 %
 168 There is no such thing as the single fastest FFT {\em algorithm}. FFT
 169 algorithms involve a mixture of floating point calculations, integer
 170 arithmetic and memory access. Each of these operations will have
 171 different relative speeds on different platforms.  The performance of
 172 an algorithm is a function of the hardware it is implemented on.  The
 173 goal of optimization is thus to choose the algorithm best suited to
 174 the characteristics of a given hardware platform.
 175
 176 For example, the Winograd Fourier Transform (WFTA) is an algorithm
 177 which is designed to reduce the number of floating point
 178 multiplications in the FFT. However, it does this at the expense of
 179 using many more additions and data transfers than other algorithms.
 180 As a consequence the WFTA might be a good candidate algorithm for
 181 machines where data transfers occupy a negligible time relative to
 182 floating point arithmetic. However on most modern machines, where the
 183 speed of data transfers is comparable to or slower than floating point
 184 operations, it would be outperformed by an algorithm which used a
 185 better mix of operations (i.e. more floating point operations but
 186 fewer data transfers).
 187
 188 For a study of this sort of effect in detail, comparing the different
 189 algorithms on different platforms consult the paper {\it Effects of
 190 Architecture Implementation on DFT Algorithm Performance} by Mehalic,
 191 Rustan and Route~\cite{mehalic85}. The paper was written in the early
 192 1980's and has data for super- and mini-computers which you are
 193 unlikely to see today, except in a museum. However, the methodology is
 194 still valid and it would be interesting to see similar results for
 195 present day computers.
 196
 197
 198 \section{FFT Concepts}
 199 %
 200 Factorization is the key principle of the mixed-radix FFT divide-and-conquer
 201 strategy. If $N$ can be factorized into a product of $n_f$ integers,
 202 %
 203 \begin{equation}
 204 N = f_1 f_2 ... f_{n_f} ,
 205 \end{equation}
 206 %
 207 then the FFT itself can be divided into smaller FFTs for each factor.
 208 More precisely, an FFT of length $N$ can be broken up into,
 209 %
 210 \begin{quote}
 211 \begin{tabular}{l}
 212 $(N/f_1)$ FFTs of length $f_1$, \\
 213 $(N/f_2)$ FFTs of length $f_2$, \\
 214 \dots \\
 215 $(N/f_{n_f})$ FFTs of length $f_{n_f}$.
 216 \end{tabular}
 217 \end{quote}
 218 %
 219 The total number of operations for these sub-operations will be $O(
 220 N(f_1 + f_2 + ... + f_{n_f}))$. When the factors of $N$ are all small
 221 integers this will be substantially less than $O(N^2)$. For example,
 222 when $N$ is a power of 2 an FFT of length $N=2^m$ can be reduced to $m
 223 N/2$ FFTs of length 2, or $O(N\log_2 N)$ operations.  Here is a
 224 demonstration which shows this:
 225
 226 We start with the full DFT,
 227 %
 228 \begin{eqnarray}
 229 h_a &=& \sum_{b=0}^{N-1} g_b W_N^{ab}       \qquad W_N=\exp(-2\pi i/N)
 230 \end{eqnarray}
 231 %
 232 and split the sum into even and odd terms,
 233 %
 234 \begin{eqnarray}
 235 \phantom{h_a}
 236 %   &=& \left(\sum_{b=0,2,4,\dots} + \sum_{b=1,3,5,\dots}\right) g_b W_N^{ab}\\
 237    &=& \sum_{b=0}^{N/2-1} g_{2b} W_N^{a(2b)} +
 238       \sum_{b=0}^{N/2-1} g_{2b+1} W_N^{a(2b+1)}.
 239 \end{eqnarray}
 240 %
 241 This converts the original DFT of length $N$ into two DFTs of length
 242 $N/2$,
 243 %
 244 \begin{equation}
 245 h_a = \sum_{b=0}^{N/2-1} g_{2b} W_{(N/2)}^{ab} +
 246       W_N^a \sum_{b=0}^{N/2-1} g_{2b+1} W_{(N/2)}^{ab}
 247 \end{equation}
 248 %
 249 The first term is a DFT of the even elements of $g$. The second term
 250 is a DFT of the odd elements of $g$, premultiplied by an exponential
 251 factor $W_N^k$ (known as a {\em twiddle factor}).
 252 %
 253 \begin{equation}
 254 \mathrm{DFT}(h)  =  \mathrm{DFT}(g_{even}) + W_N^k \mathrm{DFT}(g_{odd})
 255 \end{equation}
 256 %
 257 By splitting the DFT into its even and odd parts we have reduced the
 258 operation count from $N^2$ (for a DFT of length $N$) to $2 (N/2)^2$
 259 (for two DFTs of length $N/2$). The cost of the splitting is that we
 260 need an additional $O(N)$ operations to multiply by the twiddle factor
 261 $W_N^k$ and recombine the two sums.
 262
 263 We can repeat the splitting procedure recursively $\log_2 N$ times
 264 until the full DFT is reduced to DFTs of single terms. The DFT of a
 265 single value is just the identity operation, which costs nothing.
 266 However since $O(N)$ operations were needed at each stage to recombine
 267 the even and odd parts the total number of operations to obtain the
 268 full DFT is $O(N \log_2 N)$. If we had used a length which was a
 269 product of factors $f_1$, $f_2$, $\dots$ we could have split the sum
 270 in a similar way. First we would split terms corresponding to the
 271 factor $f_1$, instead of the even and odd terms corresponding to a
 272 factor of two.  Then we would repeat this procedure for the subsequent
 273 factors. This would lead to a final operation count of $O(N \sum
 274 f_i)$.
 275
 276 This procedure gives some motivation for why the number of operations
 277 in a DFT can in principle be reduced from $O(N^2)$ to $O(N \sum f_i)$.
 278 It does not give a good explanation of how to implement the algorithm
 279 in practice which is what we shall do in the next section.
 280
 281 \section{Radix-2 Algorithms}
 282 %
 283 For radix-2 FFTs it is natural to write array indices in binary form
 284 because the length of the data is a power of two. This is nicely
 285 explained in the article {\em The FFT: Fourier Transforming One Bit at
 286 a Time} by P.B. Visscher~\cite{visscher96}. A binary representation
 287 for indices is the key to deriving the simplest efficient radix-2
 288 algorithms.
 289
 290 We can write an index $b$ ($0 \leq b < 2^{n-1}$) in binary
 291 representation like this,
 292 %
 293 \begin{equation}
 294 b = [b_{n-1} \dots b_1 b_0] = 2^{n-1}b_{n-1} + \dots + 2 b_1 + b_0 .
 295 \end{equation}
 296 %
 297 Each of the $b_0, b_1, \dots, b_{n-1}$ are the bits (either 0 or 1) of
 298 $b$.
 299
 300 Using this notation the original definition of the DFT
 301 can be rewritten as a sum over the bits of $b$,
 302 %
 303 \begin{equation}
 304 h(a) = \sum_{b=0}^{N-1} g_b \exp(-2\pi i a b /N)
 305 \end{equation}
 306 %
 307 to give an equivalent summation like this,
 308 %
 309 \begin{equation}
 310 h([a_{n-1} \dots a_1 a_0]) =
 311 \sum_{b_0=0}^{1}
 312 \sum_{b_1=0}^{1}
 313 \dots
 314 \sum_{b_{n-1}=0}^{1}
 315  g([b_{n-1} \dots b_1 b_0]) W_N^{ab}
 316 \end{equation}
 317 %
 318 where the bits of $a$ are $a=[a_{n-1} \dots a_1 a_0]$.
 319
 320 To reduce the number of operations in the sum we will use the
 321 periodicity of the exponential term,
 322 %
 323 \begin{equation}
 324 W_N^{x+N} = W_N^{x}.
 325 \end{equation}
 326 %
 327 Most of the products $ab$ in $W_N^{ab}$ are greater than $N$. By
 328 making use of this periodicity they can all be collapsed down into the
 329 range $0\dots N-1$. This allows us to reduce the number of operations
 330 by combining common terms, modulo $N$. Using this idea we can derive
 331 decimation-in-time or decimation-in-frequency algorithms, depending on
 332 how we break the DFT summation down into common terms. We'll first
 333 consider the decimation-in-time algorithm.
 334
 335 \subsection{Radix-2 Decimation-in-Time (DIT)}
 336 %
 337 To derive the the decimation-in-time algorithm we start by separating
 338 out the most significant bit of the index $b$,
 339 %
 340 \begin{equation}
 341 [b_{n-1} \dots b_1 b_0] = 2^{n-1}b_{n-1} + [b_{n-2} \dots b_1 b_0]
 342 \end{equation}
 343 %
 344 Now we can evaluate the innermost sum of the DFT without any
 345 dependence on the remaining bits of $b$ in the exponential,
 346 %
 347 \begin{eqnarray}
 348 h([a_{n-1} \dots a_1 a_0]) &=&
 349 \sum_{b_0=0}^{1}
 350 \sum_{b_1=0}^{1}
 351 \dots
 352 \sum_{b_{n-1}=0}^{1}
 353  g(b)
 354 W_N^{a(2^{n-1}b_{n-1}+[b_{n-2} \dots b_1 b_0])} \\
 355  &=&
 356 \sum_{b_0=0}^{1}
 357 \sum_{b_1=0}^{1}
 358 \dots
 359 \sum_{b_{n-2}=0}^{1}
 360 W_N^{a[b_{n-2} \dots b_1 b_0]}
 361 \sum_{b_{n-1}=0}^{1}
 362  g(b)
 363 W_N^{a(2^{n-1}b_{n-1})}
 364 \end{eqnarray}
 365 %
 366 Looking at the term $W_N^{a(2^{n-1}b_{n-1})}$ we see that we can also
 367 remove most of the dependence on $a$ as well, by using the periodicity of the
 368 exponential,
 369 %
 370 \begin{eqnarray}
 371 W_N^{a(2^{n-1}b_{n-1})} &=&
 372 \exp(-2\pi i [a_{n-1} \dots a_1 a_0] 2^{n-1} b_{n-1}/ 2^n )\\
 373 &=& \exp(-2\pi i [a_{n-1} \dots a_1 a_0] b_{n-1}/ 2 )\\
 374 &=& \exp(-2\pi i ( 2^{n-2}a_{n-1} + \dots + a_1 + (a_0/2)) b_{n-1} )\\
 375 &=& \exp(-2\pi i a_0 b_{n-1}/2 ) \\
 376 &=& W_2^{a_0 b_{n-1}}
 377 \end{eqnarray}
 378 %
 379 Thus the innermost exponential term simplifies so that it only
 380 involves the highest order bit of $b$ and the lowest order bit of $a$,
 381 and the sum can be reduced to,
 382 %
 383 \begin{equation}
 384 h([a_{n-1} \dots a_1 a_0])
 385 =
 386 \sum_{b_0=0}^{1}
 387 \sum_{b_1=0}^{1}
 388 \dots
 389 \sum_{b_{n-2}=0}^{1}
 390 W_N^{a[b_{n-2} \dots b_1 b_0]}
 391 \sum_{b_{n-1}=0}^{1}
 392  g(b)
 393 W_2^{a_0 b_{n-1}}.
 394 \end{equation}
 395 %
 396 We can repeat this this procedure for the next most significant bit of
 397 $b$, $b_{n-2}$, using a similar identity,
 398 %
 399 \begin{eqnarray}
 400 W_N^{a(2^{n-2}b_{n-2})}
 401 &=& \exp(-2\pi i [a_{n-1} \dots a_1 a_0] 2^{n-2} b_{n-2}/ 2^n )\\
 402 &=& W_4^{ [a_1 a_0] b_{n-2}}.
 403 \end{eqnarray}
 404 %
 405 to give a formula with even less dependence on the bits of $a$,
 406 %
 407 \begin{equation}
 408 h([a_{n-1} \dots a_1 a_0])
 409 =
 410 \sum_{b_0=0}^{1}
 411 \sum_{b_1=0}^{1}
 412 \dots
 413 \sum_{b_{n-3}=0}^{1}
 414 W_N^{a[b_{n-3} \dots b_1 b_0]}
 415 \sum_{b_{n-2}=0}^{1}
 416 W_4^{[a_1 a_0] b_{n-2}}
 417 \sum_{b_{n-1}=0}^{1}
 418  g(b)
 419 W_2^{a_0 b_{n-1}}.
 420 \end{equation}
 421 %
 422 If we repeat the process for all the remaining bits we obtain a
 423 simplified DFT formula which is the basis of the radix-2
 424 decimation-in-time algorithm,
 425 %
 426 \begin{eqnarray}
 427 h([a_{n-1} \dots a_1 a_0]) &=&
 428 \sum_{b_0=0}^{1}
 429 W_{N}^{[a_{n-1} \dots a_1 a_0]b_0}
 430 %\sum_{b_1=0}^{1}
 431 %W_{N/2}^{[a_{n-1} \dots a_1 a_0]b_1}
 432 \dots
 433 \sum_{b_{n-2}=0}^{1}
 434 W_4^{ [a_1 a_0] b_{n-2}}
 435 \sum_{b_{n-1}=0}^{1}
 436 W_2^{a_0 b_{n-1}}
 437 g(b)
 438 \end{eqnarray}
 439 %
 440 To convert the formula to an algorithm we expand out the sum
 441 recursively, evaluating each of the intermediate summations, which we
 442 denote by $g_1$, $g_2$, \dots, $g_n$,
 443 %
 444 \begin{eqnarray}
 445 g_1(a_0,  b_{n-2}, b_{n-3}, \dots, b_1, b_0)
 446 &=&
 447 \sum_{b_{n-1}=0}^{1}
 448 W_2^{a_0 b_{n-1}}
 449 g([b_{n-1}  b_{n-2}  b_{n-3}  \dots  b_1  b_0]) \\
 450 g_2(a_0, {}_{\phantom{-2}} a_{1}, b_{n-3}, \dots, b_1, b_0)
 451 &=&
 452 \sum_{b_{n-2}=0}^{1}
 453 W_4^{[a_1 a_0] b_{n-2}}
 454 g_1(a_0, b_{n-2}, b_{n-3}, \dots, b_1, b_0) \\
 455 g_3(a_0, {}_{\phantom{-2}} a_{1}, {}_{\phantom{-3}} a_{2}, \dots, b_1, b_0)
 456 &=&
 457 \sum_{b_{n-3}=0}^{1}
 458 W_8^{[a_2 a_1 a_0] b_{n-2}}
 459 g_2(a_0, a_1, b_{n-3}, \dots, b_1, b_0) \\
 460 \dots &=& \dots \\
 461 g_n(a_0, a_{1}, a_{2}, \dots, a_{n-2}, a_{n-1})
 462 &=&
 463 \sum_{b_{0}=0}^{1}
 464 W_N^{[a_{n-1} \dots a_1 a_0]b_0}
 465 g_{n-1}(a_0, a_1, a_2, \dots, a_{n-2}, b_0)
 466 \end{eqnarray}
 467 %
 468 After the final sum, we can obtain the transform $h$ from $g_n$,
 469 %
 470 \begin{equation}
 471 h([a_{n-1} \dots a_1 a_0])
 472 =
 473 g_n(a_0, a_1, \dots, a_{n-1})
 474 \end{equation}
 475 %
 476 Note that we left the storage arrangements of the intermediate sums
 477 unspecified by using the bits as function arguments and not as an
 478 index. The storage of intermediate sums is different for the
 479 decimation-in-time and decimation-in-frequency algorithms.
 480
 481 Before deciding on the best storage scheme we'll show that the results
 482 of each stage, $g_1$, $g_2$, \dots, can be carried out {\em
 483 in-place}. For example, in the case of $g_1$, the inputs are,
 484 %
 485 \begin{equation}
 486 g([\underline{b_{n-1}} b_{n-2} b_{n-3} \dots b_1 b_0])
 487 \end{equation}
 488 %
 489 for $b_{n-1}=(0,1)$, and the corresponding outputs are,
 490 %
 491 \begin{equation}
 492 g_1(\underline{a_0},b_{n-2}, b_{n-3}, \dots, b_1, b_0)
 493 \end{equation}
 494 %
 495 for $a_0=(0,1)$.  It's clear that if we hold $b_{n-2}, b_{n-3}, \dots,
 496 b_1, b_0$ fixed and compute the sum over $b_{n-1}$ in memory for both
 497 values of $a_0 = 0,1$ then we can store the result for $a_0=0$ in the
 498 location which originally had $b_0=0$ and the result for $a_0=1$ in
 499 the location which originally had $b_0=1$. The two inputs and two
 500 outputs are known as {\em dual node pairs}. At each stage of the
 501 calculation the sums for each dual node pair are independent of the
 502 others. It is this property which allows an in-place calculation.
 503
 504 So for an in-place pass our storage has to be arranged so that the two
 505 outputs $g_1(a_0,\dots)$ overwrite the two input terms
 506 $g([b_{n-1},\dots])$. Note that the order of $a$ is reversed from the
 507 natural order of $b$. i.e. the least significant bit of $a$
 508 replaces the most significant bit of $b$. This is inconvenient
 509 because $a$ occurs in its natural order in all the exponentials,
 510 $W^{ab}$. We could keep track of both $a$ and its bit-reverse,
 511 $a^{\mathit bit-reversed}$ at all times but there is a neat trick
 512 which avoids this: if we bit-reverse the order of the input data $g$
 513 before we start the calculation we can also bit-reverse the order of
 514 $a$ when storing intermediate results. Since the storage involving $a$
 515 was originally in bit-reversed order the switch in the input $g$ now
 516 allows us to use normal ordered storage for $a$, the same ordering
 517 that occurs in the exponential factors.
 518
 519 This is complicated to explain, so here is an example of the 4 passes
 520 needed for an $N=16$ decimation-in-time FFT, with the initial data
 521 stored in bit-reversed order,
 522 %
 523 \begin{eqnarray}
 524 g_1([b_0 b_1 b_2 a_0])
 525 &=&
 526 \sum_{b_3=0}^{1} W_2^{a_0 b_3} g([b_0 b_1 b_2 b_3])
 527 \\
 528 g_2([b_0 b_1 a_1 a_0])
 529 &=&
 530 \sum_{b_2=0}^{1} W_4^{[a_1 a_0] b_2} g_1([b_0 b_1 b_2 a_0])
 531 \\
 532 g_3([b_0 a_2 a_1 a_0])
 533 &=&
 534 \sum_{b_1=0}^{1} W_8^{[a_2 a_1 a_0] b_1} g_2([b_0 b_1 a_1 a_0])
 535 \\
 536 h(a) = g_4([a_3 a_2 a_1 a_0])
 537 &=&
 538 \sum_{b_0=0}^{1} W_{16}^{[a_3 a_2 a_1 a_0] b_0} g_3([b_0 a_2 a_1 a_0])
 539 \end{eqnarray}
 540 %
 541 We compensate for the bit reversal of the input data by accessing $g$
 542 with the bit-reversed form of $b$ in the first stage. This ensures
 543 that we are still carrying out the same calculation, using the same
 544 data, and not accessing different values. Only single bits of $b$ ever
 545 occur in the exponential so we never need the bit-reversed form of
 546 $b$.
 547
 548 Let's examine the third pass in detail,
 549 %
 550 \begin{equation}
 551 g_3([b_0 a_2 a_1 a_0])
 552 =
 553 \sum_{b_1=0}^{1} W_8^{[a_2 a_1 a_0] b_1} g_2([b_0 b_1 a_1 a_0])
 554 \end{equation}
 555 %
 556 First note that only one bit, $b_1$, varies in each summation.  The
 557 other bits of $b$ ($b_0$) and of $a$ ($a_1 a_0$) are essentially
 558 ``spectators'' -- we must loop over all combinations of these bits and
 559 carry out the same basic calculation for each, remembering to update
 560 the exponentials involving $W_8$ appropriately.  If we are storing the
 561 results in-place (with $g_3$ overwriting $g_2$ we will need to compute
 562 the sums involving $b_1=0,1$ and $a_2=0,1$ simultaneously.
 563 %
 564 \begin{equation}
 565 \left(
 566 \begin{array}{c}
 567 g_3([b_0 0 a_1 a_0]) \vphantom{W_8^{[]}} \\
 568 g_3([b_0 1 a_1 a_0]) \vphantom{W_8^{[]}}
 569 \end{array}
 570 \right)
 571 =
 572 \left(
 573 \begin{array}{c}
 574 g_2([b_0 0 a_1 a_0]) + W_8^{[0 a_1 a_0]} g_2([b_2 1 a_1 a_0]) \\
 575 g_2([b_0 0 a_1 a_0]) + W_8^{[1 a_1 a_0]} g_2([b_2 1 a_1 a_0])
 576 \end{array}
 577 \right)
 578 \end{equation}
 579 %
 580 We can write this in a more symmetric form by simplifying the exponential,
 581 %
 582 \begin{equation}
 583 W_8^{[a_2 a_1 a_0]}
 584 = W_8^{4 a_2 + [a_1 a_0]}
 585 = (-1)^{a_2} W_8^{[a_1 a_0]}
 586 \end{equation}
 587 %
 588 \begin{equation}
 589 \left(
 590 \begin{array}{c}
 591 g_3([b_0 0 a_1 a_0]) \vphantom{W_8^{[]}} \\
 592 g_3([b_0 1 a_1 a_0]) \vphantom{W_8^{[]}}
 593 \end{array}
 594 \right)
 595 =
 596 \left(
 597 \begin{array}{c}
 598 g_2([b_0 0 a_1 a_0]) + W_8^{[a_1 a_0]} g_2([b_2 1 a_1 a_0]) \\
 599 g_2([b_0 0 a_1 a_0]) - W_8^{[a_1 a_0]} g_2([b_2 1 a_1 a_0])
 600 \end{array}
 601 \right)
 602 \end{equation}
 603 %
 604 The exponentials $W_8^{[a_1 a_0]}$ are referred to as {\em twiddle
 605 factors}. The form of this calculation, a symmetrical sum and
 606 difference involving a twiddle factor is called {\em a butterfly}.
 607 It is often shown diagrammatically, and in the case $b_0=a_0=a_1=0$
 608 would be drawn like this,
 609 %
 610 \begin{center}
 611 \setlength{\unitlength}{1cm}
 612 \begin{picture}(9,4)
 613 %
 614 %\put(0,0){\line(1,0){8}}
 615 %\put(0,0){\line(0,1){4}}
 616 %\put(8,4){\line(0,-1){4}}
 617 %\put(8,4){\line(-1,0){8}}
 618 %
 619 \put(0,1){$g_2(4)$} \put(4.5,1){$g_3(4)=g_2(0) - W^a_8 g_2(4)$}
 620 \put(0,3){$g_2(0)$} \put(4.5,3){$g_3(0)=g_2(0) + W^a_8 g_2(4)$}
 621 \put(1,1){\vector(1,0){0.5}}
 622 \put(1.5,1){\line(1,0){0.5}}
 623 \put(1.5,0.5){$W^a_8$}
 624 \put(1,3){\vector(1,0){0.5}}\put(1.5,3){\line(1,0){0.5}}
 625 \put(2,1){\circle*{0.1}}
 626 \put(2,3){\circle*{0.1}}
 627 \put(2,1){\vector(1,1){2}}
 628 \put(2,1){\vector(1,0){1}}
 629 \put(3,1){\line(1,0){1}}
 630 \put(3,0.5){$-1$}
 631 \put(2,3){\vector(1,-1){2}}
 632 \put(2,3){\vector(1,0){1}}
 633 \put(3,3){\line(1,0){1}}
 634 \put(4,1){\circle*{0.1}}
 635 \put(4,3){\circle*{0.1}}
 636 \end{picture}
 637 \end{center}
 638 %
 639 The inputs are shown on the left and the outputs on the right. The
 640 outputs are computed by multiplying the incoming lines by their
 641 accompanying factors (shown next to the lines) and summing the results
 642 at each node.
 643
 644 In general, denoting the bit for dual-node pairs by $\Delta$ and the
 645 remaining bits of $a$ and $b$ by ${\hat a}$ and ${\hat b}$, the
 646 butterfly is,
 647 %
 648 \begin{equation}
 649 \left(
 650 \begin{array}{c}
 651 g({\hat b} + {\hat a}) \\
 652 g({\hat b} + \Delta + {\hat a}) \\
 653 \end{array}
 654 \right)
 655 \leftarrow
 656 \left(
 657 \begin{array}{c}
 658 g({\hat b} + {\hat a}) + W_{2\Delta}^{\hat a} g({\hat b} + \Delta + {\hat a})\\
 659 g({\hat b} + {\hat a}) - W_{2\Delta}^{\hat a} g({\hat b} + \Delta + {\hat a})
 660 \end{array}
 661 \right)
 662 \end{equation}
 663 %
 664 where ${\hat a}$ runs from $0 \dots \Delta-1$ and ${\hat b}$ runs
 665 through $0 \times 2\Delta$, $1\times 2\Delta$, $\dots$, $(N/\Delta -
 666 1)2\Delta$. The value of $\Delta$ is 1 on the first pass, 2 on the
 667 second pass and $2^{n-1}$ on the $n$-th pass.  Each pass requires
 668 $N/2$ in-place computations, each involving two input locations and
 669 two output locations.
 670
 671 In the example above $\Delta = [100] = 4$, ${\hat a} = [a_1 a_0]$ and
 672 ${\hat b} = [b_0 0 0 0]$.
 673
 674 This leads to the canonical radix-2 decimation-in-time FFT algorithm
 675 for $2^n$ data points stored in the array $g(0) \dots g(2^n-1)$.
 676 %
 677 \begin{algorithm}
 678 \STATE bit-reverse ordering of $g$
 679 \STATE {$\Delta \Leftarrow 1$}
 680 \FOR {$\mbox{pass} = 1 \dots n$}
 681   \STATE {$W \Leftarrow \exp(-2 \pi i / 2\Delta)$}
 682   \FOR {$(a = 0 ; a < \Delta ; a{++})$}
 683     \FOR{$(b = 0 ; b < N ; b {+=} 2*\Delta)$}
 684         \STATE{$t_0 \Leftarrow g(b+a) + W^a g(b+\Delta+a)$}
 685         \STATE{$t_1 \Leftarrow g(b+a) - W^a g(b+\Delta+a)$}
 686         \STATE{$g(b+a) \Leftarrow t_0$}
 687         \STATE{$g(b+\Delta+a) \Leftarrow t_1$}
 688     \ENDFOR
 689   \ENDFOR
 690   \STATE{$\Delta \Leftarrow 2\Delta$}
 691 \ENDFOR
 692 \end{algorithm}
 693 %
 694 %This algorithm appears in Numerical Recipes as the routine {\tt
 695 %four1}, where the implementation is attributed to N.M. Brenner.
 696 %
 697 \subsection{Details of the Implementation}
 698 It is straightforward to implement a simple radix-2 decimation-in-time
 699 routine from the algorithm above. Some optimizations can be made by
 700 pulling the special case of $a=0$ out of the loop over $a$, to avoid
 701 unnecessary multiplications when $W^a=1$,
 702 %
 703 \begin{algorithm}
 704     \FOR{$(b = 0 ; b < N ; b {+=} 2*\Delta)$}
 705         \STATE{$t_0 \Leftarrow g(b) + g(b+\Delta)$}
 706         \STATE{$t_1 \Leftarrow g(b) - g(b+\Delta)$}
 707         \STATE{$g(b) \Leftarrow t_0$}
 708         \STATE{$g(b+\Delta) \Leftarrow t_1$}
 709     \ENDFOR
 710 \end{algorithm}
 711 %
 712 There are several algorithms for doing fast bit-reversal. We use the
 713 Gold-Rader algorithm, which is simple and does not require any working
 714 space,
 715 %
 716 \begin{algorithm}
 717 \FOR{$i = 0 \dots n - 2$}
 718         \STATE {$ k = n / 2 $}
 719         \IF {$i < j$}
 720                 \STATE {swap $g(i)$ and $g(j)$}
 721         \ENDIF
 722
 723         \WHILE {$k \leq j$}
 724                 \STATE{$j \Leftarrow j - k$}
 725                 \STATE{$k \Leftarrow k / 2$}
 726         \ENDWHILE
 727
 728       \STATE{$j \Leftarrow j + k$}
 729 \ENDFOR
 730 \end{algorithm}
 731 %
 732 The Gold-Rader algorithm is typically twice as fast as a naive
 733 bit-reversal algorithm (where the bit reversal is carried out by
 734 left-shifts and right-shifts on the index).  The library also has a
 735 routine for the Rodriguez bit reversal algorithm, which also does not
 736 require any working space~\cite{rodriguez89}. There are faster bit
 737 reversal algorithms, but they all use additional scratch
 738 space~\cite{rosel89}.
 739
 740 Within the loop for $a$ we can compute $W^a$  using a trigonometric
 741 recursion relation,
 742 %
 743 \begin{eqnarray}
 744 W^{a+1} &=& W W^a \\
 745         &=& (\cos(2\pi/2\Delta) + i \sin(2\pi/2\Delta)) W^a
 746 \end{eqnarray}
 747 %
 748 This requires only $2 \log_2 N$ trigonometric calls, to compute the
 749 initial values of $\exp(2\pi i /2\Delta)$ for each pass.
 750
 751 \subsection{Radix-2 Decimation-in-Frequency (DIF)}
 752 %
 753 To derive the decimation-in-frequency algorithm we start by separating
 754 out the lowest order bit of the index $a$. Here is an example for the
 755 decimation-in-frequency $N=16$ DFT.
 756 %
 757 \begin{eqnarray}
 758 W_{16}^{[a_3 a_2 a_1 a_0][b_3 b_2 b_1 b_0]}
 759 &=&
 760 W_{16}^{[a_3 a_2 a_1 a_0][b_2 b_1 b_0]} W_{16}^{[a_3 a_2 a_1 a_0] [b_3
 761 0 0 0]} \\
 762 &=&
 763 W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]} W_{16}^{a_0 [b_2 b_1 b_0]} W_2^{a_0
 764 b_3} \\
 765 &=&
 766 W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]} W_{16}^{a_0 [b_2 b_1 b_0]} (-1)^{a_0 b_3}
 767 \end{eqnarray}
 768 %
 769 By repeating the same type of the expansion on the term,
 770 %
 771 \begin{equation}
 772 W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]}
 773 \end{equation}
 774 %
 775 we can reduce the transform to an alternative simple form,
 776 %
 777 \begin{equation}
 778 h(a) =
 779 \sum_{b_0=0}^1 (-1)^{a_3 b_0} W_4^{a_2 b_0}
 780 \sum_{b_1=0}^1 (-1)^{a_2 b_1} W_8^{a_1 [b_1 b_0]}
 781 \sum_{b_2=0}^1 (-1)^{a_1 b_2} W_{16}^{a_0 [b_2 b_1 b_0]}
 782 \sum_{b_3=0}^1 (-1)^{a_0 b_3} g(b)
 783 \end{equation}
 784 %
 785 To implement this we can again write the sum recursively. In this case
 786 we do not have the problem of the order of $a$ being bit reversed --
 787 the calculation can be done in-place using the natural ordering of
 788 $a$ and $b$,
 789 %
 790 \begin{eqnarray}
 791 g_1([a_0 b_2 b_1 b_0])
 792 &=&
 793 W_{16}^{a_0 [b_2 b_1 b_0]}
 794 \sum_{b_3=0}^1 (-1)^{a_0 b_3} g([b_3 b_2 b_1 b_0]) \\
 795 g_2([a_0 a_1 b_1 b_0])
 796 &=&
 797 W_{8}^{a_1 [b_1 b_0]}
 798 \sum_{b_2=0}^1 (-1)^{a_1 b_2} g_1([a_0 b_2 b_1 b_0]) \\
 799 g_3([a_0 a_1 a_2 b_0])
 800 &=&
 801 W_{4}^{a_2 b_0}
 802 \sum_{b_1=0}^1 (-1)^{a_2 b_1} g_2([a_0 a_1 b_1 b_0]) \\
 803 h(a)
 804 =
 805 g_4([a_0 a_1 a_2 a_3])
 806 &=&
 807 \sum_{b_0=0}^1 (-1)^{a_3 b_0} g_3([a_0 a_1 a_2 b_0])
 808 \end{eqnarray}
 809 %
 810 The final pass leaves the data for $h(a)$ in bit-reversed order, but
 811 this is easily fixed by a final bit-reversal of the ordering.
 812
 813 The basic in-place calculation or butterfly for each pass is slightly
 814 different from the decimation-in-time version,
 815 %
 816 \begin{equation}
 817 \left(
 818 \begin{array}{c}
 819 g({\hat a} + {\hat b}) \\
 820 g({\hat a} + \Delta + {\hat b}) \\
 821 \end{array}
 822 \right)
 823 \leftarrow
 824 \left(
 825 \begin{array}{c}
 826 g({\hat a} + {\hat b}) +  g({\hat a} + \Delta + {\hat b})\\
 827 W_{\Delta}^{\hat b}
 828 \left( g({\hat a} + {\hat b}) -  g({\hat a} + \Delta + {\hat b}) \right)
 829 \end{array}
 830 \right)
 831 \end{equation}
 832 %
 833 In each pass ${\hat b}$ runs from $0 \dots \Delta-1$ and ${\hat
 834 a}$ runs from $0, 2\Delta, \dots, (N/\Delta -1) \Delta$. On the first
 835 pass we start with $\Delta=16$, and on subsequent passes $\Delta$ takes
 836 the values $8, 4, \dots, 1$.
 837
 838 This leads to the canonical radix-2 decimation-in-frequency FFT
 839 algorithm for $2^n$ data points stored in the array $g(0) \dots
 840 g(2^n-1)$.
 841 %
 842 \begin{algorithm}
 843 \STATE {$\Delta \Leftarrow 2^{n-1}$}
 844 \FOR {$\mbox{pass} = 1 \dots n$}
 845   \STATE {$W \Leftarrow \exp(-2 \pi i / 2\Delta)$}
 846   \FOR {$(b = 0 ; b < \Delta ; b++)$}
 847     \FOR{$(a = 0 ; a < N ; a += 2*\Delta)$}
 848         \STATE{$t_0 \Leftarrow g(b+a) + g(a+\Delta+b)$}
 849         \STATE{$t_1 \Leftarrow W^b \left( g(a+b) - g(a+\Delta+b) \right)$}
 850         \STATE{$g(a+b) \Leftarrow t_0$}
 851         \STATE{$g(a+\Delta+b) \Leftarrow t_1$}
 852     \ENDFOR
 853   \ENDFOR
 854   \STATE{$\Delta \Leftarrow \Delta/2$}
 855 \ENDFOR
 856 \STATE bit-reverse ordering of $g$
 857 \end{algorithm}
 858 %
 859
 860 \section{Self-Sorting Mixed-Radix Complex FFTs}
 861 %
 862 This section is based on the review article {\em Self-sorting
 863 Mixed-Radix Fast Fourier Transforms} by Clive
 864 Temperton~\cite{temperton83}. You should consult his article for full
 865 details of all the possible algorithms (there are many
 866 variations). Here I have annotated the derivation of the simplest
 867 mixed-radix decimation-in-frequency algorithm.
 868
 869 For general-$N$ FFT algorithms the simple binary-notation of radix-2
 870 algorithms is no longer useful.  The mixed-radix FFT has to be built
 871 up using products of matrices acting on a data vector.  The aim is to
 872 take the full DFT matrix $W_N$ and factor it into a set of small,
 873 sparse matrices corresponding to each factor of $N$.
 874
 875
 876 We'll denote the components of matrices using either subscripts or
 877 function notation,
 878 %
 879 \begin{equation}
 880 M_{ij} = M(i,j)
 881 \end{equation}
 882 %
 883 with (C-like) indices running from 0 to $N-1$.  Matrix products will be
 884 denoted using square brackets,
 885 %
 886 \begin{equation}
 887 [AB]_{ij} = \sum_{k} A_{ik} B_{kj}
 888 \end{equation}
 889 %
 890 %
 891 Three special matrices will be needed in the mixed-radix factorization
 892 of the DFT: the identity matrix, $I$, a permutation matrix, $P$ and a
 893 matrix of twiddle factors, $D$, as well as the normal DFT matrices
 894 $W_n$.
 895
 896 We write the identity matrix of order $r$ as $I_r(n,m)$,
 897 %
 898 \begin{equation}
 899 I_r(n,m) = \delta_{nm}
 900 \end{equation}
 901 %
 902 for $0 \leq n,m \leq r-1$.
 903
 904 We also need to define a permutation matrix $P^a_b$ that performs
 905 digit reversal of the ordering of a vector. If the index of a vector
 906 $j= 0\dots N-1$ is factorized into $j = la +m$, with $0 \leq l \leq
 907 b-1$ and $0 \leq m \leq a-1$ then the operation of the matrix $P$ will
 908 exchange positions $la+m$ and $mb+l$ in the vector (this sort of
 909 digit-reversal is the generalization of bit-reversal to a number
 910 system with exponents $a$ and $b$).
 911
 912 In mathematical terms $P$ is a square matrix of size $ab \times ab$
 913 with the property,
 914 %
 915 \begin{eqnarray}
 916 P^a_b(j,k) &=& 1 ~\mbox{if}~ j=ra+s ~\mbox{and}~ k=sb+r \\
 917            &=& 0 ~\mbox{otherwise}
 918 \end{eqnarray}
 919 %
 920
 921 Finally the FFT algorithm needs a matrix of twiddle factors, $D^a_b$,
 922 for the trigonometric sums. $D^a_b$ is a diagonal square matrix of
 923 size $ab \times ab$ with the definition,
 924 %
 925 \begin{eqnarray}
 926 D^a_b(j,k) &=& \omega^{sr}_{ab} ~\mbox{if}~ j=k=sb+r \\
 927            &=& 0 ~\mbox{otherwise}
 928 \end{eqnarray}
 929 %
 930 where $\omega_{ab} = e^{-2\pi i/ab}$.
 931
 932
 933 \subsection{The Kronecker Product}
 934 The Kronecker matrix product plays an important role in all the
 935 algorithms for combining operations on different subspaces. The
 936 Kronecker product $A \otimes B$ of two square matrices $A$ and $B$, of
 937 sizes $a \times a$ and $b \times b$ respectively, is a square matrix
 938 of size $a b \times a b$, defined as,
 939 %
 940 \begin{equation}
 941 [A \otimes B] (tb+u, rb+s) = A(t,r) B(u,s)
 942 \end{equation}
 943 %
 944 where $0 \leq u,s < b$ and $0 \leq t,r < a$.  Let's examine a specific
 945 example. If we take a $2 \times 2$ matrix and a $3
 946 \times 3$ matrix,
 947 %
 948 \begin{equation}
 949 \begin{array}{ll}
 950 A =
 951 \left(
 952 \begin{array}{cc}
 953 a_{11} & a_{12} \\
 954 a_{21} & a_{22}
 955 \end{array}
 956 \right)
 957 &
 958 B =
 959 \left(
 960 \begin{array}{ccc}
 961 b_{11} & b_{12} & b_{13} \\
 962 b_{21} & b_{22} & b_{23} \\
 963 b_{31} & b_{32} & b_{33}
 964 \end{array}
 965 \right)
 966 \end{array}
 967 \end{equation}
 968 %
 969 then the Kronecker product $A \otimes B$ is,
 970 %
 971 \begin{eqnarray}
 972 A \otimes B &= &
 973 \left(
 974 \begin{array}{cc}
 975 a_{11} B & a_{12} B \\
 976 a_{12} B & a_{22} B
 977 \end{array}
 978 \right) \\
 979  &=&
 980 \left(
 981 \begin{array}{cccccc}
 982 a_{11} b_{11} & a_{11} b_{12} & a_{11} b_{13} &
 983   a_{12} b_{11} & a_{12} b_{12} & a_{12} b_{13} \\
 984 a_{11} b_{21} & a_{11} b_{22} & a_{11} b_{23} &
 985   a_{12} b_{21} & a_{12} b_{22} & a_{12} b_{23} \\
 986 a_{11} b_{31} & a_{11} b_{32} & a_{11} b_{33} &
 987   a_{12} b_{31} & a_{12} b_{32} & a_{12} b_{33} \\
 988 a_{21} b_{11} & a_{21} b_{12} & a_{21} b_{13} &
 989   a_{22} b_{11} & a_{22} b_{12} & a_{22} b_{13} \\
 990 a_{21} b_{21} & a_{21} b_{22} & a_{21} b_{23} &
 991   a_{22} b_{21} & a_{22} b_{22} & a_{22} b_{23} \\
 992 a_{21} b_{31} & a_{21} b_{32} & a_{21} b_{33} &
 993   a_{22} b_{31} & a_{22} b_{32} & a_{22} b_{33}
 994 \end{array}
 995 \right)
 996 \end{eqnarray}
 997 %
 998 When the Kronecker product $A \otimes B$ acts on a vector of length
 999 $ab$, each matrix operates on a different subspace of the vector.
1000 Writing the index $i$ as $i=t b + u$, with $0\leq u \leq b-1$
1001 and $0\leq t\leq a$, we can see this explicitly by looking at components,
1002 %
1003 \begin{eqnarray}
1004 [(A \otimes B) v]_{(tb+u)}
1005 & = & \sum_{t'=0}^{a-1} \sum_{u'=0}^{b-1}
1006         [A \otimes B]_{(tb+u,t'b+u')} v_{t'b+u'} \\
1007 & = & \sum_{t'u'} A_{tt'} B_{uu'} v_{t'b+u'}
1008 \end{eqnarray}
1009 %
1010 The matrix $B$ operates on the ``index'' $u'$, for all values of $t'$, and
1011 the matrix $A$ operates on the ``index'' $t'$, for all values of $u'$.
1012 %
1013 The most important property needed for deriving the FFT factorization
1014 is that the matrix product of two Kronecker products is the Kronecker
1015 product of the two matrix products,
1016 %
1017 \begin{equation}
1018 (A \otimes B)(C \otimes D) = (AC \otimes BD)
1019 \end{equation}
1020 %
1021 This follows straightforwardly from the original definition of the
1022 Kronecker product.
1023
1024 \subsection{Two factor case, $N=ab$}
1025 %
1026 First consider the simplest possibility, where the data length $N$ can
1027 be divided into two factors, $N=ab$. The aim is to reduce the DFT
1028 matrix $W_N$ into simpler matrices corresponding to each factor. To
1029 make the derivation easier we will start from the known factorization
1030 and verify it (the initial factorization can be guessed by
1031 generalizing from simple cases). Here is the factorization we are
1032 going to prove,
1033 %
1034 \begin{equation}
1035 W_{ab} = (W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b).
1036 \end{equation}
1037 %
1038 We can check it by expanding the product into components,
1039 %
1040 \begin{eqnarray}
1041 \lefteqn{[(W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b)](la+m,rb+s)}  \\
1042 & = &
1043 \sum_{u=0}^{b-1} \sum_{t=0}^{a-1}
1044 [(W_b \otimes I_a)](la+m,ua+t) [P^a_b D^a_b (W_a \otimes I_b)](ua+t,rb+s)
1045 \end{eqnarray}
1046 %
1047 where we have split the indices to match the Kronecker product $0 \leq
1048 m, r \leq a$, $0 \leq l, s \leq b$.  The first term in the sum can
1049 easily be reduced to its component form,
1050 %
1051 \begin{eqnarray}
1052 [(W_b \otimes I_a)](la+m,ua+t)
1053 &=& W_b(l,u) I_a(m,t) \\
1054 &=& \omega_b^{lu} \delta_{mt}
1055 \end{eqnarray}
1056 %
1057 The second term is more complicated. We can expand the Kronecker
1058 product like this,
1059 \begin{eqnarray}
1060 (W_a \otimes I_b)(tb+u,rb+s)
1061 &=& W_a(t,r) I_a(u,s) \\
1062 &=& \omega_a^{tr} \delta_{us}
1063 \end{eqnarray}
1064 %
1065 and use this term to build up the product, $P^a_b D^a_b (W_a \otimes
1066 I_b)$. We first multiply by $D^a_b$,
1067 %
1068 \begin{equation}
1069 [D^a_b (W_a \otimes I_b)](tb+u,rb+s)
1070 =
1071 \omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
1072 \end{equation}
1073 %
1074 and then apply the permutation matrix, $P^a_b$, which digit-reverses
1075 the ordering of the first index, to obtain,
1076 %
1077 \begin{equation}
1078 [P^a_b D^a_b (W_a \otimes I_b)](ua+t,rb+s)
1079 =
1080 \omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
1081 \end{equation}
1082 %
1083 Combining the two terms in the matrix product we can obtain the full
1084 expansion in terms of the exponential $\omega$,
1085 %
1086 \begin{eqnarray}
1087 [(W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b)](la+m,rb+s)
1088 &=&
1089 \sum_{u=0}^{b-1} \sum_{t=0}^{a-1}
1090 \omega_b^{lu} \delta_{mt} \omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
1091 \end{eqnarray}
1092 %
1093 If we evaluate this sum explicitly we can make the connection between
1094 the product involving $W_a$ and $W_b$ (above) and the expansion of the
1095 full DFT matrix $W_{ab}$,
1096 %
1097 \begin{eqnarray}
1098 \sum_{u=0}^{b-1} \sum_{t=0}^{a-1}
1099 \omega_b^{lu} \delta_{mt} \omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
1100 &=& \omega^{ls}_b \omega^{ms}_{ab} \omega^{mr}_a \\
1101 &=& \omega^{als + ms +bmr}_{ab} \\
1102 &=& \omega^{als + ms +bmr}_{ab} \omega^{lrab}_{ab} \quad\mbox{using~} \omega^{ab}_{ab} =1\\
1103 &=& \omega^{(la+m)(rb+s)}_{ab} \\
1104 &=& W_{ab}(la+m,rb+s)
1105 \end{eqnarray}
1106 %
1107 The final line shows that matrix product given above is identical to
1108 the full two-factor DFT matrix, $W_{ab}$.
1109 %
1110 Thus the full DFT matrix $W_{ab}$ for two factors $a$, $b$ can be
1111 broken down into a product of sub-transforms, $W_a$ and $W_b$, plus
1112 permutations, $P$, and twiddle factors, $D$, according to the formula,
1113 %
1114 \begin{equation}
1115 W_{ab} = (W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b).
1116 \end{equation}
1117 %
1118 This relation is the foundation of the general-$N$ mixed-radix FFT algorithm.
1119
1120 \subsection{Three factor case, $N=abc$}
1121 %
1122 The result for the two-factor expansion can easily be generalized to
1123 three factors. We first consider $abc$ as being a product of two
1124 factors $a$ and $(bc)$, and then further expand the product $(bc)$ into
1125 $b$ and $c$. The first step of the expansion looks like this,
1126 %
1127 \begin{eqnarray}
1128 W_{abc} &=& W_{a(bc)}\\
1129 &=& (W_{bc} \otimes I_a) P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) .
1130 \end{eqnarray}
1131 %
1132 And after using the two-factor result to expand out $W_{bc}$ we obtain
1133 the factorization of $W_{abc}$,
1134 %
1135 \begin{eqnarray}
1136 W_{abc} &=& (((W_c \otimes I_b) P^b_c D^b_c (W_b \otimes I_c)) \otimes I_a )
1137 P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) \\
1138 &=& (W_c \otimes I_{ab}) (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) P^a_{bc} D^a_{bc} (W_a \otimes I_c)
1139 \end{eqnarray}
1140 %
1141 We can write this factorization in a product form, with one term for
1142 each factor,
1143 %
1144 \begin{equation}
1145 W_{abc} = T_3 T_2 T_1
1146 \end{equation}
1147 %
1148 where we read off $T_1$, $T_2$ and $T_3$,
1149 %
1150 \begin{eqnarray}
1151 T_1 &=& P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) \\
1152 T_2 &=& (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) \\
1153 T_3 &=& (W_c \otimes I_{ab} )
1154 \end{eqnarray}
1155 %
1156
1157
1158 \subsection{General case, $N=f_1 f_2 \dots f_{n_f}$}
1159 %
1160 If we continue the procedure that we have used for two- and
1161 three-factors then a general pattern begins to emerge in the
1162 factorization of $W_{f_1 f_2 \dots f_{n_f}}$. To see the beginning of
1163 the pattern we can rewrite the three factor case as,
1164 %
1165 \begin{eqnarray}
1166 T_1 &=& (P^a_{bc} D^a_{bc} \otimes I_1) (W_a \otimes I_{bc}) \\
1167 T_2 &=& (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) \\
1168 T_3 &=& (P^c_1 D^c_1 \otimes I_{ab}) (W_c \otimes I_{ab} )
1169 \end{eqnarray}
1170 %
1171 using the special cases $P^c_1 = D^c_1 = I_c$.
1172 %
1173 In general, we can write the factorization of $W_N$ for $N= f_1 f_2
1174 \dots f_{n_f}$ as,
1175 %
1176 \begin{equation}
1177 W_N = T_{n_f} \dots T_2 T_1
1178 \end{equation}
1179 %
1180 where the matrix factors $T_i$ are,
1181 %
1182 \begin{equation}
1183 T_i = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) ( W_{f_i}
1184 \otimes I_{m_i})
1185 \end{equation}
1186 %
1187 We have defined the following three additional variables $p$, $q$ and
1188 $m$ to denote different partial products of the factors,
1189 %
1190 \begin{eqnarray}
1191 p_i &=& f_1 f_2 \dots f_i \quad (p_0 = 1)  \\
1192 q_i &=& N / p_i  \\
1193 m_i &=& N / f_i
1194 \end{eqnarray}
1195 %
1196 Note that the FFT modules $W$ are applied before the permutations $P$,
1197 which makes this a decimation-in-frequency algorithm.
1198
1199 \subsection{Implementation}
1200 %
1201 Now to the implementation of the algorithm. We start with a vector of
1202 data, $z$, as input and want to apply the transform,
1203 %
1204 \begin{eqnarray}
1205 x &=& W_N z \\
1206   &=& T_{n_f} \dots T_2 T_1 z
1207 \end{eqnarray}
1208 %
1209 where $T_i = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) (
1210 W_{f_i} \otimes I_{m_i})$.
1211
1212 The outer structure of the implementation will be a loop over the
1213 $n_f$ factors, applying each matrix $T_i$ to the vector in turn to
1214 build up the complete transform.
1215 %
1216 \begin{algorithm}
1217 \FOR{$(i = 1 \dots n_f)$}
1218         \STATE{$v \Leftarrow T_i v $}
1219 \ENDFOR
1220 \end{algorithm}
1221 %
1222 The order of the factors is not important. Now we examine the iteration
1223 $v \Leftarrow T_i v$, which we'll write as,
1224 %
1225 \begin{equation}
1226 v' =
1227 (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) ~
1228 ( W_{f_i} \otimes I_{m_i}) v
1229 \end{equation}
1230 %
1231 There are two Kronecker product matrices in this iteration. The
1232 rightmost matrix, which is the first to be applied, is a DFT of length
1233 $f_i$ applied to $N/f_i$ subsets of the data. We'll call this $t$,
1234 since it will be a temporary array,
1235 %
1236 \begin{equation}
1237 t = (W_{f_i} \otimes I_{m_i}) v
1238 \end{equation}
1239 %
1240 The second matrix applies a permutation and the exponential
1241 twiddle-factors. We'll call this $v'$, since it is the result of the
1242 full iteration on $v$,
1243 %
1244 \begin{equation}
1245 v' = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) t
1246 \end{equation}
1247
1248 The effect of the matrix $(W_{f_i} \otimes I_{m_i})$ is best seen by
1249 an example. Suppose the factor is $f_i = 3$, and the length of the FFT
1250 is $N=6$, then the relevant Kronecker product is,
1251 %
1252 \begin{equation}
1253 t = (W_3 \otimes I_2) v
1254 \end{equation}
1255 %
1256 which expands out to,
1257 %
1258 \begin{equation}
1259 \left(
1260 \begin{array}{c}
1261 t_0 \\
1262 t_1 \\
1263 t_2 \\
1264 t_3 \\
1265 t_4 \\
1266 t_5
1267 \end{array}
1268 \right)
1269 =
1270 \left(
1271 \begin{array}{cccccc}
1272 W_3(1,1) & 0 & W_3(1,2) & 0 & W_3(1,3) & 0 \\
1273 0 & W_3(1,1) & 0 & W_3(1,2) & 0 & W_3(1,3) \\
1274 W_3(2,1) & 0 & W_3(2,2) & 0 & W_3(2,3) & 0 \\
1275 0 & W_3(2,1) & 0 & W_3(2,2) & 0 & W_3(2,3) \\
1276 W_3(3,1) & 0 & W_3(3,2) & 0 & W_3(3,3) & 0 \\
1277 0 & W_3(3,1) & 0 & W_3(3,2) & 0 & W_3(3,3)
1278 \end{array}
1279 \right)
1280 \left(
1281 \begin{array}{c}
1282 v_0 \\
1283 v_1 \\
1284 v_2 \\
1285 v_3 \\
1286 v_4 \\
1287 v_5
1288 \end{array}
1289 \right)
1290 \end{equation}
1291 %
1292 We can rearrange the components in a computationally convenient form,
1293 \begin{equation}
1294 \left(
1295 \begin{array}{c}
1296 t_0 \\
1297 t_2 \\
1298 t_4 \\
1299 t_1 \\
1300 t_3 \\
1301 t_5
1302 \end{array}
1303 \right)
1304 =
1305 \left(
1306 \begin{array}{cccccc}
1307 W_3(1,1) & W_3(1,2) & W_3(1,3) & 0 & 0 & 0 \\
1308 W_3(2,1) & W_3(2,2) & W_3(2,3) & 0 & 0 & 0 \\
1309 W_3(3,1) & W_3(3,2) & W_3(3,3) & 0 & 0 & 0 \\
1310 0 & 0 & 0 & W_3(1,1) & W_3(1,2) & W_3(1,3) \\
1311 0 & 0 & 0 & W_3(2,1) & W_3(2,2) & W_3(2,3) \\
1312 0 & 0 & 0 & W_3(3,1) & W_3(3,2) & W_3(3,3)
1313 \end{array}
1314 \right)
1315 \left(
1316 \begin{array}{c}
1317 v_0 \\
1318 v_2 \\
1319 v_4 \\
1320 v_1 \\
1321 v_3 \\
1322 v_5
1323 \end{array}
1324 \right)
1325 \end{equation}
1326 %
1327 which clearly shows that we just need to apply the $3\times 3$ DFT
1328 matrix $W_3$ twice, once to the sub-vector of elements $(v_0, v_2, v_4)$,
1329 and independently to the remaining sub-vector $(v_1, v_3, v_5)$.
1330
1331 In the general case, if we index $t$ as $t_k = t(\lambda,\mu) =
1332 t_{\lambda m + \mu}$ then $\lambda = 0 \dots f-1$ is an index within
1333 each transform of length $f$ and $\mu = 0 \dots m-1$ labels the
1334 independent subsets of data. We can see this by showing the
1335 calculation with all indices present,
1336 %
1337 \begin{equation}
1338 t = (W_f \otimes I_m) z
1339 \end{equation}
1340 %
1341 becomes,
1342 %
1343 \begin{eqnarray}
1344 t_{\lambda m + \mu} &=& \sum_{\lambda'=0}^{f-1} \sum_{\mu'=0}^{m-1}
1345         (W_f \otimes I_m)_{(\lambda m + \mu)(\lambda' m + \mu')}
1346         z_{\lambda'm + \mu} \\
1347 &=& \sum_{\lambda'\mu'} (W_f)_{\lambda\lambda'} \delta_{\mu\mu'}
1348         z_{\lambda'm+\mu'} \\
1349 &=& \sum_{\lambda'} (W_f)_{\lambda\lambda'} z_{\lambda'm+\mu}
1350 \end{eqnarray}
1351 %
1352 The DFTs on the index $\lambda$ will be computed using
1353 special optimized modules for each $f$.
1354
1355 To calculate the next stage,
1356 %
1357 \begin{equation}
1358 v'=(P^f_q D^f_q \otimes I_{p_{i-1}}) t
1359 \end{equation}
1360 %
1361 we note that the Kronecker product has the property of performing
1362 $p_{i-1}$ independent multiplications of $PD$ on $q_{i-1}$ different
1363 subsets of $t$. The index $\mu$ of $t(\lambda,\mu)$ which runs from 0
1364 to $m$ will include $q_i$ copies of each $PD$ operation because
1365 $m=p_{i-1}q$. i.e. we can split the index $\mu$ further into $\mu = a
1366 p_{i-1} + b$, where $a = 0 \dots q-1$ and $b=0 \dots p_{i-1}$,
1367 %
1368 \begin{eqnarray}
1369 \lambda m + \mu &=& \lambda m + a p_{i-1} + b \\
1370         &=& (\lambda q + a) p_{i-1} + b.
1371 \end{eqnarray}
1372 %
1373 Now we can expand the second stage,
1374 %
1375 \begin{eqnarray}
1376 v'&=& (P^f_q D^f_q \otimes I_{p_{i-1}}) t \\
1377 v'_{\lambda m + \mu} &=& \sum_{\lambda' \mu'}
1378  (P^f_q D^f_q \otimes I_{p_{i-1}})_{(\lambda m + \mu)(\lambda' m + \mu')}
1379         t_{\lambda' m + \mu'} \\
1380 v'_{(\lambda q + a) p_{i-1} + b} &=& \sum_{\lambda' a' b'}
1381 (
1382 P^f_q D^f_q \otimes I_{p_{i-1}}
1383 )_{((\lambda q+ a)p_{i-1} + b)((\lambda' q+ a')p_{i-1} + b')}
1384 t_{(\lambda' q + a')p_{i-1} +b'}
1385 \end{eqnarray}
1386 %
1387 The first step in removing redundant indices is to take advantage of
1388 the identity matrix $I$ and separate the subspaces of the Kronecker
1389 product,
1390 %
1391 \begin{equation}
1392 (
1393 P^f_q D^f_q \otimes I_{p_{i-1}}
1394 )_{((\lambda q+ a)p_{i-1} + b)((\lambda' q+ a')p_{i-1} + b')}
1395 =
1396 (P^f_q D^f_q)_{(\lambda q + a)(\lambda' q + a')}
1397 \delta_{bb'}
1398 \end{equation}
1399 %
1400 This eliminates one sum, leaving us with,
1401 %
1402 \begin{equation}
1403 v'_{(\lambda q + a) p_{i-1} + b}
1404 =
1405 \sum_{\lambda' a' }
1406 (P^f_q D^f_q)_{(\lambda q + a)(\lambda' q + a')} t_{(\lambda'q+a')p_{i-1} + b}
1407 \end{equation}
1408 %
1409 We can insert the definition of $D^f_q$ to give,
1410 %
1411 \begin{equation}
1412 \phantom{v'_{(\lambda q + a) p_{i-1} + b}}
1413 = \sum_{\lambda'a'} (P^f_q)_{(\lambda q + a)(\lambda'q + a')}
1414 \omega^{\lambda'a'}_{q_{i-1}} t_{(\lambda'q+a')p_{i-1}+b}
1415 \end{equation}
1416 %
1417 Using the definition of $P^f_q$, which exchanges an index of $\lambda
1418 q + a$ with $a f + \lambda$, we get a final result with no matrix
1419 multiplication,
1420 %
1421 \begin{equation}
1422 v'_{(a f + \lambda) p_{i-1} + b}
1423 = \omega^{\lambda a}_{q_{i-1}} t_{(\lambda q + a)p_{i-1} + b}
1424 \end{equation}
1425 %
1426 All we have to do is premultiply each element of the temporary vector
1427 $t$ by an exponential twiddle factor and store the result in another
1428 index location, according to the digit reversal permutation of $P$.
1429
1430 Here is the algorithm to implement the mixed-radix FFT,
1431 %
1432 \begin{algorithm}
1433 \FOR{$i = 1 \dots n_f$}
1434 \FOR{$a = 0 \dots q-1$}
1435 \FOR{$b = 0 \dots p_{i-1} - 1$}
1436 \FOR{$\lambda = 0 \dots f-1$}
1437 \STATE{$t_\lambda \Leftarrow
1438        \sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda') v_{b+\lambda'm+ap_{i-1}}$}
1439        \COMMENT{DFT matrix-multiply module}
1440 \ENDFOR
1441 \FOR{$\lambda = 0 \dots f-1$}
1442 \STATE{$v'_{(af+\lambda)p_{i-1}+b}
1443         \Leftarrow \omega^{\lambda a}_{q_{i-1}} t_\lambda$}
1444 \ENDFOR
1445 \ENDFOR
1446 \ENDFOR
1447 \STATE{$v \Leftarrow v'$}
1448 \ENDFOR
1449 \end{algorithm}
1450 %
1451 \subsection{Details of the implementation}
1452 %
1453 First the function {\tt gsl\_fft\_complex\_wavetable\_alloc} allocates
1454 $n$ elements of scratch space (to hold the vector $v'$ for each
1455 iteration) and $n$ elements for a trigonometric lookup table of
1456 twiddle factors.
1457
1458 Then the length $n$ must be factorized. There is a general
1459 factorization function {\tt gsl\_fft\_factorize} which takes a list of
1460 preferred factors. It first factors out the preferred factors and then
1461 removes general remaining prime factors.
1462
1463 The algorithm used to generate the trigonometric lookup table is
1464 %
1465 \begin{algorithm}
1466 \FOR {$a = 1 \dots n_f$}
1467 \FOR {$b = 1 \dots f_i - 1$}
1468 \FOR {$c = 1 \dots q_i$}
1469 \STATE $\mbox{trig[k++]} = \exp(- 2\pi i b c p_{a-1}/N)$
1470 \ENDFOR
1471 \ENDFOR
1472 \ENDFOR
1473 \end{algorithm}
1474 %
1475 Note that $\sum_{1}^{n_f} \sum_{0}^{f_i-1} \sum_{1}^{q_i} =
1476 \sum_{1}^{n_f} (f_i-1)q_i = n - 1$ so $n$ elements are always
1477 sufficient to store the lookup table. This is chosen because we need
1478 to compute $\omega_{q_i-1}^{\lambda a} t_\lambda$ in
1479 the FFT. In terms of the lookup table we can write this as,
1480 %
1481 \begin{eqnarray}
1482 \omega_{q_{i-1}}^{\lambda a} t_\lambda
1483 &=&  \exp(-2\pi i \lambda a/q_{i-1}) t_\lambda \\
1484 &=&  \exp(-2\pi i \lambda a p_{i-1}/N) t_\lambda \\
1485 &=& \left\{
1486     \begin{array}{ll}
1487     t_\lambda & a=0 \\
1488     \mbox{trig}[\mbox{twiddle[i]}+\lambda q+(a-1)] t_\lambda & a\not=0
1489 \end{array}
1490 \right.
1491 \end{eqnarray}
1492 %
1493 \begin{sloppypar}
1494 \noindent
1495 The array {\tt twiddle[i]} maintains a set of pointers into {\tt trig}
1496 for the starting points for the outer loop.  The core of the
1497 implementation is {\tt gsl\_fft\_complex}. This function loops over
1498 the chosen factors of $N$, computing the iteration $v'=T_i v$ for each
1499 pass. When the DFT for a factor is implemented the iteration is
1500 handed-off to a dedicated small-$N$ module, such as {\tt
1501 gsl\_fft\_complex\_pass\_3} or {\tt
1502 gsl\_fft\_complex\_pass\_5}.  Unimplemented factors are handled
1503 by the general-$N$ routine {\tt gsl\_fft\_complex\_pass\_n}. The
1504 structure of one of the small-$N$ modules is a simple transcription of
1505 the basic algorithm given above.  Here is an example for {\tt
1506 gsl\_fft\_complex\_pass\_3}. For a pass with a factor of 3 we have to
1507 calculate the following expression,
1508 \end{sloppypar}%
1509 \begin{equation}
1510 v'_{(a f + \lambda) p_{i-1} + b}
1511 =
1512 \sum_{\lambda' = 0,1,2}
1513 \omega^{\lambda a}_{q_{i-1}} W^{\lambda \lambda'}_{3}
1514 v_{b + \lambda' m + a p_{i-1}}
1515 \end{equation}
1516 %
1517 for $b = 0 \dots p_{i-1} - 1$, $a = 0 \dots q_{i} - 1$ and $\lambda =
1518 0, 1, 2$.  This is implemented as,
1519 %
1520 \begin{algorithm}
1521 \FOR {$a = 0 \dots q-1$}
1522 \FOR {$b = 0 \dots p_{i-1} - 1$}
1523 \STATE {$
1524         \left(
1525         \begin{array}{c}
1526         t_0 \\ t_1 \\ t_2
1527         \end{array}
1528         \right)
1529         =
1530         \left(
1531         \begin{array}{ccc}
1532         W^{0}_3 & W^{0}_3 & W^{0}_3 \\
1533         W^{0}_3 & W^{1}_3 & W^{2}_3 \\
1534         W^{0}_3 & W^{2}_3 & W^{4}_3
1535         \end{array}
1536         \right)
1537         \left(
1538         \begin{array}{l}
1539         v_{b + a p_{i-1}} \\
1540         v_{b + a p_{i-1} + m} \\
1541         v_{b + a p_{i-1} +2m}
1542         \end{array}
1543         \right)$}
1544         \STATE {$ v'_{a p_{i} + b}  = t_0$}
1545         \STATE {$ v'_{a p_{i} + b + p_{i-1}} = \omega^{a}_{q_{i-1}} t_1$}
1546         \STATE {$ v'_{a p_{i} + b + 2 p_{i-1}} = \omega^{2a}_{q_{i-1}} t_2$}
1547 \ENDFOR
1548 \ENDFOR
1549 \end{algorithm}
1550 %
1551 In the code we use the variables {\tt from0}, {\tt from1}, {\tt from2}
1552 to index the input locations,
1553 %
1554 \begin{eqnarray}
1555 \mbox{\tt from0} &=& b + a p_{i-1} \\
1556 \mbox{\tt from1} &=& b + a p_{i-1} + m \\
1557 \mbox{\tt from2} &=& b + a p_{i-1} + 2m
1558 \end{eqnarray}
1559 %
1560 and the variables {\tt to0}, {\tt to1}, {\tt to2} to index the output
1561 locations in the scratch vector $v'$,
1562 %
1563 \begin{eqnarray}
1564 \mbox{\tt to0} &=& b + a p_{i} \\
1565 \mbox{\tt to1} &=& b + a p_{i} + p_{i-1} \\
1566 \mbox{\tt to2} &=& b + a p_{i} + 2 p_{i-1}
1567 \end{eqnarray}
1568 %
1569 The DFT matrix multiplication is computed using the optimized
1570 sub-transform modules given in the next section. The twiddle factors
1571 $\omega^a_{q_{i-1}}$ are taken out of the {\tt trig} array.
1572
1573 To compute the inverse transform we go back to the definition of the
1574 fourier transform and note that the inverse matrix is just the complex
1575 conjugate of the forward matrix (with a factor of $1/N$),
1576 %
1577 \begin{equation}
1578 W^{-1}_N = W^*_N / N
1579 \end{equation}
1580 %
1581 Therefore we can easily compute the inverse transform by conjugating
1582 all the complex elements of the DFT matrices and twiddle factors that
1583 we apply. (An alternative strategy is to conjugate the input data,
1584 take a forward transform, and then conjugate the output data).
1585
1586 \section{Fast Sub-transform Modules}
1587 %
1588 To implement the mixed-radix FFT we still need to compute the
1589 small-$N$ DFTs for each factor. Fortunately many highly-optimized
1590 small-$N$ modules are available, following the work of Winograd who
1591 showed how to derive efficient small-$N$ sub-transforms by number
1592 theoretic techniques.
1593
1594 The algorithms in this section all compute,
1595 %
1596 \begin{equation}
1597 x_a = \sum_{b=0}^{N-1} W_N^{ab} z_b
1598 \end{equation}
1599 %
1600 The sub-transforms given here are the ones recommended by Temperton
1601 and differ slightly from the canonical Winograd modules. According to
1602 Temperton~\cite{temperton83} they are slightly more robust against
1603 rounding errors and trade off some additions for multiplications.
1604 %
1605 For the $N=2$ DFT,
1606 %
1607 \begin{equation}
1608 \begin{array}{ll}
1609 x_0 = z_0 + z_1, &
1610 x_1 = z_0 - z_1.
1611 \end{array}
1612 \end{equation}
1613 %
1614 For the $N=3$ DFT,
1615 %
1616 \begin{equation}
1617 \begin{array}{lll}
1618 t_1 = z_1 + z_2, &
1619 t_2 = z_0 - t_1/2, &
1620 t_3 = \sin(\pi/3) (z_1 - z_2),
1621 \end{array}
1622 \end{equation}
1623 \begin{equation}
1624 \begin{array}{lll}
1625 x_0 = z_0 + t_1, &
1626 x_1 = t_2 + i t_3, &
1627 x_2 = t_2 - i t_3.
1628 \end{array}
1629 \end{equation}
1630 %
1631 The $N=4$ transform involves only additions and subtractions,
1632 %
1633 \begin{equation}
1634 \begin{array}{llll}
1635 t_1 = z_0 + z_2, &
1636 t_2 = z_1 + z_3, &
1637 t_3 = z_0 - z_2, &
1638 t_4 = z_1 - z_3,
1639 \end{array}
1640 \end{equation}
1641 \begin{equation}
1642 \begin{array}{llll}
1643 x_0 = t_1 + t_2, &
1644 x_1 = t_3 + i t_4, &
1645 x_2 = t_1 - t_2, &
1646 x_3 = t_3 - i t_4.
1647 \end{array}
1648 \end{equation}
1649 %
1650 For the $N=5$ DFT,
1651 %
1652 \begin{equation}
1653 \begin{array}{llll}
1654 t_1 = z_1 + z_4, &
1655 t_2 = z_2 + z_3, &
1656 t_3 = z_1 - z_4, &
1657 t_4 = z_2 - z_3,
1658 \end{array}
1659 \end{equation}
1660 \begin{equation}
1661 \begin{array}{llll}
1662 t_5 = t_1 + t_2, &
1663 t_6 = (\sqrt{5}/4) (t_1 - t_2), &
1664 t_7 = z_0 - t_5/4, \\
1665 \end{array}
1666 \end{equation}
1667 \begin{equation}
1668 \begin{array}{llll}
1669 t_8 = t_7 + t_6, &
1670 t_9 = t_7 - t_6, \\
1671 \end{array}
1672 \end{equation}
1673 \begin{equation}
1674 \begin{array}{llll}
1675 t_{10} = \sin(2\pi/5) t_3 + \sin(2\pi/10) t_4, &
1676 t_{11} = \sin(2\pi/10) t_3 - \sin(2\pi/5) t_4,
1677 \end{array}
1678 \end{equation}
1679 \begin{equation}
1680 \begin{array}{llll}
1681 x_0 = z_0 + t_5,
1682 \end{array}
1683 \end{equation}
1684 \begin{equation}
1685 \begin{array}{llll}
1686 x_1 = t_8 + i t_{10}, &
1687 x_2 = t_9 + i t_{11},
1688 \end{array}
1689 \end{equation}
1690 \begin{equation}
1691 \begin{array}{llll}
1692 x_3 = t_9 - i t_{11}, &
1693 x_4 = t_8 - i t_{10}.
1694 \end{array}
1695 \end{equation}
1696 %
1697 The DFT matrix for $N=6$ can be written as a combination of $N=3$ and
1698 $N=2$ transforms with index permutations,
1699 %
1700 \begin{equation}
1701 \left(
1702 \begin{array}{c}
1703 x_0 \\
1704 x_4 \\
1705 x_2 \\
1706 \hline x_3 \\
1707 x_1 \\
1708 x_5
1709 \end{array}
1710 \right)
1711 =
1712 \left(
1713 \begin{array}{ccc|ccc}
1714   &   &  &  &   & \\
1715   &W_3&  &  &W_3& \\
1716   &   &  &  &   & \\
1717 \hline  &   &  &  &   & \\
1718   &W_3&  &  &-W_3& \\
1719   &   &  &  &   &
1720 \end{array}
1721 \right)
1722 \left(
1723 \begin{array}{c}
1724 z_0 \\
1725 z_2 \\
1726 z_4 \\
1727 \hline z_3 \\
1728 z_5 \\
1729 z_1
1730 \end{array}
1731 \right)
1732 \end{equation}
1733 %
1734 This simplification is an example of the Prime Factor Algorithm, which
1735 can be used because the factors 2 and 3 are mutually prime.  For more
1736 details consult one of the books on number theory for
1737 FFTs~\cite{elliott82,blahut}. We can take advantage of the simple
1738 indexing scheme of the PFA to write the $N=6$ DFT as,
1739 %
1740 \begin{equation}
1741 \begin{array}{lll}
1742 t_1 = z_2 + z_4, &
1743 t_2 = z_0 - t_1/2, &
1744 t_3 = \sin(\pi/3) (z_2 - z_4),
1745 \end{array}
1746 \end{equation}
1747 \begin{equation}
1748 \begin{array}{lll}
1749 t_4 = z_5 + z_1, &
1750 t_5 = z_3 - t_4/2, &
1751 t_6 = \sin(\pi/3) (z_5 - z_1),
1752 \end{array}
1753 \end{equation}
1754 \begin{equation}
1755 \begin{array}{lll}
1756 t_7 = z_0 + t_1, &
1757 t_8 = t_2 + i t_3, &
1758 t_9 = t_2 - i t_3,
1759 \end{array}
1760 \end{equation}
1761 \begin{equation}
1762 \begin{array}{lll}
1763 t_{10} = z_3 + t_4, &
1764 t_{11} = t_5 + i t_6, &
1765 t_{12} = t_5 - i t_6,
1766 \end{array}
1767 \end{equation}
1768 \begin{equation}
1769 \begin{array}{lll}
1770 x_0 = t_7 + t_{10}, &
1771 x_4 = t_8 + t_{11}, &
1772 x_2 = t_9 + t_{12},
1773 \end{array}
1774 \end{equation}
1775 \begin{equation}
1776 \begin{array}{lll}
1777 x_3 = t_7 - t_{10}, &
1778 x_1 = t_8 - t_{11}, &
1779 x_5 = t_9 - t_{12}.
1780 \end{array}
1781 \end{equation}
1782
1783 For any remaining general factors we use Singleton's efficient method
1784 for computing a DFT~\cite{singleton}. Although it is an $O(N^2)$
1785 algorithm it does reduce the number of multiplications by a factor of
1786 4 compared with a naive evaluation of the DFT. If we look at the
1787 general stucture of a DFT matrix, shown schematically below,
1788 %
1789 \begin{equation}
1790 \left(
1791 \begin{array}{c}
1792 h_0 \\
1793 h_1 \\
1794 h_2 \\
1795 \vdots \\
1796 h_{N-2} \\
1797 h_{N-1}
1798 \end{array}
1799 \right)
1800 =
1801 \left(
1802 \begin{array}{cccccc}
1803 1 & 1 & 1 & \cdots & 1 & 1 \\
1804 1 & W & W & \cdots & W & W \\
1805 1 & W & W & \cdots & W & W \\
1806 \vdots & \vdots & \vdots & \cdots & \vdots & \vdots \\
1807 1 & W & W & \cdots & W & W \\
1808 1 & W & W & \cdots & W & W
1809 \end{array}
1810 \right)
1811 \left(
1812 \begin{array}{c}
1813 g_0 \\
1814 g_1 \\
1815 g_2 \\
1816 \vdots \\
1817 g_{N-2} \\
1818 g_{N-1}
1819 \end{array}
1820 \right)
1821 \end{equation}
1822 %
1823 we see that the outer elements of the DFT matrix are all unity. We can
1824 remove these trivial multiplications but we will still be left with an
1825 $(N-1) \times (N-1)$ sub-matrix of complex entries, which would appear
1826 to require $(N-1)^2$ complex multiplications.  Singleton's method,
1827 uses symmetries of the DFT matrix to convert the complex
1828 multiplications to an equivalent number of real multiplications. We
1829 start with the definition of the DFT in component form,
1830 %
1831 \begin{equation}
1832 a_k + i b_k = \sum_{j=0} (x_j+iy_j)(\cos(2\pi jk/f) - i\sin(2\pi jk/f))
1833 \end{equation}
1834 %
1835 The zeroth component can be computed using only additions,
1836 %
1837 \begin{equation}
1838 a_0 + i b_0 = \sum_{j=0}^{(f-1)} x_j + i y_j
1839 \end{equation}
1840 %
1841 We can rewrite the remaining components as,
1842 %
1843 \begin{eqnarray}
1844 a_k + i b_k & =   x_0 + i y_0 & +
1845   \sum_{j=1}^{(f-1)/2} (x_j + x_{f-j}) \cos(2\pi jk/f)
1846  + (y_j - y_{f-j}) \sin(2\pi jk/f) \\
1847 & & + i\sum_{j=1}^{(f-1)/2} (y_j + y_{f-j}) \cos(2\pi jk/f)
1848  - (x_j - x_{f-j}) \sin(2\pi jk/f)
1849 \end{eqnarray}
1850 %
1851 by using the following trigonometric identities,
1852 %
1853 \begin{eqnarray}
1854  \cos(2\pi(f-j)k/f) &=& \phantom{-}\cos(2\pi jk/f) \\
1855  \sin(2\pi(f-j)k/f) &=&  -\sin(2\pi jk/f)
1856 \end{eqnarray}
1857 %
1858 These remaining components can all be computed using four partial
1859 sums,
1860 %
1861 \begin{eqnarray}
1862 a_k + i b_k & = & (a^+_k - a^-_k) + i (b^+_k + b^-_k) \\
1863 a_{f-k} + i b_{f-k} & = & (a^+_k + a^-_k) + i (b^+_k - b^-_k)
1864 \end{eqnarray}
1865 %
1866 for $k = 1, 2, \dots, (f-1)/2$, where,
1867 %
1868 \begin{eqnarray}
1869 a^+_k &=& x_0 + \sum_{j=1}^{(f-1)/2} (x_j + x_{f-j}) \cos(2\pi jk/f) \\
1870 a^-_k &=& \phantom{x_0} - \sum_{j=1}^{(f-1)/2} (y_j - y_{f-j}) \sin(2\pi jk/f) \\
1871 b^+_k &=& y_0 + \sum_{j=1}^{(f-1)/2} (y_j + y_{f-j}) \cos(2\pi jk/f) \\
1872 b^-_k &=& \phantom{y_0} - \sum_{j=1}^{(f-1)/2} (x_j - x_{f-j}) \sin(2\pi jk/f)
1873 \end{eqnarray}
1874 %
1875 Note that the higher components $k'=f-k$ can be obtained directly
1876 without further computation once $a^+$, $a^-$, $b^+$ and $b^-$ are
1877 known. There are $4 \times (f-1)/2$ different sums, each involving
1878 $(f-1)/2$ real multiplications, giving a total of $(f-1)^2$ real
1879 multiplications instead of $(f-1)^2$ complex multiplications.
1880
1881 To implement Singleton's method we make use of the input and output
1882 vectors $v$ and $v'$ as scratch space, copying data back and forth
1883 between them to obtain the final result.  First we use $v'$ to store
1884 the terms of the symmetrized and anti-symmetrized vectors of the form
1885 $x_j + x_{f-j}$ and $x_j - y_{f-j}$. Then we multiply these by the
1886 appropriate trigonometric factors to compute the partial sums $a^+$,
1887 $a^-$, $b^+$ and $b^-$, storing the results $a_k + i b_k$ and $a_{f-k}
1888 + i b_{f-k}$ back in $v$. Finally we multiply the DFT output by any
1889 necessary twiddle factors and place the results in $v'$.
1890
1891 \section{FFTs for real data}
1892 %
1893 This section is based on the articles {\em Fast Mixed-Radix Real
1894   Fourier Transforms} by Clive Temperton~\cite{temperton83real} and
1895 {\em Real-Valued Fast Fourier Transform Algorithms} by Sorensen,
1896 Jones, Heideman and Burrus~\cite{burrus87real}. The DFT of a real
1897 sequence has a special symmetry, called a {\em conjugate-complex} or
1898 {\em half-complex} symmetry,
1899 %
1900 \begin{equation}
1901 h(a) = h(N-a)^*
1902 \end{equation}
1903 %
1904 The element $h(0)$ is real, and when $N$ is even $h(N/2)$ is also
1905 real. It is straightforward to prove the symmetry,
1906 %
1907 \begin{eqnarray}
1908 h(a) &=& \sum W^{ab}_N g(b) \\
1909 h(N-a)^* &=& \sum W^{-(N-a)b}_N g(b)^*  \\
1910          &=& \sum W^{-Nb}_N W^{ab}_N g(b) \qquad{(W^N_N=1)} \\
1911          &=& \sum W^{ab}_N g(b)
1912 \end{eqnarray}
1913 %
1914 Real-valued data is very common in practice (perhaps more common that
1915 complex data) so it is worth having efficient FFT routines for real
1916 data. In principle an FFT for real data should need half the
1917 operations of an FFT on the equivalent complex data (where the
1918 imaginary parts are set to zero). There are two different strategies
1919 for computing FFTs of real-valued data:
1920
1921 One strategy is to ``pack'' the real data (of length $N$) into a
1922 complex array (of length $N/2$) by index transformations. A complex
1923 FFT routine can then be used to compute the transform of that array.
1924 By further index transformations the result can actually by
1925 ``unpacked'' to the FFT of the original real data. It is also possible
1926 to do two real FFTs simultaneously by packing one in the real part and
1927 the other in the imaginary part of the complex array.  These
1928 techniques have some disadvantages. The packing and unpacking
1929 procedures always add $O(N)$ operations, and packing a real array of
1930 length $N$ into a complex array of length $N/2$ is only possible if
1931 $N$ is even. In addition, if two unconnected datasets with very
1932 different magnitudes are packed together in the same FFT there could
1933 be ``cross-talk'' between them due to a loss of precision.
1934
1935 A more straightforward strategy is to start with an FFT algorithm,
1936 such as the complex mixed-radix algorithm, and prune out all the
1937 operations involving the zero imaginary parts of the initial data. The
1938 FFT is linear so the imaginary part of the data can be decoupled from
1939 the real part. This procedure leads to a dedicated FFT for real-valued
1940 data which works for any length and does not perform any unnecessary
1941 operations. It also allows us to derive a corresponding inverse FFT
1942 routine which transforms a half-complex sequence back into real data.
1943
1944 \subsection{Radix-2 FFTs for real data}
1945 %
1946 Before embarking on the full mixed-radix real FFT we'll start with the
1947 radix-2 case. It contains all the essential features of the
1948 general-$N$ algorithm. To make it easier to see the analogy between
1949 the two we will use the mixed-radix notation to describe the
1950 factors. The factors are all 2,
1951 %
1952 \begin{equation}
1953 f_1 = 2, f_2 = 2, \dots, f_{n_f} = 2
1954 \end{equation}
1955 %
1956 and the products $p_i$ are powers of 2,
1957 %
1958 \begin{eqnarray}
1959 p_0 & = & 1 \\
1960 p_1 & = & f_1 = 2 \\
1961 p_2 & = & f_1 f_2 = 4 \\
1962 \dots &=& \dots \\
1963 p_i & = & f_1 f_2 \dots f_i = 2^i
1964 \end{eqnarray}
1965 %
1966 Using this notation we can rewrite the radix-2 decimation-in-time
1967 algorithm as,
1968 %
1969 \begin{algorithm}
1970 \STATE bit-reverse ordering of $g$
1971 \FOR {$i = 1 \dots n$}
1972   \FOR {$a = 0 \dots p_{i-1} - 1$}
1973     \FOR{$b = 0 \dots q_i - 1$}
1974         \STATE{$
1975                 \left(
1976                 \begin{array}{c}
1977                 g(b p_i + a) \\
1978                 g(b p_i + p_{i-1} + a)
1979                 \end{array}
1980                 \right)
1981                 =
1982                 \left(
1983                 \begin{array}{c}
1984                 g(b p_i + a) + W^a_{p_i} g(b p_i + p_{i-1} + a) \\
1985                 g(b p_i + a) - W^a_{p_i} g(b p_i + p_{i-1} + a)
1986                 \end{array}
1987                 \right) $}
1988      \ENDFOR
1989   \ENDFOR
1990 \ENDFOR
1991 \end{algorithm}
1992 %
1993 where we have used $p_i = 2 \Delta$, and factored $2 \Delta$ out of
1994 the original definition of $b$ ($b \to b p_i$).
1995
1996 If we go back to the original recurrence relations we can see how to
1997 write the intermediate results in a way which make the
1998 real/half-complex symmetries explicit at each step. The first pass is
1999 just a set of FFTs of length-2 on real values,
2000 %
2001 \begin{equation}
2002 g_1([b_0 b_1 b_2 a_0]) = \sum_{b_3} W^{a_0 b_3}_2 g([b_0 b_1 b_2 b_3])
2003 \end{equation}
2004 %
2005 Using the symmetry $FFT(x)_k = FFT(x)^*_{N-k}$ we have the reality
2006 condition,
2007 %
2008 \begin{eqnarray}
2009 g_1([b_0 b_1 b_2 0]) &=& \mbox{real} \\
2010 g_1([b_0 b_1 b_2 1]) &=& \mbox{real'}
2011 \end{eqnarray}
2012 %
2013 In the next pass we have a set of length-4 FFTs on the original data,
2014 %
2015 \begin{eqnarray}
2016 g_2([b_0 b_1 b_1 a_0])
2017 &=&
2018 \sum_{b_2}\sum_{b_3}
2019 W^{[a_1 a_0]b_2}_4 W^{a_0 b_3}_2
2020 g([b_0 b_1 b_2 b_3]) \\
2021 &=&
2022 \sum_{b_2}\sum_{b_3}
2023 W^{[a_1 a_0][b_3 b_2]}_4
2024 g([b_0 b_1 b_2 b_3])
2025 \end{eqnarray}
2026 %
2027 This time symmetry gives us the following conditions on the
2028 transformed data,
2029 %
2030 \begin{eqnarray}
2031 g_2([b_0 b_1 0 0]) &=& \mbox{real} \\
2032 g_2([b_0 b_1 0 1]) &=& x + i y  \\
2033 g_2([b_0 b_1 1 0]) &=& \mbox{real'} \\
2034 g_2([b_0 b_1 1 1]) &=& x - i y
2035 \end{eqnarray}
2036 %
2037 We can see a pattern emerging here: the $i$-th pass computes a set of
2038 independent length-$2^i$ FFTs on the original real data,
2039 %
2040 \begin{eqnarray}
2041 g_i ( b p_i  + a ) = \sum_{a' = 0}^{p_i-1} W_{p_i}^{aa'} g(b p_i + a')
2042 \quad
2043 \mbox{for $b = 0 \dots q_i - 1$}
2044 \end{eqnarray}
2045 %
2046 As a consequence the we can apply the symmetry for an FFT of real data
2047 to all the intermediate results -- not just the final result.
2048 In general after the $i$-th pass we will
2049 have the symmetry,
2050 %
2051 \begin{eqnarray}
2052 g_i(b p_i) &=& \mbox{real} \\
2053 g_i(b p_i + a) &=& g_i(b p_i + p_i - a)^* \qquad a = 1 \dots p_{i}/2 - 1  \\
2054 g_i(b p_i + p_{i}/2) &=& \mbox{real'}
2055 \end{eqnarray}
2056 %
2057 In the next section we'll show that this is a general property of
2058 decimation-in-time algorithms. The same is not true for the
2059 decimation-in-frequency algorithm, which does not have any simple
2060 symmetries in the intermediate results.
2061
2062 Since we can obtain the values of $g_i(b p_i + a)$ for $a > p_i/2$
2063 from the values for $a < p_i/2$ we can cut our computation and
2064 storage in half compared with the full-complex case.
2065 %
2066 We can easily rewrite the algorithm to show how the computation can be
2067 halved, simply by limiting all terms to involve only values for $a
2068 \leq p_{i}/2$. Whenever we encounter a term $g_i(b p_i + a)$ with $a >
2069 p_{i}/2$ we rewrite it in terms of its complex symmetry partner,
2070 $g_i(b p_i + a')^*$, where $a' = p_i - a$.  The butterfly computes two
2071 values for each value of $a$, $b p_i + a$ and $b p_i + p_{i-1} - a$,
2072 so we actually only need to compute from $a = 0$ to $p_{i-1}/2$.  This
2073 gives the following algorithm,
2074 %
2075 \begin{algorithm}
2076 \FOR {$a = 0 \dots p_{i-1}/2$}
2077   \FOR{$b = 0 \dots q_i - 1$}
2078         \STATE{$
2079                 \left(
2080                 \begin{array}{c}
2081                 g(b p_i + a) \\
2082                 g(b p_i + p_{i-1} - a)^*
2083                 \end{array}
2084                 \right)
2085                 =
2086                 \left(
2087                 \begin{array}{c}
2088                 g(b p_{i} + a) + W^a_{p_i} g(b p_i + p_{i-1} + a) \\
2089                 g(b p_{i} + a) - W^a_{p_i} g(b p_i + p_{i-1} + a)
2090                 \end{array}
2091                 \right) $}
2092      \ENDFOR
2093   \ENDFOR
2094 \end{algorithm}
2095 %
2096 Although we have halved the number of operations we also need a
2097 storage arrangement which will halve the memory requirement. The
2098 algorithm above is still formulated in terms of a complex array $g$,
2099 but the input to our routine will naturally be an array of $N$ real
2100 values which we want to use in-place.
2101
2102 Therefore we need a storage scheme which lays out the real and
2103 imaginary parts within the real array, in a natural way so that there
2104 is no need for complicated index calculations. In the radix-2
2105 algorithm we do not have any additional scratch space. The storage
2106 scheme has to be designed to accommodate the in-place calculation
2107 taking account of dual node pairs.
2108
2109 Here is a scheme which takes these restrictions into account: On the
2110 $i$-th pass we store the real part of $g(b p_i + a)$ in location $b
2111 p_i + a$. We store the imaginary part in location $b p_i + p_i -
2112 a$. This is the redundant location which corresponds to the conjugate
2113 term $g(b p_i + a)^* = g(b p_i + p_i -a)$, so it is not needed.  When
2114 the results are purely real (as in the case $a = 0$ and $a = p_i/2$ we
2115 store only the real part and drop the zero imaginary part).
2116
2117 This storage scheme has to work in-place, because the radix-2 routines
2118 should not use any scratch space. We will now check the in-place
2119 property for each butterfly.  A crucial point is that the scheme is
2120 pass-dependent. Namely, when we are computing the result for pass $i$
2121 we are reading the results of pass $i-1$, and we must access them
2122 using the scheme from the previous pass, i.e. we have to remember that
2123 the results from the previous pass were stored using $b p_{i-1} + a$,
2124 not $b p_i + a$, and the symmetry for these results will be $g_{i-1}(b
2125 p_{i-1} + a) = g_{i-1}(b p_{i-1} + p_{i-1} - a)^*$. To take this into
2126 account we'll write the right hand side of the iteration, $g_{i-1}$,
2127 in terms of $p_{i-1}$. For example, instead of $b p_i$, which occurs
2128 naturally in $g_i(b p_i + a)$ we will use $2 b p_{i-1}$, since $p_i =
2129 2 p_{i-1}$.
2130
2131 Let's start with the butterfly for $a = 0$,
2132 %
2133 \begin{equation}
2134 \left(
2135 \begin{array}{c}
2136 g(b p_i) \\
2137 g(b p_i + p_{i-1})^*
2138 \end{array}
2139 \right)
2140 =
2141 \left(
2142 \begin{array}{c}
2143 g(2 b p_{i-1}) + g((2 b + 1) p_{i-1}) \\
2144 g(2 b p_{i-1}) - g((2 b + 1) p_{i-1})
2145 \end{array}
2146 \right)
2147 \end{equation}
2148 %
2149 By the symmetry $g_{i-1}(b p_{i-1} + a) = g_{i-1}(b p_{i-1} + p_{i-1}
2150 - a)^*$ all the inputs are purely real.  The input $g(2 b p_{i-1})$ is
2151 read from location $2 b p_{i-1}$ and $g((2 b + 1) p_{i-1})$ is read
2152 from the location $(2 b + 1) p_{i-1}$. Here is the full breakdown,
2153 %
2154 \begin{center}
2155 \renewcommand{\arraystretch}{1.5}
2156 \begin{tabular}{|l|lll|}
2157 \hline Term & & Location & \\
2158 \hline
2159 $g(2 b p_{i-1})$
2160         & real part & $2 b p_{i-1} $ &$= b p_i$ \\
2161         & imag part & --- & \\
2162 $g((2 b+1) p_{i-1})$
2163         & real part & $(2 b + 1) p_{i-1}  $&$= b p_i + p_{i-1} $ \\
2164         & imag part & --- & \\
2165 \hline
2166 $g(b p_{i})$
2167         & real part & $b p_i$ &\\
2168         & imag part & --- & \\
2169 $g(b p_{i} + p_{i-1})$
2170         & real part & $b p_i + p_{i-1}$& \\
2171         & imag part & --- &  \\
2172 \hline
2173 \end{tabular}
2174 \end{center}
2175 %
2176 The conjugation of the output term $g(b p_i + p_{i-1})^*$ is
2177 irrelevant here since the results are purely real. The real results
2178 are stored in locations $b p_i$ and $b p_i + p_{i-1}$, which
2179 overwrites the inputs in-place.
2180
2181 The general butterfly for $a = 1 \dots p_{i-1}/2 - 1$ is,
2182 %
2183 \begin{equation}
2184 \left(
2185 \begin{array}{c}
2186 g(b p_i + a) \\
2187 g(b p_i + p_{i-1} - a)^*
2188 \end{array}
2189 \right)
2190 =
2191 \left(
2192 \begin{array}{c}
2193 g(2 b p_{i-1} + a) + W^a_{p_i} g((2 b + 1) p_{i-1} + a) \\
2194 g(2 b p_{i-1} + a) - W^a_{p_i} g((2 b + 1) p_{i-1} + a)
2195 \end{array}
2196 \right)
2197 \end{equation}
2198 %
2199 All the terms are complex. To store a conjugated term like $g(b' p_i +
2200 a')^*$ where $a > p_i/2$ we take the real part and store it in
2201 location $b' p_i + a'$ and then take imaginary part, negate it, and
2202 store the result in location $b' p_i + p_i - a'$.
2203
2204 Here is the full breakdown of the inputs and outputs from the
2205 butterfly,
2206 %
2207 \begin{center}
2208 \renewcommand{\arraystretch}{1.5}
2209 \begin{tabular}{|l|lll|}
2210 \hline Term & & Location & \\
2211 \hline
2212 $g(2 b p_{i-1} + a)$
2213         & real part & $2 b p_{i-1} + a $ &$= b p_i + a$ \\
2214         & imag part & $2 b p_{i-1} + p_{i-1} - a$ &$= b p_i + p_{i-1} - a$ \\
2215 $g((2 b+1) p_{i-1} + a)$
2216         & real part & $(2 b+1) p_{i-1} + a $&$= b p_i + p_{i-1} + a $ \\
2217         & imag part & $(2 b+1) p_{i-1} + p_{i-1} - a $&$= b p_i + p_i - a$\\
2218 \hline
2219 $g(b p_{i} + a)$
2220         & real part & $b p_i + a$ &\\
2221         & imag part & $b p_i + p_i - a$& \\
2222 $g(b p_{i} + p_{i-1} - a)$
2223         & real part & $b p_i + p_{i-1} - a$& \\
2224         & imag part & $b p_i + p_{i-1} + a$&  \\
2225 \hline
2226 \end{tabular}
2227 \end{center}
2228 %
2229 By comparing the input locations and output locations we can see
2230 that the calculation is done in place.
2231
2232 The final butterfly for $a = p_{i-1}/2$ is,
2233 %
2234 \begin{equation}
2235 \left(
2236 \begin{array}{c}
2237 g(b p_i + p_{i-1}/2) \\
2238 g(b p_i + p_{i-1} - p_{i-1}/2)^*
2239 \end{array}
2240 \right)
2241 =
2242 \left(
2243 \begin{array}{c}
2244 g(2 b p_{i-1} + p_{i-1}/2) - i g((2 b + 1) p_{i-1} + p_{i-1}/2) \\
2245 g(2 b p_{i-1} + p_{i-1}/2) + i g((2 b + 1) p_{i-1} + p_{i-1}/2)
2246 \end{array}
2247 \right)
2248 \end{equation}
2249 %
2250 where we have substituted for the twiddle factor, $W^a_{p_i} = -i$,
2251 %
2252 \begin{eqnarray}
2253 W^{p_{i-1}/2}_{p_i} &=& \exp(-2\pi i p_{i-1}/2 p_i) \\
2254                     &=& \exp(-2\pi i /4) \\
2255                     &=& -i
2256 \end{eqnarray}
2257 %
2258 For this butterfly the second line is just the conjugate of the first,
2259 because $p_{i-1} - p_{i-1}/2 = p_{i-1}/2$. Therefore we only need to
2260 consider the first line. The breakdown of the inputs and outputs is,
2261 %
2262 \begin{center}
2263 \renewcommand{\arraystretch}{1.5}
2264 \begin{tabular}{|l|lll|}
2265 \hline Term & & Location & \\
2266 \hline
2267 $g(2 b p_{i-1} + p_{i-1}/2)$
2268         & real part & $2 b p_{i-1} + p_{i-1}/2 $ &$= b p_i + p_{i-1}/2$ \\
2269         & imag part & --- & \\
2270 $g((2 b + 1) p_{i-1} + p_{i-1}/2)$
2271         & real part & $(2 b + 1) p_{i-1} + p_{i-1}/2 $&$= b p_i + p_{i} - p_{i-1}/2 $ \\
2272         & imag part & --- & \\
2273 \hline
2274 $g(b p_{i} + p_{i-1}/2)$
2275         & real part & $b p_i + p_{i-1}/2$ &\\
2276         & imag part & $b p_i + p_i - p_{i-1}/2$& \\
2277 \hline
2278 \end{tabular}
2279 \end{center}
2280 %
2281 By comparing the locations of the inputs and outputs with the
2282 operations in the butterfly we find that this computation is very
2283 simple: the effect of the butterfly is to negate the location $b p_i +
2284 p_i - p_{i-1}/2$ and leave other locations unchanged. This is clearly
2285 an in-place operation.
2286
2287 Here is the radix-2 algorithm for real data, in full, with the cases
2288 of $a=0$, $a=1\dots p_{i-1}/2 - 1$ and $a = p_{i-1}/2$ in separate
2289 blocks,
2290 %
2291 \begin{algorithm}
2292 \STATE bit-reverse ordering of $g$
2293 \FOR {$i = 1 \dots n$}
2294   \FOR{$b = 0 \dots q_i - 1$}
2295   \STATE{$\left(
2296           \begin{array}{c}
2297           g(b p_i) \\
2298           g(b p_i + p_{i-1})
2299           \end{array}
2300           \right)
2301           \Leftarrow
2302           \left(
2303           \begin{array}{c}
2304           g(b p_{i}) + g(b p_{i} + p_{i-1}) \\
2305           g(b p_{i}) - g(b p_{i} + p_{i-1})
2306           \end{array}
2307           \right)$}
2308   \ENDFOR
2309
2310   \FOR {$a = 1 \dots p_{i-1}/2 - 1$}
2311     \FOR{$b = 0 \dots q_i - 1$}
2312         \STATE{$(\Real z_0, \Imag z_0) \Leftarrow
2313                 (g(b p_i + a),  g(b p_i + p_{i-1} - a))$}
2314         \STATE{$(\Real z_1, \Imag z_1) \Leftarrow
2315                 (g(b p_i + p_{i-1} + a), g(b p_i + p_{i} - a))$}
2316         \STATE{$t_0 \Leftarrow z_0 + W^a_{p_i} z_1$}
2317         \STATE{$t_1 \Leftarrow z_0 - W^a_{p_i} z_1$}
2318         \STATE{$(g(b p_{i} + a),g(b p_{i} + p_i - a) \Leftarrow
2319                 (\Real t_0, \Imag t_0)$}
2320         \STATE{$(g(b p_{i} + p_{i-1} - a), g(b p_{i} + p_{i-1} + a))
2321                  \Leftarrow
2322                 (\Real t_1, -\Imag t_1)$}
2323      \ENDFOR
2324   \ENDFOR
2325
2326   \FOR{$b = 0 \dots q_i - 1$}
2327         \STATE{$g(b p_{i} - p_{i-1}/2) \Leftarrow -g(b p_{i} - p_{i-1}/2)$}
2328   \ENDFOR
2329
2330 \ENDFOR
2331 \end{algorithm}
2332 %
2333 We split the loop over $a$ into three parts, $a=0$, $a=1\dots
2334 p_{i-1}/2-1$ and $a = p_{i-1}/2$ for efficiency.  When $a=0$ we have
2335 $W^a_{p_i}=1$ so we can eliminate a complex multiplication within the
2336 loop over $b$. When $a=p_{i-1}/2$ we have $W^a_{p_i} = -i$ which does
2337 not require a full complex multiplication either.
2338
2339
2340 \subsubsection{Calculating the Inverse}
2341 %
2342 The inverse FFT of complex data was easy to calculate, simply by
2343 taking the complex conjugate of the DFT matrix. The input data and
2344 output data were both complex and did not have any special
2345 symmetry. For real data the inverse FFT is more complicated because
2346 the half-complex symmetry of the transformed data is
2347 different from the purely real input data.
2348
2349 We can compute an inverse by stepping backwards through the forward
2350 transform.  To simplify the inversion it's convenient to write the
2351 forward algorithm with the butterfly in matrix form,
2352 %
2353 \begin{algorithm}
2354 \FOR {$i = 1 \dots n$}
2355   \FOR {$a = 0 \dots p_{i-1}/2$}
2356     \FOR{$b = 0 \dots q_i - 1$}
2357         \STATE{$
2358                 \left(
2359                 \begin{array}{c}
2360                 g(b p_i + a) \\
2361                 g(b p_i + p_{i-1} + a)
2362                 \end{array}
2363                 \right)
2364                 =
2365                 \left(
2366                 \begin{array}{cc}
2367                 1 & W^a_{p_{i}} \\
2368                 1 & -W^a_{p_{i}}
2369                 \end{array}
2370                 \right)
2371                 \left(
2372                 \begin{array}{c}
2373                 g(2 b p_{i-1} + a) \\
2374                 g((2 b + 1) p_{i-1} + a)
2375                 \end{array}
2376                 \right) $}
2377      \ENDFOR
2378   \ENDFOR
2379 \ENDFOR
2380 \end{algorithm}
2381 %
2382 To invert the algorithm we run the iterations backwards and invert the
2383 matrix multiplication in the innermost loop,
2384 %
2385 \begin{algorithm}
2386 \FOR {$i = n \dots 1$}
2387   \FOR {$a = 0 \dots p_{i-1}/2$}
2388     \FOR{$b = 0 \dots q_i - 1$}
2389         \STATE{$
2390                 \left(
2391                 \begin{array}{c}
2392                 g(2 b p_{i-1} + a) \\
2393                 g((2 b + 1) p_{i-1} + a)
2394                 \end{array}
2395                 \right)
2396                 =
2397                 \left(
2398                 \begin{array}{cc}
2399                 1 & W^a_{p_{i}} \\
2400                 1 & -W^a_{p_{i}}
2401                 \end{array}
2402                 \right)^{-1}
2403                 \left(
2404                 \begin{array}{c}
2405                 g(b p_i + a) \\
2406                 g(b p_i + p_{i-1} + a)
2407                 \end{array}
2408                 \right) $}
2409      \ENDFOR
2410   \ENDFOR
2411 \ENDFOR
2412 \end{algorithm}
2413 %
2414 There is no need to reverse the loops over $a$ and $b$ because the
2415 result is independent of their order. The inverse of the matrix that
2416 appears is,
2417 %
2418 \begin{equation}
2419 \left(
2420 \begin{array}{cc}
2421 1 & W^a_{p_{i}} \\
2422 1 & -W^a_{p_{i}}
2423 \end{array}
2424 \right)^{-1}
2425 =
2426 {1 \over 2}
2427 \left(
2428 \begin{array}{cc}
2429 1 & 1 \\
2430 W^{-a}_{p_{i}} & -W^{-a}_{p_{i}}
2431 \end{array}
2432 \right)
2433 \end{equation}
2434 %
2435 To save divisions we remove the factor of $1/2$ inside the loop. This
2436 computes the unnormalized inverse, and the normalized inverse can be
2437 retrieved by dividing the final result by $N = 2^n$.
2438
2439 Here is the radix-2 half-complex to real inverse FFT algorithm, taking
2440 into account the radix-2 storage scheme,
2441 %
2442 \begin{algorithm}
2443 \FOR {$i = n \dots 1$}
2444   \FOR{$b = 0 \dots q_i - 1$}
2445   \STATE{$\left(
2446           \begin{array}{c}
2447           g(b p_i) \\
2448           g(b p_i + p_{i-1})
2449           \end{array}
2450           \right)
2451           \Leftarrow
2452           \left(
2453           \begin{array}{c}
2454           g(b p_{i}) + g(b p_{i} + p_{i-1}) \\
2455           g(b p_{i}) - g(b p_{i} + p_{i-1})
2456           \end{array}
2457           \right)$}
2458   \ENDFOR
2459
2460   \FOR {$a = 1 \dots p_{i-1}/2 - 1$}
2461     \FOR{$b = 0 \dots q_i - 1$}
2462         \STATE{$(\Real z_0, \Imag z_0)
2463                 \Leftarrow
2464                 (g(b p_i + a), g(b p_i + p_{i} - a))$}
2465         \STATE{$(\Real z_1, \Imag z_1)
2466                 \Leftarrow
2467                 (g(b p_i + p_{i-1} - a),  -g(b p_i + p_{i-1} + a))$}
2468         \STATE{$t_0 \Leftarrow z_0 + z_1$}
2469         \STATE{$t_1 \Leftarrow z_0 - z_1$}
2470         \STATE{$(g(b p_{i} + a), g(b p_{i} + p_{i-1} - a))
2471                  \Leftarrow
2472                 (\Real t_0, \Imag t_0) $}
2473         \STATE{$(g(b p_{i} + p_{i-1} + a),g(b p_{i} + p_{i} - a))
2474                 \Leftarrow
2475                 (\Real(W^a_{p_i}t_1), \Imag(W^a_{p_i}t_1))$}
2476      \ENDFOR
2477   \ENDFOR
2478
2479   \FOR{$b = 0 \dots q_i - 1$}
2480         \STATE{$g(b p_{i} + p_{i-1}/2) \Leftarrow 2 g(b p_{i} + p_{i-1}/2)$}
2481         \STATE{$g(b p_{i} + p_{i-1} + p_{i-1}/2) \Leftarrow -2 g(b p_{i} + p_{i-1} + p_{i-1}/2)$}
2482   \ENDFOR
2483
2484 \ENDFOR
2485 \STATE bit-reverse ordering of $g$
2486 \end{algorithm}
2487
2488
2489
2490 \subsection{Mixed-Radix FFTs for real data}
2491 %
2492 As discussed earlier the radix-2 decimation-in-time algorithm had the
2493 special property that its intermediate passes are interleaved fourier
2494 transforms of the original data, and this generalizes to the
2495 mixed-radix algorithm. The complex mixed-radix algorithm that we
2496 derived earlier was a decimation-in-frequency algorithm, but we can
2497 obtain a decimation-in-time version by taking the transpose of the
2498 decimation-in-frequency DFT matrix like this,
2499 %
2500 \begin{eqnarray}
2501 W_N &=& W_N^T  \\
2502 &=& (T_{n_f} \dots T_2 T_1)^T \\
2503 &=& T_1^T T_2^T \dots T_{n_f}^T
2504 \end{eqnarray}
2505 %
2506 with,
2507 %
2508 \begin{eqnarray}
2509 T_i^T &=& \left( (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}})
2510         (W_{f_i} \otimes I_{m_i}) \right)^T \\
2511         &=&     (W_{f_i} \otimes I_{m_i})
2512                 ( D^{f_i}_{q_i} (P^{f_i}_{q_i})^T \otimes I_{p_{i-1}}).
2513 \end{eqnarray}
2514 %
2515 We have used the fact that $W$, $D$ and $I$ are symmetric and that the
2516 permutation matrix $P$ obeys,
2517 %
2518 \begin{equation}
2519 (P^a_b)^T = P^b_a.
2520 \end{equation}
2521 %
2522 From the definitions of $D$ and $P$ we can derive the following identity,
2523 %
2524 \begin{equation}
2525 D^a_b P^b_a = P^b_a D^b_a.
2526 \end{equation}
2527 %
2528 This allows us to put the transpose into a simple form,
2529 %
2530 \begin{equation}
2531 T_i^T =         (W_{f_i} \otimes I_{m_i})
2532                 (P^{q_i}_{f_i} D^{q_i}_{f_i} \otimes I_{p_{i-1}}).
2533 \end{equation}
2534 %
2535 The transposed matrix, $T^T$ applies the digit-reversal $P$ before the
2536 DFT $W$, giving the required decimation-in-time algorithm.  The
2537 transpose reverses the order of the factors --- $T_{n_f}$ is applied
2538 first and $T_1$ is applied last. For convenience we can reverse the
2539 order of the factors, $f_1 \leftrightarrow f_{n_f}$, $f_2
2540 \leftrightarrow f_{n_f-1}$, \dots and make the corresponding
2541 substitution $p_{i-1} \leftrightarrow q_i$. These substitutions give
2542 us a decimation-in-time algorithm with the same ordering as the
2543 decimation-in-frequency algorithm,
2544 %
2545 \begin{equation}
2546 W_N = T_{n_f} \dots T_2 T_1
2547 \end{equation}
2548 %
2549 \begin{equation}
2550 T_i = (W_{f_i} \otimes I_{m_i})
2551         (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i})
2552 \end{equation}
2553 %
2554 where $p_i$, $q_i$ and $m_i$ now have the same meanings as before,
2555 namely,
2556 %
2557 \begin{eqnarray}
2558 p_i &=& f_1 f_2 \dots f_i \quad (p_0 = 1)  \\
2559 q_i &=& N / p_i \\
2560 m_i &=& N / f_i
2561 \end{eqnarray}
2562 %
2563 Now we would like to prove that the iteration for computing $x = W z =
2564 T_{n_f} \dots T_2 T_1 z$ has the special property interleaving
2565 property. If we write the result of each intermediate pass as
2566 $v^{(i)}$,
2567 %
2568 \begin{eqnarray}
2569 v^{(0)} &=& z \\
2570 v^{(1)} &=& T_1 v^{(0)} \\
2571 v^{(2)} &=& T_2 v^{(1)} \\
2572 \dots   &=& \dots \\
2573 v^{(i)} &=& T_i v^{(i-1)}
2574 \end{eqnarray}
2575 %
2576 then we will show that the intermediate results $v^{(i)}$ on any pass
2577 can be written as,
2578 %
2579 \begin{equation}
2580 v^{(i)} = (W_{p_i} \otimes I_{q_i}) z
2581 \end{equation}
2582 %
2583 Each intermediate stage will be a set of $q_i$ interleaved fourier
2584 transforms, each of length $p_i$. We can prove this result by
2585 induction. First we assume that the result is true for $v^{(i-1)}$,
2586 %
2587 \begin{equation}
2588 v^{(i-1)} = (W_{p_{i-1}} \otimes I_{q_{i-1}}) z \qquad \mbox{(assumption)}
2589 \end{equation}
2590 %
2591 And then we examine the next iteration using this assumption,
2592 %
2593 \begin{eqnarray}
2594 v^{(i)} &=& T_i v^{(i-1)} \\
2595         &=& T_i (W_{p_{i-1}} \otimes I_{q_{i-1}}) z \\
2596         &=& (W_{f_i} \otimes I_{m_i})
2597                 (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i})
2598                 (W_{p_{i-1}} \otimes I_{q_{i-1}}) z \label{dit-induction}
2599 \end{eqnarray}
2600 %
2601 Using the relation $m_i = p_{i-1} q_i$, we can write $I_{m_i}$ as
2602 $I_{p_{i-1} q_i}$ and $I_{q_{i-1}}$ as $I_{f_i q_i}$. By combining these
2603 with the basic matrix identity,
2604 %
2605 \begin{equation}
2606 I_{ab} = I_a \otimes I_b
2607 \end{equation}
2608 %
2609 we can rewrite $v^{(i)}$ in the following form,
2610 %
2611 \begin{equation}
2612 v^{(i)} =  (((W_{f_i} \otimes I_{p_{i-1}})
2613                 (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})
2614                 (W_{p_{i-1}} \otimes I_{f_i})) \otimes I_{q_i}) z .
2615 \end{equation}
2616 %
2617 The first part of this matrix product is the two-factor expansion of
2618 $W_{ab}$, for $a = p_{i-1}$ and $b = f_i$,
2619 %
2620 \begin{equation}
2621 W_{p_{i-1} f_i} = ((W_{f_i} \otimes I_{p_{i-1}})
2622                   (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})
2623                   (W_{p_{i-1}} \otimes I_{f_i})).
2624 \end{equation}
2625 %
2626 If we substitute this result, remembering that $p_i = p_{i-1} f_i$, we
2627 obtain,
2628 %
2629 \begin{equation}
2630 v^{(i)} = (W_{p_i} \otimes I_{q_i}) z
2631 \end{equation}
2632 %
2633 which is the desired result. The case $i=1$ can be verified
2634 explicitly, and induction then shows that the result is true for all
2635 $i$.  As discussed for the radix-2 algorithm this result is important
2636 because if the initial data $z$ is real then each intermediate pass is
2637 a set of interleaved fourier transforms of $z$, having half-complex
2638 symmetries (appropriately applied in the subspaces of the Kronecker
2639 product). Consequently only $N$ real numbers are needed to store the
2640 intermediate and final results.
2641
2642 \subsection{Implementation}
2643 %
2644 The implementation of the mixed-radix real FFT algorithm follows the
2645 same principles as the full complex transform. Some of the steps are
2646 applied in the opposite order because we are dealing with a decimation
2647 in time algorithm instead of a decimation in frequency algorithm, but
2648 the basic outer structure of the algorithm is the same. We want to
2649 apply the factorized version of the DFT matrix $W_N$ to the input
2650 vector $z$,
2651 %
2652 \begin{eqnarray}
2653 x &=& W_N z \\
2654   &=& T_{n_f} \dots T_2 T_1 z
2655 \end{eqnarray}
2656 %
2657 We loop over the $n_f$ factors, applying each matrix $T_i$ to the
2658 vector in turn to build up the complete transform,
2659 %
2660 \begin{algorithm}
2661 \FOR{$(i = 1 \dots n_f)$}
2662         \STATE{$v \Leftarrow T_i v $}
2663 \ENDFOR
2664 \end{algorithm}
2665 %
2666 In this case the definition of $T_i$ is different because we have
2667 taken the transpose,
2668 %
2669 \begin{equation}
2670 T_i =
2671   (W_{f_i} \otimes I_{m_i})
2672   (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i}).
2673 \end{equation}
2674 %
2675 We'll define a temporary vector $t$ to denote the results of applying the
2676 rightmost matrix,
2677 %
2678 \begin{equation}
2679 t = (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i}) v
2680 \end{equation}
2681 %
2682 If we expand this out into individual components, as before, we find a
2683 similar simplification,
2684 %
2685 \begin{eqnarray}
2686 t_{aq+b}
2687 &=&
2688 \sum_{a'b'}
2689 (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i})_{(aq+b)(a'q+b')}
2690 v_{a'q+b'} \\
2691 &=&
2692 \sum_{a'}
2693 (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})_{a a'}
2694 v_{a'q+b}
2695 \end{eqnarray}
2696 %
2697 We have factorized the indices into the form $aq+b$, with $0 \leq a <
2698 p_{i}$ and $0 \leq b < q$.  Just as in the decimation in frequency
2699 algorithm we can split the index $a$ to remove the matrix
2700 multiplication completely. We have to write $a$ as $\mu f + \lambda$,
2701 where $0 \leq \mu < p_{i-1}$ and $0 \leq \lambda < f$,
2702 %
2703 \begin{eqnarray}
2704 t_{(\mu f + \lambda)q+b}
2705 &=&
2706 \sum_{\mu'\lambda'}
2707 (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})_{(\mu f + \lambda)(\mu' f + \lambda')}
2708 v_{(\mu' f + \lambda')q+b}
2709 \\
2710 &=&
2711 \sum_{\mu'\lambda'}
2712 (P^{p_{i-1}}_{f_i})_{(\mu f + \lambda)(\mu' f + \lambda')}
2713 \omega^{\mu'\lambda'}_{p_{i}}
2714 v_{(\mu' f + \lambda')q+b}
2715 \end{eqnarray}
2716 %
2717 The matrix $P^{p_{i-1}}_{f_i}$ exchanges an index of $(\mu f +
2718 \lambda) q + b$ with $(\lambda p_{i-1} + \mu) q + b$, giving a final
2719 result of,
2720 %
2721 \begin{eqnarray}
2722 t_{(\lambda p_{i-1} + \mu) q + b}
2723 =
2724 w^{\mu\lambda}_{p_i} v_{(\mu f + \lambda)q +b}
2725 \end{eqnarray}
2726 %
2727 To calculate the next stage,
2728 %
2729 \begin{equation}
2730 v' = (W_{f_i} \otimes I_{m_i}) t,
2731 \end{equation}
2732 %
2733 we temporarily rearrange the index of $t$ to separate the $m_{i}$
2734 independent DFTs in the Kronecker product,
2735 %
2736 \begin{equation}
2737 v'_{(\lambda p_{i-1} + \mu) q + b}
2738 =
2739 \sum_{\lambda' \mu' b'}
2740 (W_{f_i} \otimes I_{m_i})_{
2741 ((\lambda p_{i-1} + \mu) q + b)
2742 ((\lambda' p_{i-1} + \mu') q + b')}
2743 t_{(\lambda' p_{i-1} + \mu') q + b'}
2744 \end{equation}
2745 %
2746 If we use the identity $m = p_{i-1} q$ to rewrite the index of $t$
2747 like this,
2748 %
2749 \begin{equation}
2750 t_{(\lambda p_{i-1} + \mu) q + b} = t_{\lambda m + (\mu q + b)}
2751 \end{equation}
2752 %
2753 we can split the Kronecker product,
2754 %
2755 \begin{eqnarray}
2756 v'_{(\lambda p_{i-1} + \mu) q + b}
2757 &=&
2758 \sum_{\lambda' \mu' b'}
2759 (W_{f_i} \otimes I_{m_i})_{
2760 ((\lambda p_{i-1} + \mu) q + b)
2761 ((\lambda' p_{i-1} + \mu') q + b')}
2762 t_{(\lambda' p_{i-1} + \mu') q + b'}\\
2763 &=&
2764 \sum_{\lambda'}
2765 (W_{f_i})_{\lambda \lambda'}
2766 t_{\lambda' m_i + (\mu q + b)}
2767 \end{eqnarray}
2768 %
2769 If we switch back to the original form of the index in the last line we obtain,
2770 %
2771 \begin{eqnarray}
2772 \phantom{v'_{(\lambda p_{i-1} + \mu) q + b}}
2773 &=&
2774 \sum_{\lambda'}
2775 (W_{f_i})_{\lambda \lambda'}
2776 t_{(\lambda p_{i-1} + \mu) q + b}
2777 \end{eqnarray}
2778 %
2779 which allows us to substitute our previous result for $t$,
2780 %
2781 \begin{equation}
2782 v'_{(\lambda p_{i-1} + \mu) q + b}
2783 =
2784 \sum_{\lambda'}
2785 (W_{f_i})_{\lambda \lambda'}
2786 w^{\mu\lambda'}_{p_i} v_{(\mu f + \lambda')q + b}
2787 \end{equation}
2788 %
2789 This gives us the basic decimation-in-time mixed-radix algorithm for
2790 complex data which we will be able to specialize to real data,
2791 %
2792 \begin{algorithm}
2793 \FOR{$i = 1 \dots n_f$}
2794 \FOR{$\mu = 0 \dots p_{i-1} - 1$}
2795 \FOR{$b = 0 \dots q-1$}
2796 \FOR{$\lambda = 0 \dots f-1$}
2797 \STATE{$t_\lambda \Leftarrow
2798                \omega^{\mu\lambda'}_{p_{i}} v_{(\mu f + \lambda')q + b}$}
2799 \ENDFOR
2800 \FOR{$\lambda = 0 \dots f-1$}
2801 \STATE{$v'_{(\lambda p_{i-1} + \mu)q + b} =
2802        \sum_{\lambda'=0}^{f-1}  W_f(\lambda,\lambda') t_{\lambda'}$}
2803        \COMMENT{DFT matrix-multiply module}
2804 \ENDFOR
2805 \ENDFOR
2806 \ENDFOR
2807 \STATE{$v \Leftarrow v'$}
2808 \ENDFOR
2809 \end{algorithm}
2810 %
2811 We are now at the point where we can convert an algorithm formulated
2812 in terms of complex variables to one in terms of real variables by
2813 choosing a suitable storage scheme.  We will adopt the FFTPACK storage
2814 convention. FFTPACK uses a scheme where individual FFTs are
2815 contiguous, not interleaved, and real-imaginary pairs are stored in
2816 neighboring locations. This has better locality than was possible for
2817 the radix-2 case.
2818
2819 The interleaving of the intermediate FFTs results from the Kronecker
2820 product, $W_p \otimes I_q$. The FFTs can be made contiguous if we
2821 reorder the Kronecker product on the intermediate passes,
2822 %
2823 \begin{equation}
2824 W_p \otimes I_q \Rightarrow I_q \otimes W_p
2825 \end{equation}
2826 %
2827 This can be implemented by a simple change in indexing.  On pass-$i$
2828 we store element $v_{a q_i + b}$ in location $v_{b p_i+a}$. We
2829 compensate for this change by making the same transposition when
2830 reading the data. Note that this only affects the indices of the
2831 intermediate passes.  On the zeroth iteration the transposition has no
2832 effect because $p_0 = 1$. Similarly there is no effect on the last
2833 iteration, which has $q_{n_f} = 1$. This is how the algorithm above
2834 looks after this index transformation,
2835 %
2836 \begin{algorithm}
2837 \FOR{$i = 1 \dots n_f$}
2838 \FOR{$\mu = 0 \dots p_{i-1} - 1$}
2839 \FOR{$b = 0 \dots q-1$}
2840 \FOR{$\lambda = 0 \dots f-1$}
2841 \STATE{$t_\lambda \Leftarrow
2842                \omega^{\mu\lambda'}_{p_{i}} v_{(\lambda'q + b)p_{i-1} + \mu}$}
2843 \ENDFOR
2844 \FOR{$\lambda = 0 \dots f-1$}
2845 \STATE{$v'_{b p + (\lambda p_{i-1} + \mu)} =
2846        \sum_{\lambda'=0}^{f-1}  W_f(\lambda,\lambda') t_{\lambda'}$}
2847        \COMMENT{DFT matrix-multiply module}
2848 \ENDFOR
2849 \ENDFOR
2850 \ENDFOR
2851 \STATE{$v \Leftarrow v'$}
2852 \ENDFOR
2853 \end{algorithm}
2854 %
2855 We transpose the input terms by writing the index in the form $a
2856 q_{i-1} + b$, to take account of the pass-dependence of the scheme,
2857 %
2858 \begin{equation}
2859 v_{(\mu f + \lambda')q + b} = v_{\mu q_{i-1} + \lambda'q + b}
2860 \end{equation}
2861 %
2862 We used the identity $q_{i-1} = f q$ to split the index. Note that in
2863 this form $\lambda'q + b$ runs from 0 to $q_{i-1} - 1$ as $b$ runs
2864 from 0 to $q-1$ and $\lambda'$ runs from 0 to $f-1$. The transposition
2865 for the input terms then gives,
2866 %
2867 \begin{equation}
2868 v_{\mu q_{i-1} + \lambda'q + b} \Rightarrow  v_{(\lambda'q + b) p_{i-1} + \mu}
2869 \end{equation}
2870 %
2871 In the FFTPACK scheme the intermediate output data have the same
2872 half-complex symmetry as the radix-2 example, namely,
2873 %
2874 \begin{equation}
2875 v^{(i)}_{b p + a} = v^{(i)*}_{b p + (p - a)}
2876 \end{equation}
2877 %
2878 and on the input data from the previous pass have the symmetry,
2879 %
2880 \begin{equation}
2881 v^{(i-1)}_{(\lambda' q + b) p_{i-1} + \mu} = v^{(i-1)*}_{(\lambda' q +
2882 b) p_{i-1} + (p_{i-1} - \mu)}
2883 \end{equation}
2884 %
2885 Using these symmetries we can halve the storage and computation
2886 requirements for each pass. Compared with the radix-2 algorithm we
2887 have more freedom because the computation does not have to be done in
2888 place. The storage scheme adopted by FFTPACK places elements
2889 sequentially with real and imaginary parts in neighboring
2890 locations. Imaginary parts which are known to be zero are not
2891 stored. Here are the full details of the scheme,
2892 %
2893 \begin{center}
2894 \renewcommand{\arraystretch}{1.5}
2895 \begin{tabular}{|l|lll|}
2896 \hline Term & & Location & \\
2897 \hline
2898 $g(b p_i)$
2899         & real part & $b p_{i} $ & \\
2900         & imag part & --- & \\
2901 \hline
2902 $g(b p_i + a)$
2903         & real part & $b p_{i} + 2a - 1 $& for $a = 1 \dots p_i/2 - 1$ \\
2904         & imag part & $b p_{i} + 2a$ & \\
2905 \hline
2906 $g(b p_{i} + p_{i}/2)$
2907         & real part & $b p_i + p_{i} - 1$ & if $p_i$ is even\\
2908         & imag part & --- & \\
2909 \hline
2910 \end{tabular}
2911 \end{center}
2912 %
2913 The real element for $a=0$ is stored in location $bp$.  The real parts
2914 for $a = 1 \dots p/2 - 1$ are stored in locations $bp + 2a -1$ and the
2915 imaginary parts are stored in locations $b p + 2 a$.  When $p$ is even
2916 the term for $a = p/2$ is purely real and we store it in location $bp
2917 + p - 1$. The zero imaginary part is not stored.
2918
2919 When we compute the basic iteration,
2920 %
2921 \begin{equation}
2922 v^{(i)}_{b p + (\lambda p_{i-1} + \mu)} = \sum_{\lambda'}
2923 W^{\lambda \lambda'}_f
2924 \omega^{\mu\lambda'}_{p_i} v^{(i-1)}_{(\lambda' q + b)p_{i-1} + \mu}
2925 \end{equation}
2926 %
2927 we eliminate the redundant conjugate terms with $a > p_{i}/2$ as we
2928 did in the radix-2 algorithm. Whenever we need to store a term with $a
2929 > p_{i}/2$ we consider instead the corresponding conjugate term with
2930 $a' = p - a$. Similarly when reading data we replace terms with $\mu >
2931 p_{i-1}/2$ with the corresponding conjugate term for $\mu' = p_{i-1} -
2932 \mu$.
2933
2934 Since the input data on each stage has half-complex symmetry we only
2935 need to consider the range $\mu=0 \dots p_{i-1}/2$. We can consider
2936 the best ways to implement the basic iteration for each pass, $\mu = 0
2937 \dots p_{i-1}/2$.
2938
2939 On the first pass where $\mu=0$ we will be accessing elements which
2940 are the zeroth components of the independent transforms $W_{p_{i-1}}
2941 \otimes I_{q_{i-1}}$, and are purely real.
2942 %
2943 We can code the pass with $\mu=0$ as a special case for real input
2944 data, and conjugate-complex output. When $\mu=0$ the twiddle factors
2945 $\omega^{\mu\lambda'}_{p_i}$ are all unity, giving a further saving.
2946 We can obtain small-$N$ real-data DFT modules by removing the
2947 redundant operations from the complex modules.
2948 %
2949 For example the $N=3$ module was,
2950 %
2951 \begin{equation}
2952 \begin{array}{lll}
2953 t_1 = z_1 + z_2, &
2954 t_2 = z_0 - t_1/2, &
2955 t_3 = \sin(\pi/3) (z_1 - z_2),
2956 \end{array}
2957 \end{equation}
2958 \begin{equation}
2959 \begin{array}{lll}
2960 x_0 = z_0 + t_1, &
2961 x_1 = t_2 + i t_3, &
2962 x_2 = t_2 - i t_3.
2963 \end{array}
2964 \end{equation}
2965 %
2966 In the complex case all the operations were complex, for complex input
2967 data $z_0$, $z_1$, $z_2$. In the real case $z_0$, $z_1$ and $z_2$ are
2968 all real. Consequently $t_1$, $t_2$ and $t_3$ are also real, and the
2969 symmetry $x_1 = t_1 + i t_2 = x^*_2$ means that we do not have to
2970 compute $x_2$ once we have computed $x_1$.
2971
2972 For subsequent passes $\mu = 1 \dots p_{i-1}/2 - 1$ the input data is
2973 complex and we have to compute full complex DFTs using the same
2974 modules as in the complex case. Note that the inputs are all of the
2975 form $v_{(\lambda q + b) p_{i-1} + \mu}$ with $\mu < p_{i-1}/2$ so we
2976 never need to use the symmetry to access the conjugate elements with
2977 $\mu > p_{i-1}/2$.
2978
2979 If $p_{i-1}$ is even then we reach the halfway point $\mu=p_{i-1}/2$,
2980 which is another special case. The input data in this case is purely
2981 real because $\mu = p_{i-1} - \mu$ for $\mu = p_{i-1}/2$. We can code
2982 this as a special case, using real inputs and real-data DFT modules as
2983 we did for $\mu=0$. However, for $\mu = p_{i-1}/2$ the twiddle factors
2984 are not unity,
2985 %
2986 \begin{eqnarray}
2987 \omega^{\mu\lambda'}_{p_i} &=& \omega^{(p_{i-1}/2)\lambda'}_{p_i} \\
2988 &=& \exp(-i\pi\lambda'/f_i)
2989 \end{eqnarray}
2990 %
2991 These twiddle factors introduce an additional phase which modifies the
2992 symmetry of the outputs. Instead of the conjugate-complex symmetry
2993 which applied for $\mu=0$ there is a shifted conjugate-complex
2994 symmetry,
2995 %
2996 \begin{equation}
2997 t_\lambda = t^*_{f-(\lambda+1)}
2998 \end{equation}
2999 %
3000 This is easily proved,
3001 %
3002 \begin{eqnarray}
3003 t_\lambda
3004 &=&
3005 \sum e^{-2\pi i \lambda\lambda'/f_i} e^{-i\pi \lambda'/f_i} r_{\lambda'} \\
3006 t_{f - (\lambda + 1)}
3007 &=&
3008 \sum e^{-2\pi i (f-\lambda-1)\lambda'/f_i} e^{-i\pi \lambda'/f_i} r_{\lambda'} \\
3009 &=&
3010 \sum e^{2\pi i \lambda\lambda'/f_i} e^{i\pi \lambda'/f_i} r_{\lambda'} \\
3011 &=& t^*_\lambda
3012 \end{eqnarray}
3013 %
3014 The symmetry of the output means that we only need to compute half of
3015 the output terms, the remaining terms being conjugates or zero
3016 imaginary parts. For example, when $f=4$ the outputs are $(x_0 + i
3017 y_0, x_1 + i y_1, x_1 - i y_1, x_0 - i y_0)$. For $f=5$ the outputs
3018 are $(x_0 + i y_0, x_1 + i y_1, x_2, x_1 - i y_1, x_0 - i y_0)$. By
3019 combining the twiddle factors and DFT matrix we can make a combined
3020 module which applies both at the same time. By starting from the
3021 complex DFT modules and bringing in twiddle factors we can derive
3022 optimized modules. Here are the modules given by Temperton for $z = W
3023 \Omega x$ where $x$ is real and $z$ has the shifted conjugate-complex
3024 symmetry. The matrix of twiddle factors, $\Omega$, is given by,
3025 %
3026 \begin{equation}
3027 \Omega = \mathrm{diag}(1, e^{-i\pi/f}, e^{-2\pi i/f}, \dots, e^{-i\pi(f-1)/f})
3028 \end{equation}
3029 %
3030 We write $z$ in terms of two real vectors $z = a + i b$.
3031 %
3032 For $N=2$,
3033 %
3034 \begin{equation}
3035 \begin{array}{ll}
3036 a_0 = x_0, &
3037 b_0 = - x_1.
3038 \end{array}
3039 \end{equation}
3040 %
3041 For $N=3$,
3042 %
3043 \begin{equation}
3044 \begin{array}{l}
3045 t_1 = x_1 - x_2,
3046 \end{array}
3047 \end{equation}
3048 \begin{equation}
3049 \begin{array}{ll}
3050 a_0 = x_0 + t_1/2, & b_0 = x_0 - t_1,
3051 \end{array}
3052 \end{equation}
3053 \begin{equation}
3054 \begin{array}{l}
3055 a_1 = - \sin(\pi/3) (x_1 + x_2)
3056 \end{array}
3057 \end{equation}
3058 %
3059 For $N=4$,
3060 %
3061 \begin{equation}
3062 \begin{array}{ll}
3063 t_1 = (x_1 - x_3)/\sqrt{2}, & t_2 = (x_1 + x_3)/\sqrt{2},
3064 \end{array}
3065 \end{equation}
3066 \begin{equation}
3067 \begin{array}{ll}
3068 a_0 = x_0 + t_1, & b_0 = -x_2 - t_2,
3069 \end{array}
3070 \end{equation}
3071 \begin{equation}
3072 \begin{array}{ll}
3073 a_1 = x_0 - t_1, & b_1 = x_2 - t_2.
3074 \end{array}
3075 \end{equation}
3076 %
3077 For $N=5$,
3078 %
3079 \begin{equation}
3080 \begin{array}{ll}
3081 t_1 = x_1 - x_4, & t_2 = x_1 + x_4,
3082 \end{array}
3083 \end{equation}
3084 \begin{equation}
3085 \begin{array}{ll}
3086 t_3 = x_2 - x_3, & t_4 = x_2 + x_3,
3087 \end{array}
3088 \end{equation}
3089 \begin{equation}
3090 \begin{array}{ll}
3091 t_5 = t_1 - t_3, & t_6 = x_0 + t_5 / 4,
3092 \end{array}
3093 \end{equation}
3094 \begin{equation}
3095 \begin{array}{ll}
3096 t_7 = (\sqrt{5}/4)(t_1 + t_3) &
3097 \end{array}
3098 \end{equation}
3099 \begin{equation}
3100 \begin{array}{ll}
3101 a_0 = t_6 + t_7, & b_0 = -\sin(2\pi/10) t_2 - \sin(2\pi/5) t_4,
3102 \end{array}
3103 \end{equation}
3104 \begin{equation}
3105 \begin{array}{ll}
3106 a_1 = t_6 - t_7, & b_1 = -\sin(2\pi/5) t_2 + \sin(2\pi/10) t_4,
3107 \end{array}
3108 \end{equation}
3109 \begin{equation}
3110 \begin{array}{ll}
3111 a_2 = x_0 - t_5 &
3112 \end{array}
3113 \end{equation}
3114 %
3115 For $N=6$,
3116 %
3117 \begin{equation}
3118 \begin{array}{ll}
3119 t_1 = \sin(\pi/3)(x_5 - x_1), & t_2 = \sin(\pi/3) (x_2 + x_4),
3120 \end{array}
3121 \end{equation}
3122 \begin{equation}
3123 \begin{array}{ll}
3124 t_3 = x_2 - x_4, & t_4 = x_1 + x_5,
3125 \end{array}
3126 \end{equation}
3127 \begin{equation}
3128 \begin{array}{ll}
3129 t_5 = x_0 + t_3 / 2, & t_6 = -x_3 - t_4 / 2,
3130 \end{array}
3131 \end{equation}
3132 \begin{equation}
3133 \begin{array}{ll}
3134 a_0 = t_5 - t_1, & b_0 = t_6 - t_2,
3135 \end{array}
3136 \end{equation}
3137 \begin{equation}
3138 \begin{array}{ll}
3139 a_1 = x_0 - t_3, & b_1 = x_3 - t_4,
3140 \end{array}
3141 \end{equation}
3142 \begin{equation}
3143 \begin{array}{ll}
3144 a_2 = t_5 + t_1, & b_2 = t_6 + t_2
3145 \end{array}
3146 \end{equation}
3147
3148 \section{Computing the mixed-radix inverse for real data}
3149 %
3150 To compute the inverse of the mixed-radix FFT on real data we step
3151 through the algorithm in reverse and invert each operation.
3152
3153 This gives the following algorithm using FFTPACK indexing,
3154 %
3155 \begin{algorithm}
3156 \FOR{$i = n_f \dots 1$}
3157 \FOR{$\mu = 0 \dots p_{i-1} - 1$}
3158 \FOR{$b = 0 \dots q-1$}
3159 \FOR{$\lambda = 0 \dots f-1$}
3160 \STATE{$t_{\lambda'} =
3161        \sum_{\lambda'=0}^{f-1}  W_f(\lambda,\lambda')
3162        v_{b p + (\lambda p_{i-1} + \mu)}$}
3163        \COMMENT{DFT matrix-multiply module}
3164 \ENDFOR
3165 \FOR{$\lambda = 0 \dots f-1$}
3166 \STATE{$v'_{(\lambda'q + b)p_{i-1} + \mu} \Leftarrow
3167                \omega^{-\mu\lambda'}_{p_{i}} t_\lambda$}
3168 \ENDFOR
3169
3170 \ENDFOR
3171 \ENDFOR
3172 \STATE{$v \Leftarrow v'$}
3173 \ENDFOR
3174 \end{algorithm}
3175 %
3176 When $\mu=0$ we are applying an inverse DFT to half-complex data,
3177 giving a real result. The twiddle factors are all unity. We can code
3178 this as a special case, just as we did for the forward routine. We
3179 start with complex module and eliminate the redundant terms. In this
3180 case it is the final result which has the zero imaginary part, and we
3181 eliminate redundant terms by using the half-complex symmetry of the
3182 input data.
3183
3184 When $\mu=1 \dots p_{i-1}/2 - 1$ we have full complex transforms on
3185 complex data, so no simplification is possible.
3186
3187 When $\mu = p_{i-1}/2$ (which occurs only when $p_{i-1}$ is even) we
3188 have a combination of twiddle factors and DFTs on data with the
3189 shifted half-complex symmetry which give a real result. We implement
3190 this as a special module, essentially by inverting the system of
3191 equations given for the forward case. We use the modules given by
3192 Temperton, appropriately modified for our version of the algorithm. He
3193 uses a slightly different convention which differs by factors of two
3194 for some terms (consult his paper for details~\cite{temperton83real}).
3195
3196 For $N=2$,
3197 %
3198 \begin{equation}
3199 \begin{array}{ll}
3200 x_0 = 2 a_0, & x_1 = - 2 b_0 .
3201 \end{array}
3202 \end{equation}
3203 %
3204 For $N=3$,
3205 %
3206 \begin{equation}
3207 \begin{array}{ll}
3208 t_1 = a_0 - a_1, & t_2 = \sqrt{3} b_1, \\
3209 \end{array}
3210 \end{equation}
3211 \begin{equation}
3212 \begin{array}{lll}
3213 x_0 = 2 a_0 + a_1, & x_1 = t_1 - t_2, & x_2 = - t_1 - t_2
3214 \end{array}
3215 \end{equation}
3216 %
3217 For $N=4$,
3218 \begin{equation}
3219 \begin{array}{ll}
3220 t_1 = \sqrt{2} (b_0 + b_1), & t_2 = \sqrt{2} (a_0 - a_1),
3221 \end{array}
3222 \end{equation}
3223 \begin{equation}
3224 \begin{array}{ll}
3225 x_0 = 2(a_0 + a_1), & x_1 = t_2 - t_1 ,
3226 \end{array}
3227 \end{equation}
3228 \begin{equation}
3229 \begin{array}{ll}
3230 x_2 = 2(b_1 - b_0), & x_3 = -(t_2 + t_1).
3231 \end{array}
3232 \end{equation}
3233 %
3234 For $N=5$,
3235 %
3236 \begin{equation}
3237 \begin{array}{ll}
3238 t_1 = 2 (a_0 + a_1), & t_2 = t_1 / 4 - a_2,
3239 \end{array}
3240 \end{equation}
3241 \begin{equation}
3242 \begin{array}{ll}
3243 t_3 = (\sqrt{5}/2) (a_0 - a_1),
3244 \end{array}
3245 \end{equation}
3246 \begin{equation}
3247 \begin{array}{l}
3248 t_4 = 2(\sin(2\pi/10) b_0 + \sin(2\pi/5) b_1),
3249 \end{array}
3250 \end{equation}
3251 \begin{equation}
3252 \begin{array}{l}
3253 t_5 = 2(\sin(2\pi/10) b_0 - \sin(2\pi/5) b_1),
3254 \end{array}
3255 \end{equation}
3256 \begin{equation}
3257 \begin{array}{ll}
3258 t_6 = t_3 + t_2, & t_7 = t_3 - t_2,
3259 \end{array}
3260 \end{equation}
3261 \begin{equation}
3262 \begin{array}{ll}
3263 x_0 = t_1 + a_2, & x_1 = t_6 - t_4 ,
3264 \end{array}
3265 \end{equation}
3266 \begin{equation}
3267 \begin{array}{ll}
3268 x_2 = t_7 - t_5, & x_3 = - t_7 - t_5,
3269 \end{array}
3270 \end{equation}
3271 \begin{equation}
3272 \begin{array}{ll}
3273 x_4 = -t_6 - t_4.
3274 \end{array}
3275 \end{equation}
3276
3277 \section{Conclusions}
3278 %
3279 We have described the basic algorithms for one-dimensional radix-2 and
3280 mixed-radix FFTs. It would be nice to have a pedagogical explanation
3281 of the split-radix FFT algorithm, which is faster than the simple
3282 radix-2 algorithm we used. We could also have a whole chapter on
3283 multidimensional FFTs.
3284 %
3285 %\section{Multidimensional FFTs}
3286 %\section{Testing FFTs, Numerical Analysis}
3287
3288 %\nocite{*}
3289 \bibliographystyle{unsrt}
3290 \bibliography{fftalgorithms}
3291
3292 \end{document}
3293
3294
3295
3296