2D FFT: A 512x512 complex matrix is initialized with a point source, then decomposed and distributed to the processors in the partition (by rows for C; by columns for Fortran). Each processor performs one-dimensional FFTs on its portion of the matrix, which is stored locally. Each processor then transposes its portion and performs an "all to all" exchange with the other processors in the partition, which leaves the intermediate matrix partitioned by columns for C and by rows for Fortran. Each processor then performs one-dimensional FFTs on its new portion of the matrix. Finally, the columns/rows of the matrix are gathered back at the destination processor, and timing and Mflop results are displayed. The C programs perform an additional test for correctness.
Note: A straightforward, unsophisticated 1D FFT kernel is used. It is sufficient to convey the general idea, but be aware that better 1D FFTs are available on many systems.