This is a very simple FFT. I am wondering what I can do to make it faster and more memory-efficient from the programming side (better data types, and maybe some tricks like unrolling loops or using the preprocessor, if that is useful here), rather than by using a more efficient mathematical algorithm. Obviously I would appreciate advice on best practices as well.
#include <stdio.h>
#include <vector>
#include <iostream>
#include <complex>
#include <cmath>
#include <algorithm>
#define N 1048576
#define PI 3.14159265358979323846
/*
Creating the table of all N'th roots of unity.
We use the notation omega_k = e^(-2 pi i k / N).
*/
template<typename U>
std::vector< std::complex<U> > rootsOfUnityCalculator() {
    std::vector< std::complex<U> > table;
    for (size_t k = 0; k < N; k++) {
        std::complex<U> kthRootOfUnity(std::cos(-2.0 * PI * k / N), std::sin(-2.0 * PI * k / N));
        table.push_back(kthRootOfUnity);
    }
    return table;
}
/*
Fast Fourier transform, T is the precision level, so float or double.
table is a look up table of the roots of unity. Overwrites the input.
For now only works for N a power of 2.
*/
template<typename T>
void FFT(std::complex<T>* input, const std::vector< std::complex<T> >& table, size_t n) {
    if (n % 2 == 0) {
        // Split up the input in even and odd components
        std::complex<T>* evenComponents = new std::complex<T>[n/2];
        std::complex<T>* oddComponents = new std::complex<T>[n/2];
        for (size_t k = 0; k < n/2; k++) {
            evenComponents[k] = input[2 * k];
            oddComponents[k] = input[2 * k + 1];
        }
        // Transform the even and odd input
        FFT(evenComponents, table, n/2);
        FFT(oddComponents, table, n/2);
        // Use the algorithm from Danielson and Lanczos
        for (size_t k = 0; k < n/2; k++) {
            std::complex<T> plusMinus = table[N / n * k] * oddComponents[k]; // omega_n^k = (omega_N^(N/n))^k = omega_N^(Nk/n)
            input[k] = evenComponents[k] + plusMinus;
            input[k + n/2] = evenComponents[k] - plusMinus;
        }
        delete[] evenComponents;
        delete[] oddComponents;
    } else {
        // The Fourier transform on one element does not do anything, so
        // nothing needed here.
    }
}
int main() {
    std::complex<double>* input = new std::complex<double>[N];
    for (size_t k = 0; k < N; k++) {
        input[k] = k;
    }
    const std::vector< std::complex<double> > table = rootsOfUnityCalculator<double>();
    // Overwrites the input with its Fourier transform
    FFT<double>(input, table, N);
    delete[] input;
    return 0;
}
Comments:
- "You might also like the jjj.de/fxt/#fxtbook" – user541686, Mar 2, 2021
- "'What I can do to make this faster and more memory efficient from the programming side' -- Truly fast FFTs are several orders of magnitude faster than naive ones, and the techniques to get that performance are a better fit for a textbook than a StackExchange answer. Look at the Halide FFT or FFTW to see what goes into this." – Alex Reinking, Mar 3, 2021
3 Answers
Avoid excessive memory usage
At each step of the recursion, you allocate two arrays that together are as large as the input. That means that your algorithm uses \$\mathcal{O}(N \log N)\$ space. It should be possible to rewrite your code so you only use \$\mathcal{O}(N)\$ space without substantially changing the algorithm (see the sketch at the end of this answer). But this also brings me to:
In-place vs. out-of-place algorithms
While the function signature makes it looks like you are doing an in-place FFT transform, you are actually doing an out-of-place transform (using extra memory), and then copying the result back into the input array. However, there are also real in-place FFT algorithms that, apart from a few variables, do not need any extra memory, and by their nature overwrite the input array. This means they use only \$\mathcal{O}(1)\$ memory. The drawback of the in-place algorithms is that the transformed data is not in the order you expect; you might have to use a bit-reversed index to access the data. However, for some applications this doesn't matter at all, for example if you perform convolutions using FFT.
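For illustration, a rough sketch of what such an in-place transform can look like: an iterative radix-2, decimation-in-time version (the name fft_in_place is made up, and the twiddles are computed on the fly rather than looked up in a table). The initial bit-reversal permutation here puts the result in natural order; a decimation-in-frequency variant can drop that reordering and leave the output bit-reversed, which is the convolution-friendly case mentioned above.
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>
// Iterative radix-2 FFT, decimation in time. Overwrites `a` with its DFT.
// Requires a.size() to be a power of two. Uses O(1) extra memory.
template <typename T>
void fft_in_place(std::vector<std::complex<T>>& a)
{
    const std::size_t n = a.size();
    // Bit-reversal permutation: swap each element with the one at its
    // bit-reversed index (each pair is swapped only once).
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) {
            j ^= bit;
        }
        j ^= bit;
        if (i < j) {
            std::swap(a[i], a[j]);
        }
    }
    // Butterfly passes: merge transforms of length len/2 into length len.
    const T pi = std::acos(T(-1));
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const std::complex<T> w_len = std::polar(T(1), T(-2) * pi / T(len));
        for (std::size_t start = 0; start < n; start += len) {
            std::complex<T> w(1);
            for (std::size_t k = 0; k < len / 2; ++k) {
                const auto even = a[start + k];
                const auto odd  = a[start + k + len / 2] * w;
                a[start + k]           = even + odd;
                a[start + k + len / 2] = even - odd;
                w *= w_len;
            }
        }
    }
}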
If you do use an out-of-place algorithm then I suggest you make this explicit, and either return a std::vector with the results, or add a parameter to the function that tells it where to store the results.
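And on the memory point above: a rough sketch of the same recursive structure where every level shares one scratch buffer allocated up front, so the extra memory is a single N-element array rather than fresh allocations at each level. The helper name FFT_rec and the "ping-pong" argument order are only illustrative; the twiddle table is the one from the question.
#include <complex>
#include <cstddef>
#include <vector>
// Recursive helper: transforms `data` of length n, using `scratch` of the
// same length as working space. The two buffers swap roles as we recurse.
template<typename T>
void FFT_rec(std::complex<T>* data, std::complex<T>* scratch,
             const std::vector< std::complex<T> >& table, std::size_t n)
{
    if (n <= 1) {
        return;
    }
    // Deinterleave: even samples into the first half of scratch, odd into the second.
    for (std::size_t k = 0; k < n / 2; k++) {
        scratch[k]         = data[2 * k];
        scratch[k + n / 2] = data[2 * k + 1];
    }
    // Recurse with the buffer roles swapped; `data` may be clobbered as
    // working space because the combine step below overwrites it anyway.
    FFT_rec(scratch,         data,         table, n / 2);
    FFT_rec(scratch + n / 2, data + n / 2, table, n / 2);
    // Danielson–Lanczos combine, writing the result back into `data`.
    for (std::size_t k = 0; k < n / 2; k++) {
        const std::complex<T> plusMinus = table[table.size() / n * k] * scratch[k + n / 2];
        data[k]         = scratch[k] + plusMinus;
        data[k + n / 2] = scratch[k] - plusMinus;
    }
}
template<typename T>
void FFT(std::complex<T>* input, const std::vector< std::complex<T> >& table, std::size_t n)
{
    std::vector< std::complex<T> > scratch(n);   // the only allocation
    FFT_rec(input, scratch.data(), table, n);
}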
Comments:
- "It's ironic that the in-place algorithms may leave the output in an order you do not expect -- i.e. not in its place!" – JosephDoggie, Mar 3, 2021
Prefer C++ headers (such as <cstdio>) to the C compatibility headers (<stdio.h>). The C++ headers define identifiers in the std namespace, where we want them.
It seems that this header isn't even used here, so we can omit it completely.
std::size_t is misspelt throughout as size_t. This commonly happens when writing on a platform that declares it in the global namespace as well as in std, but that's not a portable assumption.
We only use PI as a double, so we can declare it as a constant rather than as a macro:
constexpr double pi = 3.14159265358979323846;
Or to make it more obviously correct:
const double pi = std::acos(-1.0);
We can avoid naked new[] and delete[] using std::vector. If we use std::vector::reserve before adding any elements, there should be no performance impact, only improved code safety. Using a vector saves having to pass the array size explicitly, too.
A helpful technique to determine whether a number is a power of two is that bitwise-and of the number and its arithmetic predecessor is zero for powers of two:
constexpr bool is_power_of_two(unsigned n)
{
    return n && ((n & (n - 1)) == 0);
}
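Being constexpr, the helper can even be checked at compile time; a small usage example (the values are arbitrary):
static_assert(is_power_of_two(1024u));
static_assert(!is_power_of_two(1000u) && !is_power_of_two(0u));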
A single value of N is probably not very useful in general. We could make it a template parameter, or a simple function argument, to make our code useful for other sizes of input.
Modified version
I split FFT() into a simple public function that ensures arguments are valid, and the internal recursive implementation. I've used a requires clause to constrain the complex type we accept (integer types aren't useful here), but that can be omitted for plain C++17.
#include <cmath>
#include <complex>
#include <concepts>
#include <cstddef>
#include <stdexcept>
#include <vector>
namespace internal
{
    /*
    Creating the table of all N'th roots of unity.
    We use the notation ω_k = e^(-2πik/n).
    */
    template<typename U>
    constexpr auto rootsOfUnity(std::size_t n)
    {
        const auto pi = std::acos(U{-1.0});
        std::vector<std::complex<U>> table;
        table.reserve(n);
        for (std::size_t i = 0; i < n; ++i) {
            table.emplace_back(std::cos(-2.0 * pi * i / n), std::sin(-2.0 * pi * i / n));
        }
        return table;
    }
    template<typename V> // V is a vector of complex
    void FFT(V& input, const V& table, std::size_t total_size)
    {
        if (input.size() <= 1) {
            // The Fourier transform on one element does not do
            // anything, so nothing needed here.
            return;
        }
        const auto n2 = input.size() / 2;
        // Split up the input in even and odd components
        V evenComponents; evenComponents.reserve(n2);
        V oddComponents; oddComponents.reserve(n2);
        for (std::size_t i = 0; i < input.size(); ) {
            evenComponents.push_back(input[i++]);
            oddComponents.push_back(input[i++]);
        }
        // Transform the even and odd input
        FFT(evenComponents, table, total_size);
        FFT(oddComponents, table, total_size);
        // Use the algorithm from Danielson and Lanczos
        for (std::size_t i = 0; i < n2; ++i) {
            // ω_n^i = (ω_N^(N/n))^i = ω_N^(Ni/n), where n = input.size()
            auto const plusMinus = table[total_size / input.size() * i] * oddComponents[i];
            input[i] = evenComponents[i] + plusMinus;
            input[i + n2] = evenComponents[i] - plusMinus;
        }
    }
}
/*
Fast Fourier transform
T is the precision level: float, double, long double.
Overwrites the input.
Only works for input sizes that are a power of 2.
*/
template<typename T>
    requires std::floating_point<T>
void FFT(std::vector<std::complex<T>>& input)
{
    if (input.size() == 0) {
        throw std::invalid_argument("input is empty");
    }
    if (input.size() & (input.size() - 1)) {
        throw std::invalid_argument("input length is not a power of two");
    }
    auto const table = internal::rootsOfUnity<T>(input.size());
    internal::FFT(input, table, input.size());
}
// Test program
#include <iostream>
int main()
{
    static constexpr std::size_t n = 1048576;
    std::vector<std::complex<double>> input;
    for (std::size_t i = 0; i < n; ++i) {
        input.emplace_back(i);
    }
    // Overwrite input with its Fourier transform
    FFT<double>(input);
    // use the result, to avoid optimising out
    std::cout << input[0] << '\n';
}
Comments:
- "Could you add some speed comparisons?" – kelalaka, Mar 1, 2021
- "My preference for pi has been to initialise it using std::acos(-1.0) since -1.0 is exactly representable and obviously correct compared to a long constant that could have a mistake hiding at a digit I don't have memorised." – Flexo, Mar 1, 2021
- "@kelalaka, the speed is identical to the original, to within 1 standard deviation of my measurements. I was focusing on robustness; see other answer for performance improvements." – Toby Speight, Mar 1, 2021
- "@TobySpeight Is there a reason for throw new <exception>? Would it not simplify exception handling to omit the new?" – radioflash, Mar 2, 2021
- "C++20 finally has std::numbers::pi so we don't need to define it ourselves." – 0x5453, Mar 3, 2021
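For reference, a two-line sketch of what that last suggestion looks like in C++20:
#include <numbers>
constexpr double pi = std::numbers::pi;   // no macro and no acos() call needed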
Use unique_ptr instead of new[]/delete[]
std::complex<double>* input = new std::complex<double>[N];
...
delete[] input;
These lines can be replaced by appropriate use of std::unique_ptr:
auto input_storage = std::make_unique<std::complex<double>[]>(N); // #include <memory>
std::complex<double>* input = input_storage.get();
The memory is automatically reclaimed once input_storage goes out of scope.
I originally suggested unique_ptr here because it's the moral equivalent of your code (has identical run-time behavior), but there are other options that might be more suitable for your situation (compared side by side in the sketch after this list):
- std::vector<T> (referenced in another answer):
  - More flexible than unique_ptr<T[]>, since you can resize and push_back.
  - Keeps track of its own length and capacity, unlike unique_ptr<T[]>, where you need to store the length as a separate variable. Hence, less error-prone, but more memory footprint (not likely to matter unless you have lots of vectors).
- std::array<T, N>: This is similar to a C-style array T[N], but offers an API more consistent with other C++ containers. It shares some of the same limitations as C-style arrays, though:
  - A crucial limitation is that the length has to be known at compile time (constexpr or static const), whereas both unique_ptr<T[]> and vector accept lengths at run time.
  - A large array<T, N> allocated on the stack can overflow the stack, so you generally want to keep N rather small, or use a smart pointer to std::array.
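To make the trade-offs concrete, here is a minimal sketch declaring the same 1048576-element buffer with each of the three options (variable and function names are just for illustration):
#include <array>
#include <complex>
#include <cstddef>
#include <memory>
#include <vector>
constexpr std::size_t n = 1048576;
void storage_options()
{
    // unique_ptr<T[]>: heap allocation, freed automatically, but the length
    // must be remembered separately (here, the constant n).
    auto owned = std::make_unique<std::complex<double>[]>(n);
    // vector<T>: heap allocation that knows its own size and can grow;
    // slightly larger footprint (pointer + size + capacity).
    std::vector<std::complex<double>> vec(n);
    // array<T, N>: size fixed at compile time; this one is far too big for
    // the stack, so put it behind a smart pointer.
    auto arr = std::make_unique<std::array<std::complex<double>, n>>();
}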