Notes on FFT and NTT

Translated by ChatGPT.

The following is my intuitive understanding of FFT. It may not be rigorous; corrections are welcome if there are mistakes.

FFT

The algorithm discussed below is the Cooley-Tukey FFT, which is more widely used in competitive programming.

Prerequisite: complex numbers, and an understanding of Euler’s formula.

Polynomial multiplication

For degree-nn polynomials

f(x)=i=0nfixi=f0+f1x+f2x2++gnxng(x)=i=0ngixi=g0+g1x+g2x2++gnxn\begin{aligned} f(x) = \sum_{i=0}^n f_ix^i &= f_0 + f_1 x + f_2x^2 + \cdots + g_nx^n \\ g(x) = \sum_{i=0}^n g_ix^i &= g_0 + g_1 x + g_2x^2 + \cdots + g_nx^n \end{aligned}

Their convolution is F(x)=f(x)g(x)=(fg)(x)=k=02nckxkF(x) = f(x) \ast g(x) = (f \ast g)(x) = \sum\limits_{k=0}^{2n} c_kx^k, where

ck=i+j=kfigjc_k = \sum_{i+j=k}f_ig_j

So a naive polynomial convolution needs n2n^2 coefficient multiplications, and we need to optimize it.

Point-value representation

A degree-nn polynomial f(x)f(x) can be determined by n+1n+1 coefficients, and it can also be determined by n+1n+1 coordinates (point values). In other words, a degree-nn polynomial can be viewed as an (n+1)(n+1)-dimensional vector.

Consider choosing 2n+12n+1 coordinates to determine f(x)f(x) and g(x)g(x). Then F(x)F(x) can be obtained simply by doing 2n+12n+1 multiplications:

(xk,F(xk))=(xk,f(xk)g(xk))(x_k,F(x_k)) = \left(x_k, f(x_k)g(x_k)\right)

Now we have a new idea: first convert from coefficient representation to point-value representation, do the multiplication, and then convert back.

DFT

How do we convert a polynomial into point values? We have the discrete Fourier transform.

The nn solutions to the equation xn=1x^n = 1 are called the nn-th roots of unity ζn\zeta_n. For a given polynomial f(x)=k=0n1fkxkf(x) = \sum\limits_{k=0}^{n-1} f_kx^k and a root of unity ζn\zeta_n, define the vector

DFTζn(f)=(f(1),f(ζn1),,f(ζnn1))\operatorname{DFT}_{\zeta_n}(f) =( f(1), f(\zeta_n^1), \cdots, f(\zeta_n^{n-1}) )

as the discrete Fourier transform of ff.

DFT has an inverse transform (IDFT), which converts point values back into coefficients. It is still a vector-to-vector transform.

IDFT has a key property:

(DFTζ)1=1n(DFTζ1)(1)(\operatorname{DFT}_{\zeta})^{-1} = \frac{1}{n} (\operatorname{DFT}_{{\zeta}^{-1}}) \tag{1}

We will prove it later. For now, we can handle DFT and IDFT uniformly.

For convenience, in the following we write DFTζn\operatorname{DFT}_{\zeta_n} simply as Fn\mathcal{F}_n.

Primitive root of unity

At this point, the complexity of computing DFT is still O(n2)O(n^2). The key step of FFT is to choose special points to speed up the computation.

One special root of unity is denoted ζn=e2πin\zeta_n = e^{\frac{2 \pi i}{n}}, called a primitive root of unity. By Euler’s formula,

ζn=e2πin=cos(2πn)+isin(2πn)\zeta_n = e^{\tfrac{2 \pi i}{n}} = \cos \left(\frac{2\pi}{n}\right) + i \sin \left(\frac{2\pi}{n}\right)

That is, ζn\zeta_n is a point on the unit circle, and all nn roots of unity

xk=ζnk=ek2πin=cos(2πkn)+isin(2πkn)x_k = \zeta_n^k = e^{k\tfrac{2 \pi i}{n}} = \cos \left(\frac{2\pi k}{n}\right) + i \sin \left(\frac{2\pi k}{n}\right)

correspond exactly to the nn equally spaced points on the unit circle. Therefore, by Euler’s formula, multiplication among roots of unity is just rotating around the unit circle.

It is not hard to verify several properties of the primitive root ζn\zeta_n using Euler’s formula:

  • ζ2n2k=ζnk\zeta_{2n}^{2k} = \zeta_n^k.
  • ζ2nn+k=ζ2nk\zeta_{2n}^{n+k} = -\zeta_{2n}^k.

Divide and conquer

Using the special properties of primitive roots of unity, we can compute DFT by divide and conquer. For example, for a degree-77 polynomial:

f(x)=f0+f1x+f2x2+f3x3+f4x4+f5x5+f6x6+f7x7=(f0+f2x2+f4x4+f6x6)+x(f1+f3x2+f5x4+f7x6)\begin{aligned} f(x) &= f_0 + f_1x + f_2x^2 + f_3 x^3 + f_4 x^4 + f_5 x^5 + f_6 x^6 + f_7 x^7 \\ &= (f_0 + f_2x^2 + f_4x^4 + f_6x^6) + x(f_1 + f_3x^2 + f_5x^4 + f_7x^6) \end{aligned}

Separate odd and even terms:

f[0](x)=f0+f2x+f4x2+f6x3f[1](x)=f1+f3x+f5x2+f7x3\begin{aligned} f^{[0]}(x) &= f_0 + f_2x + f_4x^2 + f_6x^3 \\ f^{[1]}(x) &= f_1 + f_3x + f_5x^2 + f_7x^3 \end{aligned}

Then the original function can be written as

f(x)=f[0](x2)+xf[1](x2)f(x) = f^{[0]}(x^2) + xf^{[1]}(x^2)

In general, for a polynomial f(x)f(x) of degree less than nn, its value at the root of unity x=ζnkx = \zeta_n^k is

f(ζnk)=f[0](ζnkζnk)+ζnkf[1](ζnkζnk)=f[0](ζn2k)+ζnkf[1](ζn2k)=f[0](ζn/2k)+ζnkf[1](ζn/2k)\begin{aligned} f(\zeta_n^k) &= f^{[0]}(\zeta_n^k \cdot \zeta_n^k) + \zeta_n^kf^{[1]}(\zeta_n^k \cdot \zeta_n^k) \\ &= f^{[0]}(\zeta_n^{2k}) + \zeta_n^kf^{[1]}(\zeta_n^{2k}) \\ &= f^{[0]}(\zeta_{n/2}^{k}) + \zeta_n^kf^{[1]}(\zeta_{n/2}^{k}) \end{aligned}

Similarly,

f(ζnk+n/2)=f[0](ζn2k+n)+ζnk+n/2f[1](ζn2k+n)=f[0](ζn/2k)ζnkf[1](ζn/2k)\begin{aligned} f(\zeta_n^{k+n/2}) &= f^{[0]}(\zeta_n^{2k+n}) + \zeta_n^{k+n/2}f^{[1]}(\zeta_n^{2k+n}) \\ &= f^{[0]}(\zeta_{n/2}^{k}) - \zeta_n^{k}f^{[1]}(\zeta_{n/2}^{k}) \end{aligned}

In DFT form:

Fn(f)[j]=Fn/2(f[0])[j]+ζnjFn/2(f[1])[j]Fn(f)[j+n/2]=Fn/2(f[0])[j]ζnjFn/2(f[1])[j](2)\begin{aligned} \mathcal{F}_n(f)[j] &= \mathcal{F}_{n/2}(f^{[0]})[j] + \zeta_n^j \mathcal{F}_{n/2}(f^{[1]})[j] \\ \mathcal{F}_n(f)[j + n/2] &= \mathcal{F}_{n/2}(f^{[0]})[j] - \zeta_n^j\mathcal{F}_{n/2}(f^{[1]})[j] \end{aligned} \tag{2}

Therefore, we need to pad the number of polynomial coefficients up to 2n2^n, which makes divide and conquer convenient.

Now we can write the recursive FFT.

void fft(int n, img *f, int op) {
    static img tmp[1 << 18];
    if (n == 1)
        return;
    for (int i = 0; i < n; i++)
        tmp[i] = f[i];
    for (int i = 0; i < n; i++) {  // put even indices on the left, odd indices on the right
        if (i & 1)
            f[n / 2 + i / 2] = tmp[i];
        else
            f[i / 2] = tmp[i];
    }
    img *g = f, *h = f + n / 2;
    fft(n / 2, g, op), fft(n / 2, h, op);
    img w0 = {cos(2 * PI / n), sin(2 * PI * op / n)}, w = {1, 0};
    for (int k = 0; k < n / 2; k++) {
        tmp[k] = g[k] + w * h[k];
        tmp[k + n / 2] = g[k] - w * h[k];
        w = w * w0;
    }
    for (int i = 0; i < n; i++)
        f[i] = tmp[i];
}

Butterfly transform

Recursive divide and conquer is never quite satisfactory. In the first few lines we are only doing recursive grouping, so we can consider doing it in one step.

Again take a degree-77 polynomial as an example:

  • Initial: {x0,x1,x2,x3,x4,x5,x6,x7}\{x^0,x^1,x^2,x^3,x^4,x^5,x^6,x^7\}
  • Once: {x0,x2,x4,x6},{x1,x3,x5,x7}\{x^0,x^2,x^4,x^6\},\{x^1,x^3,x^5,x^7\}
  • Twice: {x0,x4},{x2,x6},{x1,x5},{x3,x7}\{x^0,x^4\},\{x^2,x^6\},\{x^1,x^5\},\{x^3,x^7\}
  • End: {x0},{x4},{x2},{x6},{x1},{x5},{x3},{x7}\{x^0\},\{x^4\},\{x^2\},\{x^6\},\{x^1\},\{x^5\},\{x^3\},\{x^7\}

Writing them in binary, we find that the binary representation at the end is exactly the reverse of the beginning.

Initial01234567
Initial(2)000001010011100101110111
End(2)000100010110001101011111
End04261537

This transform is called the butterfly transform, also known as the bit-reversal permutation.

We can preprocess the permutation array in O(n)O(n). Let R(x) be the transformed result of xx; then R(x >> 1) is already known. We just shift R(x >> 1) right by one and fill in the highest bit. The code is:

void pre_rev(int lim) {
    int k = std::__lg(lim);
    rev.resize(lim);
    for (int i = 0; i < lim; ++i) {
        rev[i] = rev[i >> 1] >> 1;
        if (i & 1)
            rev[i] |= lim >> 1;
        // or write it in one line as
        // rev[i] = (rev[i >> 1] >> 1) | ((i & 1) << (k - 1));
    }
}

Now we can write the iterative FFT.

void fft(img *f, int n, int op) { // DIT
    for (int i = 0; i < n; ++i)
        if (i < rev[i])
            swap(f[i], f[rev[i]]);
    for (int l = 1; l <= n / 2; l <<= 1) {
        img w0 = {cos(PI / l), sin(PI * op / l)};
        for (int i = 0; i < n; i += l * 2) {
            img w = {1, 0};
            for (int j = 0; j < l; j++) {
                img x = f[i + j], y = w * f[i + j + l];
                f[i + j] = x + y, f[i + j + l] = x - y;
                w = w * w0;
            }
        }
    }
    if (op == -1)
        for (int i = 0; i < n; i++)
            f[i] = f[i] / n;
}

NTT

Prerequisite: elementary number theory (divisibility and congruence).

Using double to implement integer multiplication is inelegant, with issues in both precision and speed. In fact, we can do everything over integers.

Primitive root

The two essential properties of the primitive root of unity ζn\zeta_n that we use are:

  • ζnn=1\zeta_{n}^{n} = 1.
  • ζ2nn=1\zeta_{2n}^{n} = -1.

This brings to mind the residue field Zp\mathbb{Z}_p modulo pp: its elements are {0,1,,p1}\{0,1,\cdots,p-1\}, and all operations on it are modulo pp. By Fermat’s little theorem,

aφ(p)=ap11a^{\varphi(p)} = a^{p-1} \equiv 1

In other words, the p1p-1 positive integers are all solutions of the congruence equation xp11x^{p-1} \equiv 1.

This has a form very similar to roots of unity. Intuitively, Zp\mathbb{Z}_p should also contain special numbers similar to primitive roots of unity. Below we discuss this over Zp\mathbb{Z}_p and try to prove such a number exists.

Define the order δp(a)\delta_p(a) of a positive integer aZpa \in \mathbb{Z}_p as the smallest rr such that ar1a^r \equiv 1. By Fermat’s little theorem, aφ(p)1a^{\varphi(p)} \equiv 1, so the order of aa must exist and satisfies δp(a)φ(p)\delta_p(a) \mid \varphi(p). It can be proven that

a,a2,aδp(a)(3)a,a^2,\cdots a^{\delta_p(a)} \tag{3}

have pairwise distinct residues modulo pp. By Lagrange’s theorem, xδp(a)1x^{\delta_p(a)} \equiv 1 has at most δp(a)\delta_p(a) solutions, exactly the ones shown in (3)(3).

From divisibility, only iδp(a)i \bot \delta_p(a) gives δp(ai)=δp(a)\delta_p(a^i) = \delta_p(a). That is, aa always brings along

i=1δp(a)[gcd(i,δp(a))=1]=φ(δp(a))\sum_{i=1}^{\delta_p(a)} [\gcd(i, \delta_p(a)) = 1] = \varphi(\delta_p(a))

elements with the same order. Therefore, there are exactly φ(δp(a))\varphi(\delta_p(a)) numbers of order δp(a)\delta_p(a).

Since every positive integer has a uniquely determined order, suppose that for every dφ(p)d \mid \varphi(p), there exist φ(d)\varphi(d) corresponding integers of order dd. Counting the integers gives

dφ(p)φ(d)=φ(p)=p1\sum_{d \mid \varphi(p)} \varphi(d) = \varphi(p) = p - 1

which is exactly the number of all positive integers in Zp\mathbb{Z}_p. Therefore the assumption holds, and there exists an aa such that δp(a)=p1\delta_p(a) = p-1.

We call this aa a primitive root modulo pp, usually denoted by gg.

Number Theoretic Transform

Extract as many factors of 22 from p1p - 1 as possible:

p=Nq+1,N=2mp = N q + 1, N = 2^m

Let gg be a primitive root of Zp\mathbb{Z}_p, and view gNgqg_N \equiv g^q as the analogue of ζn\zeta_n. Using quadratic residues, it is not hard to obtain gNN1g_N^N \equiv 1 and gNN/21g_N^{N/2} \equiv -1.

Common examples are

p=1004535809=479×221+1,g=3p=998244353=7×17×223+1,g=3\begin{aligned} p = 1004535809 = 479 \times 2^{21} + 1&, g = 3 \\ p = 998244353 = 7 \times 17 \times 2^{23} + 1&, g = 3 \end{aligned}

Similarly, we can write the program:

void ntt(ll *f, int n, int type) {
    for (int i = 0; i < n; ++i)
        if (i < rev[i])
            swap(f[i], f[rev[i]]);
    for (int h = 2; h < n; h <<= 1) {
        ll tg = type == 1 ? 3 : g_inv;
        ll gn = qpow(tg, (P - 1) / h);
        for (int j = 0; j < n; j += h) {
            ll g = 1;
            for (int k = j; k < j + h / 2; k++) {
                ll f1 = f[k], f2 = g * f[k + h / 2] % P;
                f[k] = (f1 + f2) % P;
                f[k + h / 2] = (f1 - f2 + P) % P;
                g = g * gn % P;
            }
        }
    }
    ll iv_n = qpow(n);
    if (type == -1)
        for (int i = 0; i < n; i++)
            f[i] = f[i] * iv_n % P;
}

At this point, you have learned FFT. Next we will study FFT more deeply from a mathematical perspective and fill in the theoretical foundation.

Linear transform

DFT is a linear transform. In other words, it can be written as matrix multiplication:

[f(ζn0)f(ζn1)f(ζn2)f(ζnn1)]=[11111ζn1ζn2ζnn11ζn2ζn4ζn2(n1)1ζnn1ζn2(n1)ζn(n1)2][f0f1f2fn1]\begin{bmatrix} f(\zeta_n^0) \\ f(\zeta_n^1) \\ f(\zeta_n^2) \\ \vdots \\ f(\zeta_n^{n-1}) \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & \zeta_n^1 & \zeta_n^2 & \cdots & \zeta_n^{n-1} \\ 1 & \zeta_n^2 & \zeta_n^4 & \cdots & \zeta_n^{2(n-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \zeta_n^{n-1} & \zeta_n^{2(n-1)} & \cdots & \zeta_n^{(n-1)^2} \end{bmatrix} \begin{bmatrix} f_0 \\ f_1 \\ f_2 \\ \vdots \\ f_{n-1} \end{bmatrix}

Denote the middle nn-order Vandermonde matrix by V(ζn)=(ζnij)V(\zeta_n) = (\zeta_n^{ij}).

Directly computing the inverse of V(ζn)V(\zeta_n) is hard, but it is easy to verify that the following is a diagonal matrix:

V(ζn)V(ζn1)=(n[i=j])=nInV(\zeta_n) V(\zeta_n^{-1}) = (n[i = j]) = n I_n

Thus the matrix corresponding to IDFT is V1(ζn)=1nV(ζn1)V^{-1}(\zeta_n) = \frac{1}{n} V(\zeta_n^{-1}), proving equation (1)(1).

Removing REV

Actually, the FFT and IFFT implemented above are not dual to each other; it is just the convolution theorem that makes IFFT exactly the inverse operation of FFT. More specifically, we implemented two DITs, so butterfly permutation is needed before computation.

Wikipedia

The core of the computation lies in equation (2)(2), which can be written in matrix form:

[O1O2]=[1ζnj1ζnj][I1I2]\begin{bmatrix} O_1 \\ O_2 \end{bmatrix} = \begin{bmatrix} 1 & \zeta_n^{-j} \\ 1 & -\zeta_n^{-j} \end{bmatrix} \begin{bmatrix} I_1 \\ I_2 \end{bmatrix}

Taking the inverse of the matrix gives

[I1I2]=12[11ζnjζnj][O1O2](4)\begin{bmatrix} I_1 \\ I_2 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} 1 & 1 \\ \zeta_n^{j} & -\zeta_n^{j} \end{bmatrix} \begin{bmatrix} O_1 \\ O_2 \end{bmatrix} \tag{4}

This gives DIF. Similarly, we can implement two DIFs as FFT; in that case the butterfly permutation is performed after computation.

void fft(img *f, int n, int op) { // DIF
    for (int l = n / 2; l >= 1; l >>= 1) {
        img w0 = {cos(PI / l), sin(PI * op / l)};
        for (int i = 0; i < n; i += l * 2) {
            img w = {1, 0};
            for (int j = 0; j < l; j++) {
                img x = f[i + j], y = f[i + j + l];
                f[i + j] = x + y, f[i + j + l] = w * (x - y);
                w = w * w0;
            }
        }
    }
    for (int i = 0; i < n; ++i)
        if (i < rev[i])
            swap(f[i], f[rev[i]]);
    if (op == -1)
        for (int i = 0; i < n; i++)
            f[i] = f[i] / n;
}

It is easy to see that if we use DIF as FFT and DIT as IFFT, no butterfly permutation is needed.

void fft(img *f, int n) {
    for (int l = n / 2; l >= 1; l >>= 1) {
        img w0 = {cos(PI / l), sin(PI / l)};
        for (int i = 0; i < n; i += l * 2) {
            img w = {1, 0};
            for (int j = 0; j < l; j++) {
                img x = f[i + j], y = f[i + j + l];
                f[i + j] = x + y, f[i + j + l] = w * (x - y);
                w = w * w0;
            }
        }
    }
}

void ifft(img *f, int n) {
    for (int l = 1; l <= n / 2; l <<= 1) {
        img w0 = img{cos(PI / l), sin(PI / l)}.conj();
        for (int i = 0; i < n; i += l * 2) {
            img w = {1, 0};
            for (int j = 0; j < l; j++) {
                img x = f[i + j], y = w * f[i + j + l];
                f[i + j] = x + y, f[i + j + l] = x - y;
                w = w * w0;
            }
        }
    }
    for (int i = 0; i < n; i++)
        f[i] = f[i] / n;
}

This is Twisted FFT.

Another interpretation

Notice that

f(x0)=fmod(xx0)f(x_0) = f \bmod (x - x_0)

We can start from this and reexamine the algorithm from the perspective of modulo. Suppose ff can be decomposed as

f=(xnr)(xn+r)f0+(xnr)f1+(xn+r)f2+f3f = (x^n - r)(x^n + r)f_{0} + (x^n - r)f_{1} + (x^n + r)f_{2} + f_3

Let

O1=fmod(xn+r)=2rf1+f3O2=fmod(xnr)=2rf2+f3\begin{aligned} O_1 &= f \bmod (x^n + r) = -2r f_1 + f_3\\ O_2 &= f \bmod (x^n - r) = 2r f_2 + f_3 \end{aligned}

Then

fmod(x2nr2)=O2O12rxn+O2+O12=I1xn+I2f \bmod (x^{2n} - r^2) = \frac{O_2-O_1}{2r}x^n + \frac{O_2 + O_1}{2} = I_1 x^n + I_2

Notice that in the code we do not compute O1O_1 directly. Instead, we multiply the jj-th term by ζ2nj\zeta_{2n}^j, which is equivalent to computing f(ζ2nx)f(\zeta_{2n}x).

We can find that

f(ζ2nx)mod(xn1)=f(ζ2nx)mod((ζ2nx)n1)=f(x)mod(xn+1)f(\zeta_{2n}x) \bmod (x^n-1) = f(\zeta_{2n}x) \bmod ((\zeta_{2n} x)^n - 1) = f(x) \bmod (x^n + 1)

This diagram is not easy to understand. The Original FFT diagram below is easier, but the algorithm widely used today is Twisted FFT.

From the diagram, the FFT process is to push the polynomial from the root to the leaves, obtaining values at all roots of unity. After doing the operation, it pushes from the leaves back to the root.

Original FFT

Of course, we can directly divide and conquer; this is Original FFT.

Due to space, this post will not expand on it.

Preprocessing roots of unity

Recomputing the roots of unity every time is wasteful. We can preprocess them and then use them during computation.

vector<img> ROOT;

void init(int n) {
    static int lim = (ROOT = {{1, 0}}, 1);
    if (lim >= n)
        return;
    ROOT.resize(n);
    for (int l = lim; l < n; l *= 2) {
        img w = {cos(PI / l / 2), sin(PI / l / 2)};
        ROOT[l] = w;
        for (int i = 1; i < l; ++i)
            ROOT[i + l] = ROOT[i] * w;
    }
    lim = n;
}

Other applications

FFT is essentially a tool for quickly computing convolution. In this article, I want to focus more on understanding the computation process of FFT.

FFT has many other applications, such as fast addition and wildcard pattern matching. Later, after solving more problems, I may write another post about them. For now, I have not accumulated enough.

Refence

  1. OI-Wiki Fast Fourier Transform
  2. FFT Introductory Notes - hly1024