Fourth-Order Newton Iteration for Power Series: 9.33E inv and 10.33E invsqrt

Translated by ChatGPT.

Draft…

After translating David Harvey’s two papers, namely Block optimizations for Newton iteration on power series, I have always been curious about the implementation of inv 9E. It is a third-order Newton iteration, so its complexity is a noticeable step below other second-order iterations, and it also remains useful in the block-method setting.

So naturally I had a thought: can there be a fourth-order Newton iteration?

After some thinking, I did manage to derive a 283\frac{28}{3} E inverse algorithm and a 313\frac{31}{3} E inverse square root algorithm.

Let me describe the algorithm below. The notation is the same as in the previous post: E(n)\mathsf{E}(n) denotes one DFT/IDFT of length nn.

Inverse

Assume we already have the first nn terms of the inverse of ff. Let

g0=f1modxng_0 = f^{-1} \bmod {x^n}

To extend it to 4n4n terms, suppose gg has the form

g=g0(1+c0xn+c1x2n+c2x3n)+O(x4n)g = g_0(1 + c_0 x^n + c_1x^{2n} + c_2x^{3n}) + O(x^{4n})

To solve for the polynomials c0,c1,c2c_0,c_1,c_2, we can compute the residual of fg0f g_0:

fg0=1+a0xn+a1x2n+a2x3n+O(x4n)fg_0 = 1 + a_0 x^n + a_1 x^{2n} + a_2 x^{3n} + O(x^{4n})

This gives the equation

fg=(1+a0xn+a1x2n+a2x3n)(1+c0xn+c1x2n+c2x3n)+O(x4n)=1+O(x4n)fg = (1 + a_0 x^n + a_1 x^{2n} + a_2 x^{3n})(1 + c_0 x^n + c_1x^{2n} + c_2x^{3n}) + O(x^{4n})= 1 + O(x^{4n})

Expanding and comparing coefficients gives

{a0+c0=0a1+c1+a0c0=0a2+a1c0+a0c1+c2=0{c0=a0c1=a02a1c2=a2+2a0a1a03\begin{cases} a_0 + c_0 = 0 \\ a_1+c_1+ a_0c_0 = 0 \\ a_2+a_1c_0+a_0c_1+c_2 = 0 \end{cases} \Rightarrow \begin{cases} c_0 = -a_0 \\ c_1 = a_0^2 - a_1 \\ c_2 = -a_2 + 2a_0a_1 - a_0^3 \end{cases}

One thing to notice is that each aia_i here is a polynomial block of length nn, so multiplication carries into the next block. We still need to reduce properly. Write

a02=q0+q1xna_0^2 = q_0 + q_1 x^n

Substituting this in gives exactly the formula we need to compute:

c0=a0,c1=q0a1,c2=q1a2+a0(2a1q0)modxn.\begin{aligned} \overline{c_0} &= -a_0,\\ \overline{c_1} &= q_0 - a_1,\\ \overline{c_2} &= q_1 - a_2 + a_0(2a_1 - q_0) \bmod x^n. \end{aligned}

The computation is also fairly clear:

  1. We know g0modxng_0 \bmod x^n.
  2. Compute fg0modx4nfg_0 \bmod x^{4n}, extract a0,a1,a2a_0,a_1,a_2, costing 3E(4n).
  3. Compute a02=q0+q1xna_0^2 = q_0 + q_1x^n, extract q0,q1q_0,q_1, costing 2E(2n).
  4. Compute a0(2a1q0)modxna_0(2a_1 - q_0) \bmod x^n, costing 2E(2n), reusing F2n(a0)\mathcal{F}_{2n}(a_0) from the previous step.
  5. At this point we obtain c=c0+c1xn+c2x2nc = \overline{c_0} + \overline{c_1} x^n + \overline{c_2}x^{2n}.
  6. Compute g0cg_0 c, costing 2E(4n), reusing F4n(g0)\mathcal{F}_{4n}(g_0) from the first step.

In total, the complexity is

T(n)=3E(4n)+2E(2n)+2E(2n)+2E(4n)=28E(n)T(n) = 3 \mathsf{E}(4n) + 2 \mathsf{E}(2n) + 2 \mathsf{E}(2n) + 2 \mathsf{E}(4n) = 28 \mathsf{E}(n)

Since each iteration sends n4nn \to 4n, the complexity is

T(4n)=T(n)+28E(n),T(n)=283E(n)9.33E(n)T(4n) = T (n) + 28 \mathsf{E}(n), \quad T(n) = \frac{28}{3}\mathsf{E}(n) \approx 9.33 \mathsf{E}(n)

Taylor Expansion

Actually, this equation can be solved more directly. Notice that

C=cxn=fg01=a0xn+a1x2n+a2x3n(modx4n)C = cx^n = f g_0 - 1 = a_0x^n +a_1x^{2n}+a_2x^{3n} \pmod {x^{4n}}

Now consider the Taylor expansion:

(1+C)1=1C+C2C3+O(x4n)=1a0xn+(a02a1)x2n+(2a0a1a03a2)x3n+O(x4n)\begin{aligned} (1+C)^{-1} &= 1 - C + C^2 - C^3 + O(x^{4n}) \\ &= 1 - a_0x^n + (a_0^2-a_1)x^{2n} + (2a_0a_1 - a_0^3-a_2) x^{3n} + O(x^{4n}) \end{aligned}

Then we get the result directly.

Inverse Square Root

The idea is the same. Assume we already have the first nn terms of the inverse square root of ff. Let

g0=f1/2modxng_0 = f^{-1/2} \bmod {x^n}

Similarly, set

C=cxn=fg021=a0xn+a1x2n+a2x3n(modx4n)C = cx^n = f g_0^2 - 1 = a_0x^n +a_1x^{2n}+a_2x^{3n} \pmod {x^{4n}}

Taylor expansion gives

(1+C)1/2=112C+38C2516C3+O(x4n)=112(a0xn+a1x2n+a2x3n)+38(a02x2n+2a0a1x3n)516a03x3n+O(x4n).\begin{aligned} (1+C)^{-1/2} &= 1 - \frac12 C + \frac38 C^2 - \frac{5}{16}C^3 + O(x^{4n}) \\ &= 1 - \frac12(a_0x^n+a_1x^{2n}+a_2x^{3n}) \\ &\quad + \frac38(a_0^2x^{2n}+2a_0a_1x^{3n}) - \frac{5}{16}a_0^3x^{3n} + O(x^{4n}) . \end{aligned}

As with inverse, each aia_i is a polynomial block, so the high half of a02a_0^2 carries into the next block. Again write

a02=q0+q1xn.a_0^2 = q_0 + q_1x^n.

Then the three blocks in the correction factor

g=g0(1+c0xn+c1x2n+c2x3n)+O(x4n)g = g_0(1+c_0x^n+c_1x^{2n}+c_2x^{3n}) + O(x^{4n})

are

c0=12a0,c1=12a1+38q0,c2=12a2+38q1+a0(34a1516q0)modxn.\begin{aligned} \overline{c_0} &= -\frac12 a_0,\\ \overline{c_1} &= -\frac12 a_1 + \frac38 q_0,\\ \overline{c_2} &= -\frac12 a_2 + \frac38 q_1 + a_0\left(\frac34a_1-\frac{5}{16}q_0\right) \bmod x^n. \end{aligned}

If we could compute the first three blocks of CC directly, we would be done. The problem is that the residual for inverse square root is fg021fg_0^2-1, where g02g_0^2 already has 2n2n terms and ff has 4n4n terms, so a direct ordinary convolution would need length 8n8n. We therefore need a cyclic-convolution trick.

Observe that ff times g02g_0^2 has at most 6 terms, and the first term is 11, so in fact there are only 5 blocks. Thus the 4n4n cyclic convolution can be written as

fg02mod(x4n1)=1+a3+(a0+a4)xn+a1x2n+a2x3fg_0^2 \bmod (x^{4n} - 1) = 1 + a_3 + (a_0+a_4)x^n + a_1 x^{2n} + a_2x^{3}

Next we need to extract a4a_4. One way is to compute one more point value:

fg02mod(xnξ8)=f(ηx)g02(ηx)mod(xn1)=1+ξ8a0+ξ82a1+ξ83a2+ξ84a3+ξ85a4\begin{aligned} fg_0^2 \bmod (x^{n} - \xi_8) &= f(\eta x)g_0^2(\eta x) \bmod (x^n - 1) \\ &= 1 + \xi_8 a_0 + \xi_8^2a_1 + \xi_8^3a_2 + \xi_8^4a_3 + \xi_8^5a_4 \end{aligned}

Here ξ8\xi_8 is an 8th root of unity, ξ84=1\xi_8^4 = -1, and η=ξ8n\eta = \xi_{8n}, i.e. an 8n8n-th root of unity. Since a1,a2,a3a_1,a_2,a_3 are not mixed, we can compute a0a_0 from this.

This trick is essentially the same as the DFT(3n) trick used by inv 9E; here it becomes DFT(4n).

After obtaining a0,a1,a2a_0,a_1,a_2, the rest is basically the same as before, so I will not repeat the details. The concrete computation is:

  1. We know g0modxng_0 \bmod x^n.
  2. Compute fg02mod(x4n1)fg_0^2 \bmod (x^{4n}-1), costing 3E(4n).
  3. Use a length-nn twisted DFT to compute fg02fg_0^2 at xn=ηx^n=\eta, costing 3E(n).
  4. Recover a0,a1,a2a_0,a_1,a_2.
  5. Compute a02=q0+q1xna_0^2=q_0+q_1x^n, costing 2E(2n).
  6. Compute a0(34a1516q0)modxna_0(\frac34a_1-\frac{5}{16}q_0)\bmod x^n, costing 2E(2n), reusing F2n(a0)\mathcal{F}_{2n}(a_0) from the previous step.
  7. Form c=c0+c1xn+c2x2nc=\overline{c_0}+\overline{c_1}x^n+\overline{c_2}x^{2n}.
  8. Compute g0cg_0c, costing 2E(4n), reusing F4n(g0)\mathcal{F}_{4n}(g_0) from the first step.

Thus one n4nn\to 4n step costs

3E(4n)+3E(n)+2E(2n)+2E(2n)+2E(4n)=31E(n).3\mathsf{E}(4n)+3\mathsf{E}(n)+2\mathsf{E}(2n)+2\mathsf{E}(2n)+2\mathsf{E}(4n)=31\mathsf{E}(n).

So the total complexity is

T(4n)=T(n)+31E(n),T(n)=313E(n)10.33E(n).T(4n)=T(n)+31\mathsf{E}(n), \quad T(n)=\frac{31}{3}\mathsf{E}(n)\approx 10.33\mathsf{E}(n).

Performance Comparison

The NTT modulus is P=998244353P=998244353, and the implementation uses AVX2 optimizations.

inv

Algorithm1024 us/op4096 us/op8192 us/op65536 us/op524288 us/op
inv9e8.62638.09880.901780.2107359.195
inv9.33e8.14636.45978.166765.0667362.018
inv7.5e8.50345.36494.343884.2958058.103
inv12e11.19751.168110.1281069.60010351.450
inv10e8.39538.76282.788810.6737553.723

invsqrt

Algorithm1024 us/op8192 us/op65536 us/op524288 us/op
invsqrt12e11.908116.2661114.25610902.573
invsqrt10.33e10.01292.285895.0628783.462

Summary

By trying a fourth-order Newton iteration, this post derives a 9.33E algorithm for power series inverse and a 10.33E algorithm for power series inverse square root. The inv algorithm does not improve the theoretical complexity, but its real-machine performance is quite good. For invsqrt, I have not found related literature; perhaps this operator simply does not get much attention. It seems that forums usually use the standard 12E Newton iteration, so this approach might be the first time it appears on the internet.

I also have an idea for an 18E pow algorithm. After I try to verify it, I may submit implementations of the two algorithms in this post to OJ and test them there.