Fourth-Order Newton Iteration for Power Series: 9.33E inv and 10.33E invsqrt
Translated by ChatGPT.
Draft…
After translating David Harvey’s two papers, namely Block optimizations for Newton iteration on power series, I have always been curious about the implementation of inv 9E. It is a third-order Newton iteration, so its complexity is a noticeable step below other second-order iterations, and it also remains useful in the block-method setting.
So naturally I had a thought: can there be a fourth-order Newton iteration?
After some thinking, I did manage to derive a 328 E inverse algorithm and a 331 E inverse square root algorithm.
Let me describe the algorithm below. The notation is the same as in the previous post: E(n) denotes one DFT/IDFT of length n.
Inverse
Assume we already have the first n terms of the inverse of f. Let
g0=f−1modxn
To extend it to 4n terms, suppose g has the form
g=g0(1+c0xn+c1x2n+c2x3n)+O(x4n)
To solve for the polynomials c0,c1,c2, we can compute the residual of fg0:
One thing to notice is that each ai here is a polynomial block of length n, so multiplication carries into the next block. We still need to reduce properly. Write
a02=q0+q1xn
Substituting this in gives exactly the formula we need to compute:
If we could compute the first three blocks of C directly, we would be done. The problem is that the residual for inverse square root is fg02−1, where g02 already has 2n terms and f has 4n terms, so a direct ordinary convolution would need length 8n. We therefore need a cyclic-convolution trick.
Observe that f times g02 has at most 6 terms, and the first term is 1, so in fact there are only 5 blocks. Thus the 4n cyclic convolution can be written as
fg02mod(x4n−1)=1+a3+(a0+a4)xn+a1x2n+a2x3
Next we need to extract a4. One way is to compute one more point value:
Here ξ8 is an 8th root of unity, ξ84=−1, and η=ξ8n, i.e. an 8n-th root of unity. Since a1,a2,a3 are not mixed, we can compute a0 from this.
This trick is essentially the same as the DFT(3n) trick used by inv 9E; here it becomes DFT(4n).
After obtaining a0,a1,a2, the rest is basically the same as before, so I will not repeat the details. The concrete computation is:
We know g0modxn.
Compute fg02mod(x4n−1), costing 3E(4n).
Use a length-n twisted DFT to compute fg02 at xn=η, costing 3E(n).
Recover a0,a1,a2.
Compute a02=q0+q1xn, costing 2E(2n).
Compute a0(43a1−165q0)modxn, costing 2E(2n), reusing F2n(a0) from the previous step.
Form c=c0+c1xn+c2x2n.
Compute g0c, costing 2E(4n), reusing F4n(g0) from the first step.
Thus one n→4n step costs
3E(4n)+3E(n)+2E(2n)+2E(2n)+2E(4n)=31E(n).
So the total complexity is
T(4n)=T(n)+31E(n),T(n)=331E(n)≈10.33E(n).
Performance Comparison
The NTT modulus is P=998244353, and the implementation uses AVX2 optimizations.
inv
Algorithm
1024 us/op
4096 us/op
8192 us/op
65536 us/op
524288 us/op
inv9e
8.626
38.098
80.901
780.210
7359.195
inv9.33e
8.146
36.459
78.166
765.066
7362.018
inv7.5e
8.503
45.364
94.343
884.295
8058.103
inv12e
11.197
51.168
110.128
1069.600
10351.450
inv10e
8.395
38.762
82.788
810.673
7553.723
invsqrt
Algorithm
1024 us/op
8192 us/op
65536 us/op
524288 us/op
invsqrt12e
11.908
116.266
1114.256
10902.573
invsqrt10.33e
10.012
92.285
895.062
8783.462
Summary
By trying a fourth-order Newton iteration, this post derives a 9.33E algorithm for power series inverse and a 10.33E algorithm for power series inverse square root. The inv algorithm does not improve the theoretical complexity, but its real-machine performance is quite good. For invsqrt, I have not found related literature; perhaps this operator simply does not get much attention. It seems that forums usually use the standard 12E Newton iteration, so this approach might be the first time it appears on the internet.
I also have an idea for an 18E pow algorithm. After I try to verify it, I may submit implementations of the two algorithms in this post to OJ and test them there.