Block Optimization for Polynomial Newton Iteration
Translated by ChatGPT.
Draft…
Semi-online convolution can be multi-ary, so why can’t Newton iteration be multi-ary too?
In this post, we use blocking to obtain sqrt(8E), inv(10E), div(10E), and exp(14E), which improves the constants compared with previous methods.
This post can be viewed as a translation of Faster algorithms for the square root and reciprocal of power series and Faster exponentials of power series. The original papers also contain inv(8.66E) and exp(13E) algorithms, but they are too complicated, so I did not try them.
This is my first time writing such an article, and many parts are probably not standardized. Comments are welcome.
Blocking principle
For two polynomials of length , we split into blocks, denoted as
Similarly for . To simplify notation, write .
Their product has some extra terms. Take as an example:
Notice that
because has terms and can be folded back through cyclic convolution. Written in full, this is
In general, for the -th block of , we have
So the corresponding computation is
And , so it does not need to be computed. That is, when the DFT of each block has already been computed, we only need one IDFT to compute the -th block.
Square root
This is actually very similar to semi-online convolution, except that relax eventually updates one point, while here an entire block is iterated.
Similar to ordinary Newton iteration, suppose have already been obtained. Then observe the effect caused by the missing :
Thus
The concrete steps are:
- Given , precompute using any 6E algorithm.
- Compute , costing 2E.
-
- Suppose we are computing the -th block.
- Compute , costing 2E.
- Compute using the optimization above, which only needs one IDFT, costing 2E.
- Compute , costing 4E.
In short, according to the estimate in the paper, if we compute by splitting into blocks, the cost is about 8E.
Poly Poly::sqrt() const {
if (deg() == 1) {
return {front().sqrt()};
}
const int R = 16, iv2 = qpow(2);
int m = get_lim((deg() - 1) / R + 1);
Poly x = cut(m).sqrt(), h = x.inv().ntt(m * 2);
vector<Poly> ng(R);
for (int k = 1; x.deg() < deg(); k++) {
ng[k - 1] = x.cut(m, (k - 1) * m).ntt(m * 2);
Poly psi(m * 2);
for (int j = 0; j < k; j++) {
if (j >= 1) {
for (int i = 0; i < m; i++)
psi[i] -= ng[j][i] * (ng[k - j][i] + ng[k - 1 - j][i]);
for (int i = m; i < m * 2; i++)
psi[i] -= ng[j][i] * (ng[k - j][i] - ng[k - 1 - j][i]);
} else {
for (int i = 0; i < m; i++)
psi[i] -= ng[j][i] * ng[k - 1 - j][i];
for (int i = m; i < m * 2; i++)
psi[i] += ng[j][i] * ng[k - 1 - j][i];
}
}
psi.intt(m * 2).fillZeroH(m * 2);
for (int j = 0; j < min(m, deg() - m * k); j++)
psi[j] += (*this)[m * k + j];
mul(psi, h, m * 2);
x.redeg((k + 1) * m);
for (int i = 0; i < m; i++)
x[m * k + i] = psi[i] * iv2;
}
return x.cut(deg());
}
Actual testing shows that the iterative version’s constant is sensitive to . I suspect this is because the inv cost is too large, and it only gains an advantage when enough iterations are performed. Recursion is better than iteration in terms of splitting. The number of blocks does not have to be a power of .
See the submission for the code.
Reciprocal
Its block form is
The concrete steps are:
- Given , and , no computation is needed.
-
- Suppose we are computing the -th block.
- Compute and , costing 4E.
- Compute , costing 2E.
- Compute , costing 4E.
Poly Poly::inv() const { // 10E
if (deg() == 1) {
return {front().inv()};
}
const int R = 16;
int m = get_lim((deg() - 1) / R + 1);
Poly x = cut(m).inv();
vector<Poly> nf(R), ng(R);
nf[0] = cut(m).ntt(m * 2);
for (int k = 1; x.deg() < deg(); k++) {
nf[k] = cut(m, k * m).ntt(m * 2); // 2E
ng[k - 1] = x.cut(m, (k - 1) * m).ntt(m * 2); // 2E
Poly psi(m * 2);
for (int j = 0; j < k; j++) {
for (int i = 0; i < m; i++)
psi[i] -= ng[j][i] * (nf[k - j][i] + nf[k - 1 - j][i]);
for (int i = m; i < m * 2; i++)
psi[i] -= ng[j][i] * (nf[k - j][i] - nf[k - 1 - j][i]);
}
psi.intt(m * 2).fillZeroH(m * 2); // 2E
mul(psi, ng[0], m * 2); // 4E
x.redeg((k + 1) * m);
for (int i = 0; i < m; i++)
x[m * k + i] = psi[i];
}
return x.cut(deg());
}
When , the cost is about 10E. See the submission for the code.
In the original paper, the polynomial is split into blocks. The first block uses the same method, and the latter two blocks reuse computations better, reaching . I have not read that part yet; interested readers can try it.
Quotient
The paper does not include this. I think adding a on the right side should not affect much, and trying it confirms that it works.
Since does not need DFT, the method is basically the same as reciprocal, so I will not repeat it.
Poly Poly::div(Poly f) const { // 10E
if (deg() == 1) {
return {front() * f[0].inv()};
}
f.redeg(deg());
const int R = 16;
int m = get_lim((deg() - 1) / R + 1);
Poly x = cut(m).div(f), h = f.cut(m).inv().ntt(m * 2);
vector<Poly> nf(R), ng(R);
nf[0] = f.cut(m).ntt(m * 2);
for (int k = 1; x.deg() < deg(); k++) {
nf[k] = f.cut(m, k * m).ntt(m * 2);
ng[k - 1] = x.cut(m, (k - 1) * m).ntt(m * 2);
Poly psi(m * 2);
for (int j = 0; j < k; j++) {
for (int i = 0; i < m; i++)
psi[i] -= ng[j][i] * (nf[k - j][i] + nf[k - 1 - j][i]);
for (int i = m; i < m * 2; i++)
psi[i] -= ng[j][i] * (nf[k - j][i] - nf[k - 1 - j][i]);
}
psi.intt(m * 2).fillZeroH(m * 2);
for (int j = 0; j < min(m, deg() - m * k); j++)
psi[j] += (*this)[m * k + j];
mul(psi, h, m * 2);
x.redeg((k + 1) * m);
for (int i = 0; i < m; i++)
x[m * k + i] = psi[i];
}
return x.cut(deg());
}
When , the cost is about 10E. See the submission for the code.
Exponential
The exponential is different from the previous ones. Its differential equation contains on both sides, so it cannot be computed directly.
Similarly, write it in block form:
For convenience, define , so
Let . Notice that
Therefore
Then can be extracted faster.
The concrete computation:
- Given .
- Compute and , costing 6E + 2E.
- Compute , costing 2E.
-
- Suppose we are computing the -th block.
- Compute and , costing 2E + 2E.
- Compute , costing 2E.
- Compute , costing 4E.
- Compute , costing 4E.
When , the cost is about 14E. See the submission for the code.
Poly Poly::exp() const { // 14E
if (deg() == 1) {
return {1};
}
const int S = 16;
int m = get_lim((deg() - 1) / S + 1);
Poly x = cut(m).exp(), u = x.inv();
vector<Poly> nf(S), ng(S);
Poly df = *this;
for (int i = 0; i < df.deg(); i++)
df[i] *= i;
u.ntt(m * 2);
nf[0] = df.cut(m).ntt(m * 2);
for (int k = 1; x.deg() < deg(); k++) {
nf[k] = df.cut(m, k * m).ntt(m * 2);
ng[k - 1] = x.cut(m, m * (k - 1)).ntt(m * 2);
Poly psi(m * 2);
for (int j = 0; j < k; j++) {
for (int i = 0; i < m; i++)
psi[i] += ng[j][i] * (nf[k - j][i] + nf[k - 1 - j][i]);
for (int i = m; i < m * 2; i++)
psi[i] += ng[j][i] * (nf[k - j][i] - nf[k - 1 - j][i]);
}
psi.intt(m * 2).fillZeroH(m * 2);
mul(psi, u, m * 2).fillZeroH(m * 2);
pre_inv(m * (k + 2));
for (int i = 0; i < m * 2; i++)
psi[i] *= Inv[m * k + i];
mul(psi, ng[0], m * 2).fillZeroH(m * 2);
x.redeg((k + 1) * m);
for (int i = 0; i < m; i++)
x[m * k + i] = psi[i];
}
return x.cut(deg());
}
In the original paper, splitting into blocks reaches 13E. I have not read that part yet; interested readers can try it.
Refence
https://arxiv.org/abs/0910.1926
https://arxiv.org/abs/0911.3110
https://negiizhao.blog.uoj.ac/blog/4671
https://hly1204.github.io/library/math/formal_power_series/formal_power_series.hpp