An 18E Block Algorithm for Power Series Power

Translated by ChatGPT.

Draft…

I finally got something somewhat original done.

In the previous post I wrote about fourth-order Newton iteration, and at the end I mentioned an idea for an 18E pow algorithm. Now I will write it down.

Problem

Power series power is defined as

g=fkmodxn,f(0)=1g = f^k \bmod {x^n}, \qquad f(0)=1

The most direct method is

g=exp(klnf).g = \exp(k \ln f).

In general, if you use a relatively plain 13E ln and 17E exp, you get a 30E pow algorithm. If you choose the SOTA 10E ln + 11.5E exp, you get a 21.5E pow. In my experiments, 10E ln + 14E exp performs the best in practice, giving a 24E pow.

There are also standalone pow algorithms. The current SOTA is 20.25E. This post proposes a new 18E algorithm for computing power series power.

When I was trying online convolution before, I wondered whether it was possible to maintain the iteration formulas on both sides at the same time and get better complexity. This time I continued along that line and made some progress.

Blocking

I recommend reading Block optimizations for Newton iteration on power series. This post uses quite a few conclusions from it.

Let the block length be mm, and let X=xmX=x^m. Write

f=f[0]+f[1]X+,g=g[0]+g[1]X+.f = f_{[0]} + f_{[1]}X + \cdots,\qquad g = g_{[0]} + g_{[1]}X + \cdots.

Assume we already know g[0],,g[s1]g_{[0]},\cdots,g_{[s-1]}, and now want to compute g[s]g_{[s]}.

In the previous block inverse method, we had the following conclusion. First, let ψ\psi be the convolution of fgfg without g[s]g_{[s]}, namely

ψ=(f[0..s]g[0..s1])[s]\psi = (f_{[0..s]}g_{[0..s-1]})_{[s]}

Then we have

(fg)[s]ψ=(f[0..s]g[0..s])[s](f[0..s]g[0..s1])[s]=(f[0..s]g[s]Xs)[s]=f[0]g[s]modX(1)\begin{aligned} (fg)_{[s]} - \psi &= (f_{[0..s]}g_{[0..s]})_{[s]} - (f_{[0..s]}g_{[0..s-1]})_{[s]} \\ &= (f_{[0..s]} g_{[s]} X^{s})_{[s]} \\ &= f_{[0]} \ast g_{[s]} \bmod X \end{aligned} \tag{1}

Next we will apply this difference of two convolutions to both sides of the iteration formula.

Differential Equation

Following the earlier notation, consider the in-place differential operator Δh=xh(x)\Delta h = xh'(x).

For convenience, write Δif=XiΔ(Xif)\Delta_i f = X^{-i} \Delta(X^i f), i.e. a shifted differential operator. Then

Δ(f[0]+f[1]X+)=(Δ0f[0])+(Δ1f[1])+\Delta (f_{[0]} + f_{[1]} X + \cdots) = (\Delta_0 f_{[0]}) + (\Delta_1 f_{[1]}) + \cdots

Consider g=fkg=f^k. Differentiating both sides gives

fΔg=kgΔff\Delta g = k g \Delta f

Add gΔfg \Delta f to both sides:

Δ(fg)=(k+1)gΔf\Delta(fg) = (k+1)g\Delta f

This equation is more suitable for blocking, because the left side contains fgfg, while the final unknown we need to solve for is the new block of gg. Its block form is

(Δfg)[s]=(k+1)(gΔf)[s](2)(\Delta fg)_{[s]} = (k+1)(g \Delta f)_{[s]} \tag{2}

Now consider the two sides when the ss-th block is missing. On the left side, we maintain fgfg and apply formula (1), obtaining

(fg)[s]ψ=f[0]g[s]modX.(fg)_{[s]} - \psi = f_{[0]}g_{[s]} \bmod X.

For the right side, maintain gΔfg \Delta f and apply formula (1) again:

(gΔf)[s]ϕ=(Δf[0])g[s]modX(g \Delta f)_{[s]} - \phi = (\Delta f_{[0]}) g_{[s]} \bmod X

Then using equation (2), we get

Δs(ψ+f[0]g[s])=(k+1)(ϕ+g[s]Δf[0])(modX)\Delta_s (\psi + f_{[0]}g_{[s]}) = (k+1)(\phi + g_{[s]}\Delta f_{[0]}) \pmod X

Now we need to solve for g[s]g_{[s]}. Solving this directly is difficult: one side contains g[s]g_{[s]}, while the other contains Δsg[s]\Delta_s g_{[s]}, so they do not cancel.

Consider the substitution

g[s]=g[0]u.g_{[s]} = g_{[0]}u.

Since g[0]=f[0]kg_{[0]} = f_{[0]}^k,

Δ(f[0]g[0])=f[0]Δg[0]+g[0]Δf[0]=(k+1)g[0]Δf[0].\Delta (f_{[0]}g_{[0]}) = f_{[0]}\Delta g_{[0]} + g_{[0]}\Delta f_{[0]} = (k+1)g_{[0]}\Delta f_{[0]}.

Also notice that

Δs(f[0]g[s])=Δs(f[0]g[0]u)=f[0]g[0]Δsu+uΔ(f[0]g[0])=f[0]g[0]Δsu+(k+1)g[0]uΔf[0]=f[0]g[0]Δsu+(k+1)g[s]Δf[0]\begin{aligned} \Delta_s(f_{[0]}g_{[s]}) &= \Delta_s(f_{[0]}g_{[0]}u) \\ &= f_{[0]}g_{[0]}\Delta_s u + u\Delta (f_{[0]}g_{[0]}) \\ &= f_{[0]}g_{[0]}\Delta_s u + (k+1)g_{[0]} u \Delta f_{[0]} \\ &= f_{[0]}g_{[0]}\Delta_s u + (k+1)g_{[s]} \Delta f_{[0]} \end{aligned}

These are exactly the two terms in formula (2), so

f[0]g[0]Δsu=(k+1)ϕΔsψ.f_{[0]}g_{[0]}\Delta_s u = (k+1)\phi - \Delta_s\psi.

That is,

u=Δs1((k+1)ϕΔsψf[0]g[0]),g[s]=g[0]u.u = \Delta_s^{-1}\left(\frac{(k+1)\phi - \Delta_s\psi}{f_{[0]}g_{[0]}}\right), \qquad g_{[s]} = g_{[0]}u.

Concrete Procedure

The method for computing ψ\psi and ϕ\phi is exactly the same as in the previous block method: compute the ss-th block of fgfg and gΔfg \Delta f. Since gg is shared, we only need to maintain the block DFTs of f,Δf,gf,\Delta f,g, and two IDFTs give ψ\psi and ϕ\phi.

The concrete computation is:

  1. Assume g[0]=f[0]kg_{[0]} = f_{[0]}^k has already been computed.
  2. Compute F2m(g[0])\mathcal{F}_{2m}(g_{[0]}). This 2E cost is counted inside the iteration, so it has no extra cost here.
  3. Compute f[0]g[0]modXf_{[0]}g_{[0]} \bmod X, costing 4E.
  4. Precompute h=(f[0]g[0])1h = (f_{[0]}g_{[0]})^{-1} using any 6E algorithm.
  5. Compute each block ss in order:
    1. Compute F2m(f[s])\mathcal{F}_{2m}(f_{[s]}), F2m(Δsf[s])\mathcal{F}_{2m}(\Delta_s f_{[s]}), and F2m(g[s1])\mathcal{F}_{2m}(g_{[s-1]}), costing 6E.
    2. Use block multiplication to compute ψ\psi, costing 2E.
    3. Use block multiplication to compute ϕ\phi, costing 2E.
    4. Compute ρ=(k+1)ϕΔsψ\rho = (k+1)\phi - \Delta_s\psi, linear cost.
    5. Compute v=hρmodXv = h\rho \bmod X, costing 4E.
    6. Compute u=Δs1vu = \Delta_s^{-1}v, linear cost.
    7. Compute g[s]=g[0]umodXg_{[s]} = g_{[0]}u \bmod X, costing 4E.

In short, the complexity per new block is 18E. When the number of blocks rr \to \infty, the amortized complexity is also 18E.

Strictly speaking, the complexity satisfies T(n)18E(n)+O(nlogn)T(n) \leqslant 18\mathsf{E}(n) + O(n \sqrt{\log n}).

Implementation And Tests

The performance test uses NTT modulus P=998244353P=998244353, fixes the exponent to k=7k=7, and uses AVX2 optimization.

Algorithm1024 us/op8192 us/op65536 us/op524288 us/op
pow24e22.407210.3972053.42719850.235
pow18e18.233169.5291593.64915228.692

The speedup is fairly stable, roughly a bit over 20%, which matches expectation.

Summary

This started as idle speculation, but the more I thought about it, the more reasonable it became, and eventually it turned into this post. At last, I have made a small thing that is actually my own.