An 18E Block Algorithm for Power Series Power
Translated by ChatGPT.
Draft…
I finally got something somewhat original done.
In the previous post I wrote about fourth-order Newton iteration, and at the end I mentioned an idea for an 18E pow algorithm. Now I will write it down.
Problem
Power series power is defined as
The most direct method is
In general, if you use a relatively plain 13E ln and 17E exp, you get a 30E pow algorithm. If you choose the SOTA 10E ln + 11.5E exp, you get a 21.5E pow. In my experiments, 10E ln + 14E exp performs the best in practice, giving a 24E pow.
There are also standalone pow algorithms. The current SOTA is 20.25E. This post proposes a new 18E algorithm for computing power series power.
When I was trying online convolution before, I wondered whether it was possible to maintain the iteration formulas on both sides at the same time and get better complexity. This time I continued along that line and made some progress.
Blocking
I recommend reading Block optimizations for Newton iteration on power series. This post uses quite a few conclusions from it.
Let the block length be , and let . Write
Assume we already know , and now want to compute .
In the previous block inverse method, we had the following conclusion. First, let be the convolution of without , namely
Then we have
Next we will apply this difference of two convolutions to both sides of the iteration formula.
Differential Equation
Following the earlier notation, consider the in-place differential operator .
For convenience, write , i.e. a shifted differential operator. Then
Consider . Differentiating both sides gives
Add to both sides:
This equation is more suitable for blocking, because the left side contains , while the final unknown we need to solve for is the new block of . Its block form is
Now consider the two sides when the -th block is missing. On the left side, we maintain and apply formula (1), obtaining
For the right side, maintain and apply formula (1) again:
Then using equation (2), we get
Now we need to solve for . Solving this directly is difficult: one side contains , while the other contains , so they do not cancel.
Consider the substitution
Since ,
Also notice that
These are exactly the two terms in formula (2), so
That is,
Concrete Procedure
The method for computing and is exactly the same as in the previous block method: compute the -th block of and . Since is shared, we only need to maintain the block DFTs of , and two IDFTs give and .
The concrete computation is:
- Assume has already been computed.
- Compute . This 2E cost is counted inside the iteration, so it has no extra cost here.
- Compute , costing 4E.
- Precompute using any 6E algorithm.
- Compute each block in order:
- Compute , , and , costing 6E.
- Use block multiplication to compute , costing 2E.
- Use block multiplication to compute , costing 2E.
- Compute , linear cost.
- Compute , costing 4E.
- Compute , linear cost.
- Compute , costing 4E.
In short, the complexity per new block is 18E. When the number of blocks , the amortized complexity is also 18E.
Strictly speaking, the complexity satisfies .
Implementation And Tests
The performance test uses NTT modulus , fixes the exponent to , and uses AVX2 optimization.
| Algorithm | 1024 us/op | 8192 us/op | 65536 us/op | 524288 us/op |
|---|---|---|---|---|
| pow24e | 22.407 | 210.397 | 2053.427 | 19850.235 |
| pow18e | 18.233 | 169.529 | 1593.649 | 15228.692 |
The speedup is fairly stable, roughly a bit over 20%, which matches expectation.
Summary
This started as idle speculation, but the more I thought about it, the more reasonable it became, and eventually it turned into this post. At last, I have made a small thing that is actually my own.