Several Ways to Optimize Modular Reduction
Translated by ChatGPT.
Modulo is slow, especially when the modulus is dynamic.
I have not studied error analysis, so the explanation here is probably rather amateurish (
Floating-point implementation
Clearly we have the identity
In general, is in the i32 range, and . The simplest implementation is floating-point division.
struct ModF {
u32 m;
f64 ivm;
ModF(i32 m_) : m(m_), ivm(1.0 / m) {}
u32 calc(u64 a) const {
u32 r = a - i64(a * ivm) * m;
if (r >= m)
r -= m;
if (r < 0)
r += m;
return r;
}
};
Barrett Reduction
Consider an alternative to division. Choose such that
holds for as many as possible. Since is an integer, let the error be
When , the division is exact, so it is convenient to choose
Then we can derive
After simplification, . If , we roughly need .
struct Barrett {
enum { s = 96 };
static constexpr u128 s2 = u128(1) << s;
u32 m;
u128 ivm;
Barrett(u32 m_) : m(m_), ivm((s2 - 1) / m + 1) {}
u32 div(u64 a) const {
return a * ivm >> s;
}
u32 calc(u64 a) const {
return a - u64(div(a)) * m;
}
};
Multiplication by a fixed value
If we need to compute , where is fixed, part of the formula above can be merged.
The derivation is essentially the same. Reconsider the substituted expression and still choose
Since is an integer, let the error be
When , the division is exact. We derive
After simplification, . If , we roughly need .
struct MulBarrett {
enum { s = 64 };
static constexpr u128 s2 = u128(1) << s;
u32 b, m;
u64 ivm;
MulBarrett(u32 b_, u32 m_) : b(b_), m(m_), ivm((s2 * b - 1) / m + 1) {}
u64 div(u64 a) const {
return u128(a) * ivm >> s;
}
u32 calc(u32 a) const {
return a * b - u64(div(a)) * m;
}
};
Adjustment
Modulo does not have to be done perfectly in one step.
Consider one adjustment step (it adjusts the modulo result; the division itself cannot be adjusted). Let , and we get . For example, when , .
Therefore, fixed-value multiplication can be done entirely under u64; if the carry flag is obtained through special means, u64 modulo can also be completed entirely under u64.
Since adjustment is allowed, there is no need to be constrained to rounding up. Rounding down can also be implemented.
Lemire Reduction
The modulo result can be obtained more directly:
Assume the division is exact, i.e. . Notice that this can be turned into a modulo:
The precision is still .
struct Lemire {
enum { s = 96 };
static constexpr u128 s2 = u128(1) << s;
u32 m;
u128 q;
Lemire(u32 m_) : m(m_), q((s2 - 1) / m + 1) {}
u32 calc(u64 a) const {
return a * q % s2 * u128(m) >> s;
}
};
Multiplication by a fixed value
The derivation is the same:
The precision is still the division-exact .
struct MulLemire {
enum { s = 64 };
static constexpr u128 s2 = u128(1) << s;
u32 b, m;
u64 q;
MulLemire(u32 b_, u32 m_) : b(b_), m(m_), q((s2 * b - 1) / m + 1) {}
u32 calc(u32 a) const {
return a * q * u128(m) >> s;
}
};
Adjustment
Lemire Reduction cannot be adjusted.
The previous method can be adjusted because the division error is 1, so after taking modulo the result is only off by a multiple of . Here we directly compute the modulo result. If it is off by 1 or 2, there is no way to correct it, so adjustment is impossible.
Montgomery multiplication
Montgomery multiplication can make better use of SIMD acceleration, and compilers can also auto-vectorize it.
The adjusted Barrett fixed-value multiplication optimization does not involve u128, so compilers can vectorize it as well, but the division cost is high. In short, it still does not work very well.
Montgomery space
Let the constant satisfy and . Usually we choose or .
Define the value of in Montgomery space as
Addition and subtraction are trivial:
Multiplication is a bit special:
So the key lies in two functions:
transform: compute .reduce: compute .
Reduce
Montgomery pointed out that division by 2 does not require division:
And when, for example, ,
Therefore set , and we have
Considering the value range, the expression above is less than . But the intermediate value does not have to be normalized; we can be lazy and normalize only when mapping back out.
Transform
The input is usually , so multiplying directly is enough. We can also use reduction, but it is not better.
Maybe Barrett can optimize this? I have not tested it; I guess the adjusted version might be faster.
Faster inverse modulo
When is a power of , the inverse can be computed with Newton iteration. Suppose is the inverse of modulo :
Then
Implementation
struct Montgomery {
u32 m, ir;
static u32 inv_m(u32 m) {
u32 x = 1;
for (i32 i = 0; i < 5; ++i)
x *= 2 - x * m;
return x;
}
Montgomery(u32 m_) : m(m_), ir(-inv_m(m)) {}
u32 tran(u32 a) const {
return (u64(a) << 32) % m;
}
u32 val(u32 a) const { // itrans
return redc_m(redc(a) - m);
}
u32 add(u32 a, u32 b) const {
return redc_m2(a + b - m * 2);
}
u32 sub(u32 a, u32 b) const {
return redc_m2(a - b);
}
u32 redc_m(u32 a) const {
return a >> 31 ? a + m : a;
}
u32 redc_m2(u32 a) const {
return a >> 31 ? a + m * 2 : a;
}
u32 redc(u64 a) const {
return (a + u64(u32(a) * ir) * m) >> 32;
}
u32 mul(u32 a, u32 b) const {
return redc(u64(a) * b);
}
};
Others
@platelet mentioned the fixed-value multiplication optimization of Lemire Reduction in his blog post A Modulo Algorithm Twice as Fast as the Compiler’s Implementation. But division is too expensive. In a long NTT there are roots of unity to process, so it does not show an advantage. Barrett is actually a bit better.
Since division is expensive, maybe initialization can also be wrapped with Barrett? I will try it next time when I have time.
References
- https://en.wikipedia.org/wiki/Barrett_reduction
- https://en.algorithmica.org/hpc/arithmetic/division/
- https://arxiv.org/pdf/1407.3383.pdf