Several Ways to Optimize Modular Reduction

Translated by ChatGPT.

Modulo is slow, especially when the modulus is dynamic.

I have not studied error analysis, so the explanation here is probably rather amateurish (

Floating-point implementation

Clearly we have the identity

amodm=aamma \bmod m = a - \left\lfloor \frac{a}{m} \right\rfloor m

In general, mm is in the i32 range, and a[0,m2)a \in [0, m^2). The simplest implementation is floating-point division.

struct ModF {
  u32 m;
  f64 ivm;
  ModF(i32 m_) : m(m_), ivm(1.0 / m) {}
  u32 calc(u64 a) const {
    u32 r = a - i64(a * ivm) * m;
    if (r >= m)
      r -= m;
    if (r < 0)
      r += m;
    return r;
  }
};

Barrett Reduction

Consider an alternative to division. Choose q,sq, s such that

a/m=aq2s,q2sm\lfloor a/m \rfloor = \left\lfloor \frac{aq}{2^s} \right\rfloor, \quad q \approx \frac{2^s}{m}

holds for as many aa as possible. Since aa is an integer, let the error be

q2s=1m+ε\frac{q}{2^s} = \frac{1}{m} + \varepsilon

When 0aε<1m0 \leqslant a\varepsilon < \frac{1}{m}, the division is exact, so it is convenient to choose

q=2sm=2s+m1mq = \left\lceil \frac{2^s}{m} \right\rceil = \left\lfloor \frac{2^s + m - 1}{m} \right\rfloor

Then we can derive

ε=q2s1m<2s+m12sm1m=m1m2s<ε\varepsilon = \frac{q}{2^s} - \frac{1}{m} < \frac{2^s + m - 1}{2^sm} -\frac{1}{m} = \frac{m-1}{m 2^s} < \overline{\varepsilon}

After simplification, 2s>m1mε2^s > \frac{m - 1}{m \overline{\varepsilon}}. If a[0,m2)a \in [0, m^2), we roughly need s3log2ms \approx 3\log_2 m.

struct Barrett {
  enum { s = 96 };
  static constexpr u128 s2 = u128(1) << s;
  u32 m;
  u128 ivm;
  Barrett(u32 m_) : m(m_), ivm((s2 - 1) / m + 1) {}
  u32 div(u64 a) const {
    return a * ivm >> s;
  }
  u32 calc(u64 a) const {
    return a - u64(div(a)) * m;
  }
};

Multiplication by a fixed value

If we need to compute a×bmodma \times b \bmod m, where bb is fixed, part of the formula above can be merged.

The derivation is essentially the same. Reconsider the substituted expression and still choose

q=2sbm=2sb+m1mq = \left\lceil \frac{2^sb}{m} \right\rceil = \left\lfloor \frac{2^sb + m - 1}{m} \right\rfloor

Since aa is an integer, let the error be

q2s=bm+ε\frac{q}{2^s} = \frac{b}{m} + \varepsilon

When 0aε<1m0 \leqslant a\varepsilon < \frac{1}{m}, the division is exact. We derive

ε=q2sbm<2sb+m12smbm=m1m2s<ε\varepsilon = \frac{q }{2^s} - \frac{b}{m} < \frac{2^sb + m - 1}{2^s m} -\frac{b}{m} = \frac{m-1}{m 2^s} < \overline{\varepsilon}

After simplification, 2s>m1mε2^s > \frac{m - 1}{m \overline{\varepsilon}}. If a[0,m)a \in [0, m), we roughly need s2log2ms \approx 2\log_2 m.

struct MulBarrett {
  enum { s = 64 };
  static constexpr u128 s2 = u128(1) << s;
  u32 b, m;
  u64 ivm;
  MulBarrett(u32 b_, u32 m_) : b(b_), m(m_), ivm((s2 * b - 1) / m + 1) {}
  u64 div(u64 a) const {
    return u128(a) * ivm >> s;
  }
  u32 calc(u32 a) const {
    return a * b - u64(div(a)) * m;
  }
};

Adjustment

Modulo does not have to be done perfectly in one step.

Consider one adjustment step (it adjusts the modulo result; the division itself cannot be adjusted). Let 0ε<1m0 \leqslant \varepsilon < \frac{1}{m}, and we get 2s>aa/m2^s > a - a / m. For example, when a[0,m)a \in [0, m), slog2ms \approx \log_2 m.

Therefore, fixed-value multiplication can be done entirely under u64; if the carry flag is obtained through special means, u64 modulo can also be completed entirely under u64.

Since adjustment is allowed, there is no need to be constrained to rounding up. Rounding down can also be implemented.

Lemire Reduction

The modulo result can be obtained more directly:

r=(amam)m=(aq2saεaqmaε)m=(aqaq2saε2s2saε)m2s\begin{aligned} r &= \left\lfloor \left(\frac{a}{m} - \left\lfloor \frac{a}{m} \right\rfloor \right) m \right\rfloor \\ &= \left\lfloor \left(\frac{aq}{2^s} - a \varepsilon - \left\lfloor \frac{aq}{m} - a \varepsilon \right\rfloor \right) m \right\rfloor \\ &= \left\lfloor \left(aq - \left\lfloor \frac{aq}{2^s} - a \varepsilon \right\rfloor 2^s - 2^s a \varepsilon \right) \frac{m}{2^s} \right\rfloor \end{aligned}

Assume the division is exact, i.e. 0aε<1m0 \leqslant a \varepsilon < \frac{1}{m}. Notice that this can be turned into a modulo:

r=(aqmod2s)m2s+aεm=(aqmod2s)m2s\begin{aligned} r &= \left\lfloor \frac{(aq \bmod 2^s) \cdot m}{2^s} + a\varepsilon m \right\rfloor \\ &= \left\lfloor \frac{(aq \bmod 2^s) \cdot m }{2^s} \right\rfloor \end{aligned}

The precision is still s3log2ms \approx 3\log_2 m.

struct Lemire {
  enum { s = 96 };
  static constexpr u128 s2 = u128(1) << s;
  u32 m;
  u128 q;
  Lemire(u32 m_) : m(m_), q((s2 - 1) / m + 1) {}
  u32 calc(u64 a) const {
    return a * q % s2 * u128(m) >> s;
  }
};

Multiplication by a fixed value

The derivation is the same:

r=(aqaq2saε2s2saε)m2s=(aqmod2s)m2s\begin{aligned} r &= \left\lfloor \left(aq - \left\lfloor \frac{aq}{2^s} - a \varepsilon \right\rfloor 2^s - 2^s a \varepsilon \right) \frac{m}{2^s} \right\rfloor \\ &= \left\lfloor \frac{(aq \bmod 2^s) \cdot m }{2^s} \right\rfloor \end{aligned}

The precision is still the division-exact s2log2ms \approx 2\log_2 m.

struct MulLemire {
  enum { s = 64 };
  static constexpr u128 s2 = u128(1) << s;
  u32 b, m;
  u64 q;
  MulLemire(u32 b_, u32 m_) : b(b_), m(m_), q((s2 * b - 1) / m + 1) {}
  u32 calc(u32 a) const {
    return a * q * u128(m) >> s;
  }
};

Adjustment

Lemire Reduction cannot be adjusted.

The previous method can be adjusted because the division error is 1, so after taking modulo the result is only off by a multiple of mm. Here we directly compute the modulo result. If it is off by 1 or 2, there is no way to correct it, so adjustment is impossible.

Montgomery multiplication

Montgomery multiplication can make better use of SIMD acceleration, and compilers can also auto-vectorize it.

The adjusted Barrett fixed-value multiplication optimization does not involve u128, so compilers can vectorize it as well, but the division cost is high. In short, it still does not work very well.

Montgomery space

Let the constant rr satisfy gcd(r,m)=1 \gcd(r, m) = 1 and rmr \geqslant m. Usually we choose 2642^{64} or 2322^{32}.

Define the value of xx in Montgomery space as

x=xrmodm\overline{x} = xr \bmod m

Addition and subtraction are trivial:

x±y=(x±y)rmodm\overline{x} \pm \overline{y} = (x \pm y) r \bmod m

Multiplication is a bit special:

xy=xyr=xyr1modm\overline{x} \ast \overline{y} = xy r = \overline{x} \cdot \overline{y} \cdot r^{-1} \bmod m

So the key lies in two functions:

  • transform: compute xrmodmx \cdot r \bmod m.
  • reduce: compute xr1modmx \cdot r^{-1} \bmod m.

Reduce

Montgomery pointed out that division by 2 does not require division:

a2modm={a2,a is evena+m2,a is odd\frac{a}{2} \bmod m = \begin{cases} \frac{a}{2} &, a \text{ is even} \\ \frac{a + m}{2} &, a\text{ is odd} \\ \end{cases}

And when, for example, r=232r = 2^{32},

aaam(am1)0(modr)a - a \equiv a - m (a m^{-1}) \equiv 0 \pmod r

Therefore set μ=m1modr\mu = -m^{-1} \bmod r, and we have

xr1x+m(xμmodr)r(modm)x r^{-1} \equiv \frac{x + m(x\mu \bmod r)}{r} \pmod m

Considering the value range, the expression above is less than 2m2m. But the intermediate value does not have to be normalized; we can be lazy and normalize only when mapping back out.

Transform

The input is usually [0,m)[0, m), so multiplying directly is enough. We can also use reduction, but it is not better.

Maybe Barrett can optimize this? I have not tested it; I guess the adjusted version might be faster.

Faster inverse modulo 2k2^k

When rr is a power of 22, the inverse can be computed with Newton iteration. Suppose xx is the inverse of aa modulo 2k2^k:

ax2kq+1(mod22k)ax \equiv 2^k q + 1 \pmod {2^{2k}}

Then

ax(2ax)122kq21(mod22k)ax(2-ax) \equiv 1 - 2^{2k}q^2 \equiv 1 \pmod {2^{2k}}

Implementation

struct Montgomery {
  u32 m, ir;
  static u32 inv_m(u32 m) {
    u32 x = 1;
    for (i32 i = 0; i < 5; ++i)
      x *= 2 - x * m;
    return x;
  }
  Montgomery(u32 m_) : m(m_), ir(-inv_m(m)) {}
  u32 tran(u32 a) const {
    return (u64(a) << 32) % m;
  }
  u32 val(u32 a) const { // itrans
    return redc_m(redc(a) - m);
  }
  u32 add(u32 a, u32 b) const {
    return redc_m2(a + b - m * 2);
  }
  u32 sub(u32 a, u32 b) const {
    return redc_m2(a - b);
  }
  u32 redc_m(u32 a) const {
    return a >> 31 ? a + m : a;
  }
  u32 redc_m2(u32 a) const {
    return a >> 31 ? a + m * 2 : a;
  }
  u32 redc(u64 a) const {
    return (a + u64(u32(a) * ir) * m) >> 32;
  }
  u32 mul(u32 a, u32 b) const {
    return redc(u64(a) * b);
  }
};

Others

@platelet mentioned the fixed-value multiplication optimization of Lemire Reduction in his blog post A Modulo Algorithm Twice as Fast as the Compiler’s Implementation. But division is too expensive. In a long nn NTT there are O(n)O(n) roots of unity to process, so it does not show an advantage. Barrett is actually a bit better.

Since division is expensive, maybe initialization can also be wrapped with Barrett? I will try it next time when I have time.

References