Problem:
In a project involving squaring using fast bignum, the implementation with NTT is slow, even for large numbers (12000 bits or more).
Question:
Solution:
Separating the NTT_fast function into two functions, one with WW[] and the other with iWW[], leads to one less parameter in recursion calls. This provides a small performance improvement.
The original code used a series of branching and safety checks, which can be eliminated by utilizing bit tricks and rearranging the main loop. The assembly code for modmul was also modified for efficiency.
class fourier_NTT { // ... // Modular arithmetics (optimized, but it works only for p >= 0x80000000!!!) inline uint32 modadd(uint32 a, uint32 b) { uint32 d; //d = a + b; //!if (d < a) // d -= p; //!if (d >= p) // d -= p; __asm { mov eax, a; add eax, b } return d; } inline uint32 modsub(uint32 a, uint32 b) { uint32 d; //d = a - b; //!if (a < b) // a - b < 0 when a < b, so no need to check //! d += p; __asm { mov eax, a; sub eax, b } return d; } inline uint32 modmul(uint32 a, uint32 b) { uint32 _a = a; uint32 _b = b; __asm { mov eax, a; mul b; mov ebx, eax; mov eax, 2863311530; mov ecx, edx; mul edx; shld edx, eax, 1; mov eax, 3221225473; mul edx; sub ebx, eax; mov eax, 3221225473; sbb ecx, edx; jc addback; neg ecx; and ecx, eax; sub ebx, ecx; sub ebx, eax; sbb edx, edx; and eax, edx; addback: add eax, ebx; }; } // ... };
Conclusion
These optimizations result in a 1.35x speedup of the NTT transform and modular arithmetics. The code is provided in C and inline assembly.
The above is the detailed content of How Can We Optimize NTT Transform and Modular Arithmetic for Faster Bignum Squaring?. For more information, please follow other related articles on the PHP Chinese website!