Optimising multiplication modulo a small prime
Question:
I need to do the following operation many times:
- Take two integers a, b
- Compute a * b mod p, where p = 1000000007 and a, b are of the same order of magnitude as p
My gut feeling is the naive
result = a * b
result %= p
is inefficient. Can I optimise multiplication modulo p much like exponentiation modulo p is optimised with pow(a, b, p)?
Answers:
Although this is trivially simple, you could try to save some time on the mod p step by building a list of multiples of 1000000007 (the size of the list depends on the size of a and b). Test the product against each of those (starting with the highest) to find the modulo. Granted, this only helps if a and b are both >= sqrt(p) * 2.
You mention that “a, b are of the same order of magnitude as p.” Often in cryptography this means that a, b are large numbers near p, but strictly less than p.
If this is the case, then you could use the simple identity
a * b ≡ (a - p) * (b - p) (mod p)
which holds because (a - p) * (b - p) = a * b - p * (a + b - p), to turn your calculation into
result = ((a-p)*(b-p))%p
You’ve then turned one large multiplication into two large subtractions and a small multiplication. You’ll have to profile to see which is faster.
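As a quick sanity check of the identity, here is a minimal Python sketch. The function names mulmod_naive and mulmod_shifted are made up for illustration, and the sketch assumes 0 <= a, b < p. It only shows that both forms give the same residue; whether the shifted form is faster still has to be measured.

```python
P = 1_000_000_007  # the modulus from the question

def mulmod_naive(a, b, p=P):
    """Reference: multiply, then reduce."""
    return (a * b) % p

def mulmod_shifted(a, b, p=P):
    """Same residue via (a - p) * (b - p); assumes 0 <= a, b < p.

    Both factors are negative, so their product is positive, and it is
    small in magnitude when a and b are close to p.
    """
    return ((a - p) * (b - p)) % p

a, b = P - 5, P - 12
print(mulmod_naive(a, b), mulmod_shifted(a, b))  # prints: 60 60
```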
To do this calculation in assembly, but have it callable from Python, I’d try inline assembly from a Python module written in C. Both GCC and MSVC compilers feature inline assembly, only with differing syntax.
Note that our modulus p = 1000000007 just fits into 30 bits. The result desired (a*b)%p can be computed in Intel 80x86 registers given some weak restrictions on a, b not being much bigger than p.
Restrictions on size of a, b
(1) a, b are 32-bit unsigned integers
(2) a*b is less than p << 32, i.e. p times 2^32
In particular if a, b are each less than 2*p, overflow will be avoided. Given (1), it also suffices that either one of them is less than p.
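The size claims above are easy to confirm with Python's arbitrary-precision integers. This is only an arithmetic check of the stated bounds; the names LIMIT and U32_MAX are introduced here for illustration.

```python
P = 1_000_000_007
LIMIT = P << 32        # condition (2): a*b must stay below p * 2^32
U32_MAX = 2**32 - 1    # condition (1): largest 32-bit unsigned value

print(P.bit_length())             # 30, so p just fits into 30 bits
print((2 * P - 1) ** 2 < LIMIT)   # True: a, b < 2*p satisfies (2)
print((P - 1) * U32_MAX < LIMIT)  # True: one factor < p, the other any 32-bit value
```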
The Intel 80x86 instruction MUL can multiply two 32-bit unsigned integers
and store the 64-bit result in accumulator register pair EDX:EAX. Some
details and quirks of MUL are discussed in Section 10.2.1 of this helpful
summary.
The instruction DIV can then divide this 64-bit result by a 32-bit constant
(the modulus p), storing the quotient in EAX and the remainder in EDX.
See Section 10.2.2 of the last link. The result we want is that remainder.
It is this division instruction DIV that entails a risk of overflow, should
the 64-bit product in numerator EDX:EAX give a quotient larger than 32 bits
by failing to satisfy (2) above.
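To make the register-level description concrete, here is a rough emulation of the MUL/DIV sequence in Python. It is not the C/inline-assembly snippet itself, only a model of what the two instructions compute. The function name mulmod_mul_div is hypothetical, and the overflow test relies on the fact that a 64-by-32-bit DIV faults when the high half (EDX) is not smaller than the divisor.

```python
P = 1_000_000_007
MASK32 = 0xFFFFFFFF

def mulmod_mul_div(a, b, p=P):
    """Emulate 32-bit MUL followed by DIV; return the remainder (EDX)."""
    if not (0 <= a <= MASK32 and 0 <= b <= MASK32):
        raise ValueError("a and b must fit in 32 unsigned bits")  # restriction (1)
    product = a * b                       # MUL: 64-bit product in EDX:EAX
    edx, eax = product >> 32, product & MASK32
    if edx >= p:                          # restriction (2) violated: quotient > 32 bits
        raise OverflowError("a*b is not below p << 32")
    quotient, remainder = divmod((edx << 32) | eax, p)  # DIV by the modulus p
    return remainder                      # DIV leaves the remainder in EDX

print(mulmod_mul_div(P - 1, P - 1))       # prints: 1
```

The check edx >= p is equivalent to a*b >= p << 32, so it rejects exactly the inputs that fail restriction (2).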
I’m working on a code snippet in C/inline assembly for “proof of concept”. However the maximum benefit in speed will depend on batching up arrays of data a, b to process, amortizing the overhead of function calls, etc. in Python (if that is the target platform).
This doesn’t answer the question directly, but I would recommend not doing this in pure Python if you’re looking for performance. Some options:
- Make a small library in C that does your computations, and use Python’s ctypes to talk to it.
- Use numpy; probably the best option if you want to stay out of having to deal with compiling stuff yourself. Doing operations one at a time won’t be faster than Python’s own operators, but if you can put multiple ones in a numpy array, computations on them will be much faster than the equivalent in Python.
- Use cython to declare your variables as C integers; again, same as numpy, you will benefit from this the most if you do it in batches (because then you can also optimize the loop).
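As an illustration of the numpy option, here is a minimal sketch of a batched version. The function name mulmod_batch is made up, and the sketch assumes every input is below p, so each product stays below 2^60 and cannot overflow a 64-bit unsigned integer.

```python
import numpy as np

P = 1_000_000_007

def mulmod_batch(a, b, p=P):
    """Element-wise (a * b) % p for arrays of values in [0, p).

    Assumes all inputs are below p (about 2^30), so each product is
    below 2^60 and fits in uint64 without wrapping.
    """
    a = np.asarray(a, dtype=np.uint64)
    b = np.asarray(b, dtype=np.uint64)
    return (a * b) % np.uint64(p)

print(mulmod_batch([P - 1, 123456789], [P - 1, 987654321]).tolist())
```

For inputs that may exceed p, the uint64 product can wrap silently, so reducing them modulo p first would be needed.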
There may be a clue to the optimization if you clarified what you mean by “many times”. For example, if you were collecting the results from a high-frequency loop, the loop itself may offer the means to optimize your routine.
Say the unoptimized loop was:
p = 1000000007
b = 123456789
a = 0
while a < p:
    result = (a * b) % p
    dosomething(a, b, result)
    a += 1
you could optimise out the * and % from the high frequency loop:
p = 1000000007
b = 123456789
a = 0
result = (a * b) % p
while a < p:
    dosomething(a, b, result)
    a += 1
    result += b
    if result >= p:
        result -= p
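The incremental loop can be checked against the direct formula on a short range. The sketch below wraps it in a generator; the name multiples_mod and the count parameter are introduced here for illustration, and dosomething is left out.

```python
def multiples_mod(b, p, count):
    """Yield (a, a*b % p) for a = 0 .. count-1 using only add and subtract."""
    b %= p            # keep the step below p so one subtraction is enough
    result = 0        # (0 * b) % p
    for a in range(count):
        yield a, result
        result += b
        if result >= p:
            result -= p

p = 1000000007
b = 123456789
print(all(r == (a * b) % p for a, r in multiples_mod(b, p, 10000)))  # prints: True
```

This only works because a advances by exactly one each iteration; for arbitrary pairs a, b the multiplication cannot be avoided this way.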