# Optimising multiplication modulo a small prime

## Question:

I need to do the following operation *many* times:

- Take two integers `a, b`
- Compute `a * b mod p`, where `p = 1000000007` and `a, b` are of the same order of magnitude as `p`

My gut feeling is the naive

```
result = a * b
result %= p
```

is inefficient. Can I optimise multiplication modulo `p` much like exponentiation modulo `p` is optimised with `pow(a, b, p)`?

## Answers:

Although this is trivially simple, you could try it and save some time on the `mod p` step by building a list of products based on `1000000007` (the size of the list depends on the size of `a` and `b`). Test for modulo on each of those (starting with the highest). Granted, this only helps if `a` and `b` are both `>= sqrt(p) * 2`.
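One possible reading of the table idea, as a rough sketch only: precompute the multiples `p * 2**k` that can occur for the expected size of `a * b`, then subtract them from the product, largest first. The helper names `build_multiples` and `reduce_with_table` are made up for illustration, and in pure Python this is very unlikely to beat the built-in `%`; it only shows the mechanics.

```python
P = 1_000_000_007

def build_multiples(max_value, p=P):
    """Return [p * 2**k, ..., 2*p, p], largest first, covering values up to max_value."""
    multiples = [p]
    while multiples[-1] * 2 <= max_value:
        multiples.append(multiples[-1] * 2)
    multiples.reverse()
    return multiples

def reduce_with_table(x, multiples):
    """Reduce 0 <= x <= max_value modulo p by conditional subtraction."""
    for m in multiples:
        if x >= m:
            x -= m
    return x

# Assumption for this sketch: a and b are both below 2*p.
MULTIPLES = build_multiples((2 * P) ** 2)

a, b = 1_999_999_999, 1_234_567_891
assert reduce_with_table(a * b, MULTIPLES) == (a * b) % P
```

Each step halves the bound on the remaining value, so after the last entry (`p` itself) the result is below `p`.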

You mention that *“a, b are of the same order of magnitude as p.”* Often in cryptography this means that `a, b` are large numbers near `p`, but strictly less than `p`. If this is the case, then you could use the simple identity `a*b ≡ (a-p)*(b-p) (mod p)` to turn your calculation into

```
result = ((a-p)*(b-p))%p
```

You’ve then turned one large multiplication into two large subtractions and a small multiplication. You’ll have to profile to see which is faster.
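A quick way to convince yourself the identity holds is to compare both forms in Python. This is a minimal sketch; the function names `mulmod_naive` and `mulmod_shifted` are illustrative and not from any library.

```python
P = 1_000_000_007

def mulmod_naive(a, b, p=P):
    return (a * b) % p

def mulmod_shifted(a, b, p=P):
    # For a, b just below p, (a - p) and (b - p) are small negative numbers,
    # so their product is a small non-negative number.
    return ((a - p) * (b - p)) % p

a, b = P - 3, P - 5
print(mulmod_naive(a, b), mulmod_shifted(a, b))  # both print 15, i.e. (-3) * (-5)
```

The two forms agree for any integers `a, b`, because Python's `%` with a positive modulus always returns a value in `[0, p)`. The shifted form only saves work when `a` and `b` really are close to `p`.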

To do this calculation in assembly, but have it callable from Python, I’d try inline assembly from a Python module written in C. Both GCC and MSVC compilers feature inline assembly, only with differing syntax.

Note that our modulus `p = 1000000007` just fits into 30 bits. The desired result `(a*b)%p` can be computed in Intel 80x86 registers given some weak restrictions on `a, b` not being much bigger than `p`.

**Restrictions on size of a,b**

(1) `a, b` are 32-bit unsigned integers

(2) `a*b` is less than `p << 32`, i.e. `p` times 2^32

In particular, if `a, b` are each less than `2*p`, overflow will be avoided. Given (1), it also suffices that either one of them is less than `p`.
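The two sufficient conditions can be checked with plain integer arithmetic. The sketch below evaluates the worst case for each condition against the limit `p << 32`; the variable names are only for illustration.

```python
P = 1_000_000_007
LIMIT = P << 32            # restriction (2): a * b must stay below this

# p fits in 30 bits, so 4 * p < 2**32.
assert P < 2**30

# Worst case when both operands are below 2*p.
worst_both_below_2p = (2 * P - 1) ** 2

# Worst case when one operand is below p and the other is any 32-bit value.
worst_one_below_p = (P - 1) * (2**32 - 1)

print(worst_both_below_2p < LIMIT)  # True
print(worst_one_below_p < LIMIT)    # True
```

The first bound follows from `(2p)^2 = 4p * p < 2^32 * p`, the second from `(p - 1) * (2^32 - 1) < p * 2^32`.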

The Intel 80x86 instruction MUL can multiply two 32-bit unsigned integers and store the 64-bit result in accumulator register pair EDX:EAX. Some details and quirks of MUL are discussed in Section 10.2.1 of this helpful summary.

The instruction DIV can then divide this 64-bit result by a 32-bit constant (the modulus `p`), storing the quotient in EAX and the remainder in EDX. See Section 10.2.2 of the last link. The result we want is that remainder.

It is this division instruction DIV that entails a risk of overflow, should the 64-bit product in numerator EDX:EAX give a quotient larger than 32 bits by failing to satisfy (2) above.
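To make the register-level picture concrete, here is a Python simulation of the MUL/DIV sequence. It is only a model of the arithmetic, not the promised C/inline assembly snippet, and the name `mulmod_via_mul_div` is invented for this sketch. The overflow test uses the fact that the quotient fits in 32 bits exactly when the high half (EDX) is smaller than the divisor.

```python
P = 1_000_000_007
MASK32 = 0xFFFFFFFF

def mulmod_via_mul_div(a, b, p=P):
    """Model 32-bit MUL followed by 64-by-32-bit DIV; return the remainder."""
    if not (0 <= a <= MASK32 and 0 <= b <= MASK32):
        raise ValueError("restriction (1): a and b must be 32-bit unsigned")

    # MUL: the 64-bit product is split across EDX (high) and EAX (low).
    product = a * b
    edx, eax = product >> 32, product & MASK32

    # DIV faults if the quotient needs more than 32 bits, i.e. if EDX >= p.
    # This is the same condition as a * b >= p << 32 (restriction (2)).
    if edx >= p:
        raise OverflowError("restriction (2): quotient would not fit in EAX")

    # DIV: quotient goes to EAX, remainder to EDX.
    quotient, remainder = divmod((edx << 32) | eax, p)
    assert quotient <= MASK32
    return remainder

print(mulmod_via_mul_div(P - 1, P - 1))  # 1, since (-1) * (-1) = 1 mod p
```

Calling it with `a = b = 0xFFFFFFFF` raises `OverflowError`, which corresponds to the case where the real DIV instruction would fault.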

I’m working on a code snippet in C/inline assembly for “proof of concept”. However, the maximum benefit in speed will depend on batching up arrays of data `a, b` to process, amortizing the overhead of function calls, etc. in Python (if that is the target platform).

This doesn’t answer the question directly, but I would recommend not doing this in pure Python if you’re looking for performance. Some options:

- Make a small library in C that does your computations, and use Python’s `ctypes` to talk to it.
- Use numpy; probably the best option if you want to stay out of having to deal with compiling stuff yourself. Doing operations one at a time won’t be faster than Python’s own operators, but if you can put multiple ones in a numpy array, computations on them will be much faster than the equivalent in Python.
- Use cython to declare your variables as C integers; again, same as numpy, you will benefit from this the most if you do it in batches (because then you can also optimize the loop).

There may be a clue to the optimization if you clarified what you mean by *many* times. For example, if you were collecting the results from a high-frequency loop, the loop itself may offer the means to optimize your routine.

Say the unoptimized loop was:

```
p = 1000000007
b = 123456789
a = 0
while a < p:
    # one multiplication and one modulo per iteration
    result = (a * b) % p
    dosomething(a, b, result)  # placeholder for whatever consumes the result
    a += 1
```

You could optimise out the `*` and `%` from the high-frequency loop:

```
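p = 1000000007
b = 123456789
a = 0
result = (a * b) % p
while a < p:
    dosomething(a, b, result)  # placeholder for whatever consumes the result
    a += 1
    # (a + 1) * b == a * b + b, so advance the running result by b;
    # one conditional subtraction is enough because b < p
    result += b
    if result >= p:
        result -= p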
```
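As a sanity check, the two loops can be compared over a short range. In this sketch `dosomething` is replaced by collecting the results in a list, and the function names are illustrative only.

```python
P = 1_000_000_007

def naive_products(b, count, p=P):
    """(a * b) % p for a = 0 .. count-1, using * and % each time."""
    return [(a * b) % p for a in range(count)]

def incremental_products(b, count, p=P):
    """Same values, using one addition and one conditional subtraction per step."""
    step = b % p          # keep the step below p so one subtraction suffices
    result = 0
    out = []
    for _ in range(count):
        out.append(result)
        result += step
        if result >= p:
            result -= p
    return out

assert naive_products(123456789, 1000) == incremental_products(123456789, 1000)
```

This only applies when consecutive values of `a` are visited in order; for arbitrary pairs `a, b` there is no running result to reuse.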