Can someone explain to me this part of Dixon's factorization algorithm?

Question:

I’ve been trying to implement Dixon’s factorization method in python, and I’m a bit confused. I know that you need to give some bound B and some number N and search for numbers between sqrtN and N whose squares are B-smooth, meaning all their factors are in the set of primes less than or equal to B. My question is, given N of a certain size, what determines B so that the algorithm will produce non-trivial factors of N? Here is a wikipedia article about the algorithm, and if it helps, here is my code for my implementation:

def factor(N, B):
    def isBsmooth(n, b):
        factors = []
        for i in b:
            while n % i == 0:
                n = int(n / i)
                if not i in factors:
                    factors.append(i)
        if n == 1 and factors == b:
            return True
        return False

    factor1 = 1
    while factor1 == 1 or factor1 == N:
        Bsmooth = []
        BsmoothMod = []
        for i in range(int(N ** 0.5), N):
            if len(Bsmooth) < 2 and isBsmooth(i ** 2 % N, B):
                Bsmooth.append(i)
                BsmoothMod.append(i ** 2 % N)
        gcd1 = (Bsmooth[0] * Bsmooth[1]) % N
        gcd2 = int((BsmoothMod[0] * BsmoothMod[1]) ** 0.5)
        factor1 = gcd(gcd1 - gcd2, N)
        factor2 = int(N / factor1)
    return (factor1, factor2)

Maybe someone could help clean my code up a bit, too? It seems very inefficient.

Asked By: polarbits

||

Answers:

This article discusses the optimal size for B: https://web.archive.org/web/20160205002504/https://vmonaco.com/dixons-algorithm-and-the-quadratic-sieve/. Briefly, the optimal value is thought to be exp((logN loglogN)^(1/2)).

Answered By: vrume21

[ I wrote this for a different purpose, but you might find it interesting. ]

Given x2y2 (mod n) with x ≠ ± y, about half the time gcd(xy, n) is a factor of n. This congruence of squares, observed by Maurice Kraitchik in the 1920s, is the basis for several factoring methods. One of those methods, due to John Dixon, is important in theory because its sub-exponential run time can be proven, though it is too slow to be useful in practice.

Dixon’s method begins by choosing a bound be√(log n log log n) and identifying the factor base of all primes less than b that are quadratic residues of n (their jacobi symbol is 1).

function factorBase(n, b)
  fb := [2]
  for p in tail(primes(b))
    if jacobi(n, p) == 1
      append p to fb
  return fb

Then repeatedly choose an integer r on the range 1 < r < n, calculate its square modulo n, and if the square is smooth over the factor base add it to a list of relations, stopping when there are more relations than factors in the factor base, plus a small reserve for those cases that fail. The idea is to identify a set of relations, using linear algebra, where the factor base primes combine to form a square. Then take the square root of the product of all the factor base primes in the relations, take the product of the related r, and calculate the gcd to identify the factor.

struct rel(x, ys)

function dixon(n, fb, count)
  r, rels := floor(sqrt(n)), []
  while count > 0
    fs := smooth((r * r) % n, fb)
    if fs is not null
      append rel(r, fs) to rels
      count := count - 1
    r := r + 1
  return rels

A number n is smooth if all its factors are in the factor base, which is determined by trial division; the smooth function returns a list of factors, which is null if n doesn’t completely factor over the factor base.

function smooth(n, fb)
  fs := []
  for f in fb
    while n % f == 0
      append f to fs
      n := n / f
    if n == 1 return fs
  return []

A factor is determined by submitting the accumulated relations to the linear algebra of the congruence of square solver.

For example, consider the factorization of 143. Choose r = 17, so r2 ≡ 3 (mod 143). Then choose r = 19, so r2 ≡ 75 ≡ 3 · 52. Those two relations can be combined as (17 · 19)2 ≡ 32 · 52 ≡ 152 (mod 143), and the two factors are gcd(17·19 − 15, 143) = 11 and gcd(17·19 + 15, 143) = 13. This sometimes fails; for instance, the relation 212 ≡ 22 (mod 143) can be combined with the relation on 19, but the two factors produced, 1 and 143, are trivial.

Answered By: user448810

Thanks for very interesting question!

In pure Python I implemented from scratch Dixon Factorization Algorithm in 3 different flavors:

  1. Using simplest sieve. I’m creating u64 array with all numbers in range [N; N * 2), which signify z^2 value. This array hold result of multiplication of prime numbers. Then through sieving process I iterate all factor base prime numbers and do array[k] *= p in those k positions that are divisible by p. Finally when sieved array is ready I check both that a) array index k is a perfect square, b) and array[k] == k - N. Second b) condition means that all multiplied p primes give final number, this is only true if number is divisible only by factor-base primes, i.e. it is B-smooth. This is simplest and most slowest out of my 3 solutions.

  2. Second solution uses SymPy library to factorize every z^2. I iterate all possible z and do sympy.factorint(z * z), this gives factorization of z^2. If this factorization contains only small primes, i.e. from factor base, then I collect such z and z^2 for later processing. This version of algorithm is also slow, but much faster than first one.

  3. Third solution uses a kind of sieving used in Quadratic Sieve. This sieving process is fastest of all three algorithms. Basically what it does, it finds all roots of equation x^2 = N (mod p) for all primes in factor base, as I have just few primes root finding is done through simple loop through all variants, for bigger primes one can use Shanks Tonelli algorithm of finding root, which is really fast. Only around 50% of primes give a root solution at all, hence only half of primes are actually used in Quadratic Sieve. Roots of such equation can be used to generate lots of solutions at once, because root + k * p is also a valid solution for all k. Sieving is done through array[offset(root) :: p] += Log2(p). Here instead of multiplication of first algorithm I used adding a logarithm of prime. First it is a bit faster to add a number than to multiply. Secondly, what is more important is that it supports any size of number, e.g. even 256-bit. While multiplying is possible only till 64-bit number, because Numpy has no 128 or 256 bit integers support. After logartithms are added, I check which logarithms are equal to logarithm of original z^2 number, this numbers are final sieved numbers.

After all three algorithms above have sieved all z^2 then I do Linear Algebra stage through Gaussian Elemination algorithm. This stage is meant to find such combination of B-smooth z^2 numbers which after multiplication of their prime factors give final number with all EVEN prime powers.

Lets call a Relation a triple z, z^2, prime factors of z^2. Basically all relations are given to Gaussian Elemination stage, where even combinations are found.

Even powers of prime numbers give us equality a^2 = b^2 (mod N), from where we can get a factor by doing factor = GCD(a + b, N), here GCD is Greatest Common Divisor found through Euclidean Algorithm. This GCD sometimes gives trivial factors 1 and N, in this case other even combinations should be checked.

To be 100% sure to get even combinations I do Sieving stage till I find a bit more than amount of prime numbers amount of relations, actually around 105% of amount of prime numbers. This extra 5% of relations ensure us that we certainly will get dependent linear equations in Gaussian stage. All these dependent equation form even combinations.

Actually we need a bit more dependent equations, not just 1 more than amount of primes, but around 5%-10% more, only because some (50-60% of them as I can see experimentally) dependencies give only trivial factor 1 or N. Hence extra equations are needed.

Put a look at console output at the end of my post. This console output shows all the impressions from my program. There I run in parallel (multi-threaded) both 2nd (Sieve_B) and 3rd (Sieve_C) algorithms. 1st one (Sieve_A) is not run by my program because it is so slow that you’ll wait forever for it to finish.

At the very end of source file you can tweak variable bits = 64 to some other size, like bits = 96. This is amount of bits in composite number N. This N is created as a product of just two random prime numbers of equal size. Such a composite consisting of two equal in size primes is usually called RSA Number.

Also find B = 1 << 10, this tells degree of B-smoothness, basically factor base consists of all possible primes < B. You may increase this B limit, this will give more frequent answers of sieved z^2 hence whole factoring becomes much faster. The only limitation of huge size of B is Linear Algebra stage (Gaussian Elemination), because with bigger factor base you have to solve more linear equations of bigger size. And my Gauss is done not in very optimal way, for example instead of keeping bits as np.uint8 you may keep bits as dense np.uint64, this will increase Linear Algebra speed by 8x times more.

You may also find variable M = 1 << 23, which tells how large is sieving array size, in other words it is block size that is processed at once. Bigger block is a bit faster, but not much. Bigger values of M will not give much difference because it only tells what size of tasks sieving process is split into, it doesn’t influence any computation power. More than that bigger M will occupy more memory, so you can’t increases it infinitely, only till you have enough memory.

Besides all mentioned above algorithms I also used Fermat Primality Test, also Sieve of Eratosthenes (for generating prime factor base).

Plus also implemented my own algorithm of filtering square numbers. For this I take some composite modulus that looks close to Primorial, like mod = 2 * 2 * 2 * 3 * 3 * 5 * 7 * 11 * 13. And inside boolean array I mark all numbers modulus mod that are squares. Later when any number K should be checked if it is square or not I get flag_array[K % mod] and if it is True then number is "Possibly" squares, while if it is False then number is "Definitely" not square. Thus this filter gives false positives sometimes but never false negatives. This filter checking stage filters out 95% of non-squares, remaining 5% of possibly squares can be double-checked through math.isqrt().

Please, click below on Try it online! link, to test run my program on online server of ReplIt. This will give you best impression, especially if you have no Python or no personal laptop. My code below can be just run straight away after only PIP-installing python -m pip numpy sympy.

Try it online!

import threading

def GenPrimes_SieveOfEratosthenes(end):
    import numpy as np
    composites = np.zeros((end,), dtype = np.uint8)
    for p in range(2, len(composites)):
        if composites[p]:
            continue
        if p * p >= end:
            break
        composites[p * p :: p] = 1
    primes = []
    for p in range(2, len(composites)):
        if not composites[p]:
            primes.append(p)
    return np.array(primes, dtype = np.uint32)
    
def Print(*pargs, __state = (threading.RLock(),), **nargs):
    with __state[0]:
        print(*pargs, flush = True, **nargs)
    
def IsSquare(n, *, state = []):
    if len(state) == 0:
        import numpy as np
        Print('Pre-computing squares filter...')
        squares_filter = 2 * 2 * 2 * 3 * 3 * 5 * 7 * 11 * 13
        squares = np.zeros((squares_filter,), dtype = np.uint8)
        squares[(np.arange(0, squares_filter, dtype = np.uint64) ** 2) % squares_filter] = 1
        state.extend([squares_filter, squares])
    if not state[1][n % state[0]]:
        return False, None
    import math
    root = math.isqrt(n)
    return root ** 2 == n, root

def FactorRef(x):
    import sympy
    return dict(sorted(sympy.factorint(x).items()))

def CheckZ(z, N, primes):
    z2 = pow(z, 2, N)
    factors = FactorRef(z2)
    assert all(p <= primes[-1] for p in factors), (primes[-1], factors, N, z, z2)
    return z
    
def SieveSimple(N, primes):
    import time, math, numpy as np
    
    Print('Simple Sieve of B-smooth z^2...')
    
    sieve_block = 1 << 21
    rep0_time = 0
    for iiblock, iblock in enumerate(range(N, N * 2, sieve_block)):
        if time.time() - rep0_time >= 30:
            Print(f'Block {iiblock:>3} (2^{math.log2(max(iblock - N, 1)):>5.2f})')
            rep0_time = time.time()
        iblock_end = iblock + sieve_block
        sieve_arr = np.ones((sieve_block,), dtype = np.uint64)
        iblock_modN = iblock % N
        for p in primes:
            mp = 1
            while True:
                if mp * p >= sieve_block:
                    break
                mp *= p
                off = (mp - iblock_modN % mp) % mp
                sieve_arr[off :: mp] *= p
        for i in range(1 if iblock == N else 0, sieve_block):
            num = iblock + i
            z2 = num - N
            if sieve_arr[i] < z2:
                continue
            assert sieve_arr[i] == z2, (sieve_arr[i], round(math.log2(sieve_arr[i]), 3), z2)
            is_square, z = IsSquare(num)
            if not is_square:
                continue
            #Print('z', z, 'z^2', z2)
            yield CheckZ(z, N, primes)

def SieveFactor(N, primes):
    import math
    Print('Factor Sieve of B-smooth z^2...')
    for iz, z in enumerate(range(math.isqrt(N - 1) + 1, math.isqrt(N * 2 - 1) + 1)):
        z2 = z ** 2 - N
        assert 0 <= z2 and z2 < N, (z, z2)
        factors = FactorRef(z2)
        if any(p > primes[-1] for p in factors):
            continue
        #Print('iz', iz, 'z', z, 'z^2', z2, 'z^2 factors', factors)
        yield CheckZ(z, N, primes)
        
def BinarySearch(begin, end, Test):
    while begin + 1 < end:
        mid = (begin + end - 1) >> 1
        if Test(mid):
            end = mid + 1
        else:
            begin = mid + 1
    assert begin + 1 == end and Test(begin), (begin, end, Test(begin))
    return begin
    
def ModSqrt(n, p):
    n %= p
    def Ret(x):
        if pow(x, 2, p) != n:
            return []
        nx = (p - x) % p
        if x == nx:
            return [x]
        elif x <= nx:
            return [x, nx]
        else:
            return [nx, x]
    #if p % 4 == 3 and sympy.isprime(p):
    #    return Ret(pow(n, (p + 1) // 4, p))
    for i in range(p):
        if pow(i, 2, p) == n:
            return Ret(i)
    return []

def SieveQuadratic(N, primes):
    import math, numpy as np
    # https://en.wikipedia.org/wiki/Quadratic_sieve
    # https://www.rieselprime.de/ziki/Multiple_polynomial_quadratic_sieve
    M = 1 << 23
    def Log2I(x):
        return int(round(math.log2(max(1, x)) * (1 << 24)))
    def Log2IF(li):
        return li / (1 << 24)
    Print('Quadratic Sieve of B-smooth z^2...')
    plogs = {}
    for p in primes:
        plogs[int(p)] = Log2I(int(p))
    qprimes = []
    B = int(primes[-1]) + 1
    for p in primes:
        p = int(p)
        res = []
        mp = 1
        while True:
            if mp * p >= B:
                break
            mp *= p
            roots = ModSqrt(N, mp)
            if len(roots) == 0:
                if mp == p:
                    break
                continue
            res.append((mp, tuple(roots)))
        if len(res) > 0:
            qprimes.append(res)
    qprimes_lin = np.array([pinfo[0][0] for pinfo in qprimes], dtype = np.uint32)
    yield qprimes_lin
    
    Print('QSieve num primes', len(qprimes), f'({len(qprimes) * 100 / len(primes):.1f}%)')
    x_begin0 = math.isqrt(N - 1) + 1
    assert N <= x_begin0 ** 2
    for iblock in range(1 << 30):
        if (x_begin0 + (iblock + 1) * M) ** 2 - N >= N:
            break
        x_begin = x_begin0 + iblock * M
        if iblock != 0:
            Print('n', end = '')
        Print(f'Block {iblock} (2^{math.log2(max(1, x_begin ** 2 - N)):>6.2f})...')
        a = np.zeros((M,), np.uint32)
        for pinfo in qprimes:
            p = pinfo[0][0]
            plog = np.uint32(plogs[p])
            for imp, (mp, roots) in enumerate(pinfo):
                off_done = set()
                for root in roots:
                    for off in range(mp):
                        if ((x_begin + off) ** 2 - N) % mp == 0 and off not in off_done:
                            break
                    else:
                        continue
                    a[off :: mp] += plog
                    off_done.add(off)
        
        logs = np.log2(np.array((np.arange(M).astype(np.float64) + x_begin) ** 2 - N, dtype = np.float64))
        logs2if = Log2IF(a.astype(np.float64))
        logs_diff = np.abs(logs - logs2if)
        
        for ix in range(M):
            if logs_diff[ix] > 0.3:
                continue
            z = x_begin + ix
            z2 = z * z - N
            factors = FactorRef(z2)
            assert all(p <= primes[-1] for p, c in factors.items())
            #Print('iz', ix, 'z', z, 'z^2', z2, f'(2^{math.log2(max(1, z2)):>6.2f})', ', z^2 factors', factors)
            yield CheckZ(z, N, primes)
        
def LinAlg(N, zs, primes):
    import numpy as np
    Print('Linear algebra...')
    Print('Factoring...')
    m = np.zeros((len(zs), len(primes) + len(zs)), dtype = np.uint8)
    def SwapRows(i, j):
        t = np.copy(m[i])
        m[i][...] = m[j][...]
        m[j][...] = t[...]
    def MatToStr(m):
        s = 'n'
        for i in range(len(m)):
            for j in range(len(m[i])):
                s += str(m[i, j])
            s += 'n'
        return s[1:-1]
    for iz, z in enumerate(zs):
        z2 = z * z - N
        fs = FactorRef(z2)
        for p, c in fs.items():
            i = np.searchsorted(primes, p, 'right') - 1
            assert i >= 0 and i < len(primes) and primes[i] == p, (i, primes[i])
            m[iz, i] = (int(m[iz, i]) + c) % 2
        m[iz, len(primes) + iz] = 1
    Print('Gaussian elemination...')
    #Print(MatToStr(m)); Print()
    one_col, one_rows = 0, 0
    while True:
        while True:
            for i in range(one_rows, len(m)):
                if m[i, one_col]:
                    break
            else:
                one_col += 1
                if one_col >= len(primes):
                    break
                continue
            break
        if one_col >= len(primes):
            break
        assert m[i, one_col]
        assert np.all(m[i, :one_col] == 0)
        for j in range(len(m)):
            if i == j:
                continue
            if not m[j, one_col]:
                continue
            m[j][...] ^= m[i][...]
        SwapRows(one_rows, i)
        one_rows += 1
        one_col += 1
    assert np.all(m[one_rows:, :len(primes)] == 0)
    zeros = m[one_rows:, len(primes):]
    Print(f'Even combinations ({len(m) - one_rows}):')
    Print(MatToStr(zeros))
    return zeros
    
def ProcessResults(N, zs, la_zeros):
    import math
    Print('Computing final results...')
    factors = []
    for i in range(len(la_zeros)):
        zero = la_zeros[i]
        assert len(zero) == len(zs)
        cz = []
        for j in range(len(zero)):
            if not zero[j]:
                continue
            z = zs[j]
            z2 = z * z - N
            cz.append((z, z2, FactorRef(z2)))
        a = 1
        for z, z2, fs in cz:
            a = (a * z) % N
        cnts = {}
        for z, z2, fs in cz:
            for p, c in fs.items():
                cnts[p] = cnts.get(p, 0) + c
        cnts = dict(sorted(cnts.items()))
        b = 1
        for p, c in cnts.items():
            assert c % 2 == 0, (p, c, cnts)
            b = (b * pow(p, c // 2, N)) % N
        factor = math.gcd(a + b, N)
        Print('a', str(a).rjust(len(str(N))), '  b', str(b).rjust(len(str(N))), '  factor', factor if factor != N else 'N')
        if factor != 1 and factor != N:
            factors.append(factor)
    return factors

def SieveCollectResults(N, its):
    import time, threading, queue, traceback, math
    K = len(its)
    qs = [queue.Queue() for i in range(K)]
    last_dot, finish = False, False
    def Get(it, ty, need, compul):
        nonlocal last_dot, finish
        try:
            cnt = 0
            for iz, z in enumerate(it):
                if finish:
                    break
                if iz < 4:
                    z2 = z * z - N
                    Print(('n' if last_dot else '') + 'Sieve_' + ('C', 'B', 'A')[K - 1 - ty], '  iz', iz,
                        'z', z, 'z^2', z2, f'(2^{math.log2(max(1, z2)):>6.2f})', ', z^2 factors', FactorRef(z2))
                    last_dot = False
                else:
                    Print(('.', 'b', 'a')[K - 1 - ty], end = '')
                    last_dot = True
                qs[ty].put(z)
                cnt += 1
                if cnt >= need:
                    break
        except:
            Print(traceback.format_exc())
    thr = []
    for ty, (it, need, compul) in enumerate(its):
        thr.append(threading.Thread(target = Get, args = (it, ty, need, compul), daemon = True))
        thr[-1].start()
    for ithr, t in enumerate(thr):
        if its[ithr][2]:
            t.join()
    finish = True
    if last_dot:
        Print()
    zs = [[] for i in range(K)]
    for iq, q in enumerate(qs):
        while not qs[iq].empty():
            zs[iq].append(qs[iq].get())
    return zs

def DixonFactor(N):
    import time, math, numpy as np, sys
    B = 1 << 10
    primes = GenPrimes_SieveOfEratosthenes(B)
    Print('Num primes', len(primes), 'last prime', primes[-1])
    IsSquare(0)
    it = SieveQuadratic(N, primes)
    qprimes = next(it)
    zs = SieveCollectResults(N, [
        #(SieveSimple(N, primes), 3, False),
        (SieveFactor(N, primes), 3, False),
        (it, round(len(qprimes) * 1.06 + 0.5), True),
    ])[-1]
    la_zeros = LinAlg(N, zs, qprimes)
    fs = ProcessResults(N, zs, la_zeros)
    if len(fs) > 0:
        Print('Factored, factors', sorted(set(fs)))
    else:
        Print('Failed to factor! Try running program again...')

def IsPrime_Fermat(n, *, ntrials = 32):
    import random
    if n <= 16:
        return n in (2, 3, 5, 7, 11, 13)
    for i in range(ntrials):
        if pow(random.randint(2, n - 2), n - 1, n) != 1:
            return False
    return True

def GenRandom(bits):
    import random
    return random.randrange(1 << (bits - 1), 1 << bits)

def RandPrime(bits):
    while True:
        n = GenRandom(bits) | 1
        if IsPrime_Fermat(n):
            return n

def Main():
    import math
    bits = 64
    N = RandPrime(bits // 2) * RandPrime((bits + 1) // 2)
    Print('N to factor', N, f'(2^{math.log2(N):>5.1f})')
    DixonFactor(N)

if __name__ == '__main__':
    Main()

Console output:

N to factor 10086068308526249063 (2^ 63.1)
Num primes 172 last prime 1021
Pre-computing squares filter...
Quadratic Sieve of B-smooth z^2...
Factor Sieve of B-smooth z^2...
QSieve num primes 78 (45.3%)
Block 0 (2^ 32.14)...
Sieve_C   iz 0 z 3175858067 z^2 6153202727426 (2^ 42.48) , z^2 factors {2: 1, 29: 2, 67: 1, 191: 1, 487: 1, 587: 1}
Sieve_C   iz 1 z 3175859246 z^2 13641877439453 (2^ 43.63) , z^2 factors {31: 1, 61: 1, 167: 1, 179: 1, 373: 1, 647: 1}
Sieve_C   iz 2 z 3175863276 z^2 39239319203113 (2^ 45.16) , z^2 factors {31: 1, 109: 1, 163: 1, 277: 1, 311: 1, 827: 1}
Sieve_C   iz 3 z 3175867115 z^2 63623612174162 (2^ 45.85) , z^2 factors {2: 1, 29: 1, 41: 1, 47: 1, 61: 1, 127: 1, 197: 1, 373: 1}
.........................................................................
Sieve_B   iz 0 z 3175858067 z^2 6153202727426 (2^ 42.48) , z^2 factors {2: 1, 29: 2, 67: 1, 191: 1, 487: 1, 587: 1}
......
Linear algebra...
Factoring...
Gaussian elemination...
Even combinations (7):
01000000000000000000000000000000000000000000000000001100000000000000000000000000000
11010100000010000100100000010011100000000001001001001001011001000000110001010000000
11001011000101111100011111001011010011000111101000001001011000001111100101001110000
11010010010000110110101100110101000100001100010011100011101000100010011011001001000
00010110111010000010000010000111010001010010111001000011011011101110110001001100100
00000010111000110010100110001111010101001000011010110011101000110001101101100100010
10010001111111101100011110111110110100000110111011010001010001100000010100000100001
Computing final results...
a  9990591196683978238   b  9990591196683978238   factor 1
a   936902490212600845   b  3051457985176300292   factor 3960321451
a  1072293684177681642   b  8576178744296269655   factor 2546780213
a  1578121372922149955   b  1578121372922149955   factor 1
a  2036768191033218175   b  8049300117493030888   factor N
a  1489997751586754228   b  2231890938565281666   factor 3960321451
a  9673227070299809069   b  3412883990935144956   factor 3960321451
Factored, factors [2546780213, 3960321451]
Answered By: Arty