How can I improve performance of this Fortran code interfacing with Python?

Question

I am studying the possibility to interface Fortran code to be used in Python.
I know of f2py but, since I did not succeed in using it with external libraries (such as lapack), I reverted to use ctypes.
I have everything working, but I noticed this behavior: for the same algorithm (a simple multiplication of two matrices element-wise) the Fortran code is slower than numpy, while I expected it to be faster.
I tried with two different PC, one with Windows 10 and the other with Windows 11.
I used gfortran from MinGW-64 installed through winlibs.com version gcc-12.2.0-llvm-15.0.7-mingw-w64ucrt-10.0.0-r4, my Python version is 3.9.13

The measured times with timeit are (for matrices 1000×1000)

With Numpy: 2.77 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With Fortran: 12.4 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The Fortran code is in a file "libfortran3.f90"

subroutine prodotto(a, b, n) bind(c, name='prodotto')
  ! declarations
  use iso_c_binding, only: c_double, c_int

  integer(c_int), intent(in) :: n
  real(c_double), dimension(n,n), intent(in) :: b
  real(c_double), dimension(n,n), intent(inout) :: a
  
  do i = 1,n
    do j = 1,n
        a(i,j) = a(i,j)*b(i,j)
    end do
  end do
  
end subroutine prodotto

and it is compiled with gfortran -O3 -funroll-loops -ffast-math -fPIC -shared -o libfortran3.so libfortran3.f90.
Please note that I used the same matrix for input and output so I do not have to pass another matrix for the result (since I expect that this worsen the problem).

The Python code (run in a notebook) is

from ctypes import CDLL, POINTER, c_int, c_double
import numpy as np
import time
import sys

fortran = CDLL('./libfortran3.so')
fortran.prodotto.argtypes = [ POINTER(c_double), 
                                 POINTER(c_double), 
                                 POINTER(c_int)]
fortran.prodotto.restype = None
#
N = 10
A = np.random.rand(N,N)
B = np.random.rand(N,N)
#
print('With Numpy')
%timeit A*B
#
print('With Fortran')
A = np.asfortranarray(A, dtype=c_double)
B = np.asfortranarray(B, dtype=c_double)
Act = A.ctypes.data_as(POINTER(c_double))
Bct = B.ctypes.data_as(POINTER(c_double))
%timeit fortran.prodotto( Act, Bct, c_int(N) )

Of course my Fortran code is not as optimized as Numpy, but I would like to know if there is something which I could modify in order to achieve better performances.

Edit April, 3rd, 2023

Thanks to mkrieger1 and janneb, indeed I made a poor decision in the nested loop! Switching i with j in "libfortran3.f90" gives me this timing (with matrices 1000×1000):

With Numpy: 2.67 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With Fortran: 1.01 ms ± 34.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I also tried with f2py, with the following code in file "libfortran1.f90"

subroutine prodotto(a, b, c, n) 
  ! declarations

  integer, intent(in) :: n
  real(kind=8), dimension(n,n), intent(in) :: a, b
  real(kind=8), dimension(n,n), intent(out) :: c
  
  do j = 1,n
    do i = 1,n
        c(i,j) = a(i,j)*b(i,j)
    end do
  end do
  
end subroutine prodotto

compiled with python -m numpy.f2py --compiler=mingw32 --fcompiler=gfortran -c libfortran1.f90 -m libfortran1, running it in the notebook gives me:

With f2py: 12.3 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

which is way slower than pure numpy.

Just for completeness, if I try to compile this fortran code which uses Lapack (which I built on my system) in "libfortran0.f90"

subroutine risolvi(A,b,x,n) bind(c, name='risolvi')
  ! declarations
  use iso_c_binding, only: c_double, c_int

  integer(c_int), intent(in) :: n
  real(c_double), dimension(n,n), intent(inout) :: A
  real(c_double), dimension(n), intent(inout) :: b
  real(c_double), dimension(n), intent(out) :: x
  integer(c_int), dimension(n) :: pivot(n)
  integer(c_int) :: ok
  ! call LAPACK subroutine
  call DGESV(n, 1, A, n, pivot, b, n, ok) 
  x = b
  
end subroutine risolvi

compiled with gfortran -O3 -funroll-loops -ffast-math -fPIC -shared -L. -lblas -llapack -o libfortran0.so libfortran0.f90, and I compare with numpy

from ctypes import CDLL, POINTER, c_int, c_double
import numpy as np
import time
import sys
import libfortran1

fortran = CDLL('./libfortran0.so')
fortran.risolvi.argtypes = [ POINTER(c_double), 
                                 POINTER(c_double), 
                                 POINTER(c_double), 
                                 POINTER(c_int)]
fortran.risolvi.restype = None
#
N = 1000
A = np.random.rand(N,N)
b = np.random.rand(N)
#
print('With Numpy')
%timeit np.linalg.solve(A,b)
#
print('With Fortran')
A = np.asfortranarray(A, dtype=c_double)
b = np.asfortranarray(b, dtype=c_double)
x = np.asfortranarray(np.zeros(N), dtype=c_double)
Act = A.ctypes.data_as(POINTER(c_double))
bct = b.ctypes.data_as(POINTER(c_double))
xct = x.ctypes.data_as(POINTER(c_double))
fortran.risolvi( Act, bct, xct, c_int(N) )
%timeit fortran.risolvi( Act, bct, xct, c_int(N) )

I obtain the following timimgs (with N=1000)

With Numpy: 13.6 ms ± 596 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With Fortran: 221 ms ± 9.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Anyway, with N=30 the use of Fortran is faster

With Numpy: 15.6 µs ± 244 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

With Fortran: 11.4 µs ± 237 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Is it possible that there is something wrong in my code when passing large arrays?
Or is just that, as I read, "ctypes is made for compatibility, not speed"?

Asked By: Federico Cluni

||

Source

Answer 1

As mentioned in a comment, you’re looping the wrong way in your Fortran code, as Fortran is column-major. That being said, in this case since it’s an element by element multiplication, you can just replace the nested do loop with a simple

a = a * b

Secondly, N=10 is quite short, and the extra overhead to call an external function obscures the performance of the actual code.

Setting N=100 and the replacement of the manual loop with the array multiplication, I get

With Numpy
5.00679704999493
With Fortran
3.699547360010911

Answered By: janneb

How can I improve performance of this Fortran code interfacing with Python?

Question:

Answers: