How can numpy be so much faster than my Fortran routine?


Your Fortran implementation suffers from two major shortcomings:

  • You mix I/O and computation (and read the file entry by entry).
  • You don't use vector/matrix operations.

This implementation performs the same operation as yours and is faster by a factor of 20 on my machine:

program test
  integer gridsize,unit
  real mini,maxi,mean
  real, allocatable :: tmp (:,:,:)

  gridsize=512
  unit=40
  allocate( tmp(gridsize, gridsize, gridsize))

  open(unit=unit,file='T.out',status='old',access='stream',&
       form='unformatted',action='read')
  read(unit=unit) tmp
  close(unit=unit)

  mini = minval(tmp)
  maxi = maxval(tmp)
  mean = sum(tmp)/gridsize**3

  print *, mini, maxi, mean
end program

The idea is to read the whole file into one array tmp in one go. Then I can use the intrinsic functions MAXVAL, MINVAL, and SUM on the array directly.


For the accuracy issue: simply using double precision values and doing the conversion on the fly, as in

mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))

only marginally increases the calculation time. I also tried performing the operation element-wise and in slices, but at the default optimization level that only increased the required time.

At -O3, the element-wise addition performs about 3% better than the array operation. The difference between double and single precision operations is less than 2% on my machine on average (the individual runs deviate by far more).
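For concreteness, here is a sketch of the element-wise variant I tried, assuming the tmp array and gridsize from the program above (dmean, i, j, and k are additional locals):

double precision :: dmean
integer :: i, j, k

! Accumulate the sum in double precision, one element at a time,
! and divide once at the end.
dmean = 0.d0
do k = 1, gridsize
  do j = 1, gridsize
    do i = 1, gridsize
      dmean = dmean + real(tmp(i,j,k), kind=kind(1.d0))
    end do
  end do
end do
dmean = dmean / real(gridsize**3, kind=kind(1.d0))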


Here is a very fast implementation using LAPACK:

program test
  integer gridsize,unit, i, j
  real mini,maxi
  integer  :: t1, t2, rate
  real, allocatable :: tmp (:,:,:)
  real, allocatable :: work(:)
!  double precision :: mean
  real :: mean
  real :: slange

  call system_clock(count_rate=rate)
  call system_clock(t1)

  gridsize=512
  unit=40
  allocate( tmp(gridsize, gridsize, gridsize), work(gridsize))

  open(unit=unit,file='T.out',status='old',access='stream',&
       form='unformatted',action='read')
  read(unit=unit) tmp
  close(unit=unit)

  mini = minval(tmp)
  maxi = maxval(tmp)
!  mean = sum(tmp)/gridsize**3
!  mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))
  mean = 0.d0
  do j=1,gridsize
    do i=1,gridsize
      mean = mean + slange('1', gridsize, 1, tmp(:,i,j), gridsize, work)
    enddo !i
  enddo !j
  mean = mean / gridsize**3

  print *, mini, maxi, mean

  call system_clock(t2)
  print *,real(t2-t1)/real(rate)
end program

This uses the single precision matrix 1-norm SLANGE on the matrix columns. The run time is even faster than the approach using the single precision array functions, and it does not show the precision issue, presumably because the per-column partial sums keep the round-off growth in check. (Note that the 1-norm sums absolute values, so the result equals the plain mean only for non-negative data.)
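To build this you have to link against a LAPACK library; with gfortran and a system LAPACK that is typically something like gfortran -O3 test.f90 -llapack (adjust for your compiler and installation).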


numpy is faster because you wrote much more efficient code in Python (and much of the numpy backend is written in optimized Fortran and C) and terribly inefficient code in Fortran.

Look at your Python code: you load the entire array at once and then call functions that operate on the whole array.

Look at your Fortran code: you read one value at a time and do some branching logic with it.

The majority of the discrepancy comes from the fragmented I/O you have written in Fortran.

You can write the Fortran just about the same way you wrote the Python, and you'll find it runs much faster that way.

program test
  implicit none
  integer :: gridsize, unit
  real :: mini, maxi, mean
  real, allocatable :: array(:,:,:)

  gridsize=512
  allocate(array(gridsize,gridsize,gridsize))
  unit=40

  open(unit=unit, file='T.out', status='old', access='stream',&
       form='unformatted', action='read')
  read(unit) array

  maxi = maxval(array)
  mini = minval(array)
  mean = sum(array)/size(array)

  close(unit)
end program test