 Speeding up a linear transform using parallel Cython

I need to speed up the calculation of a linear transform which is roughly of the following form:

import numpy as np

N=10000
input=np.random.random(N)

x=np.linspace(0,100,N)
y=np.linspace(0,30,N)
X,Y=np.meshgrid(x,y,sparse=True)

output=np.dot(np.cos(X*Y),input)

That is, I evaluate the cosine on a regular grid and multiply my input by the resulting matrix. In reality, the kernel function (here, the cosine) is more complicated, in particular it is not periodic. Hence, no simplification of FFT-type is possible!

The above transform takes about 5 seconds on my multi-core machine. Now, I definitely need to speed this up. A simple first try is to use numexpr:

import numpy as np
import numexpr as ne

N=10000
input=np.random.random(N)

x=np.linspace(0,100,N)
y=np.linspace(0,30,N)
X,Y=np.meshgrid(x,y,sparse=True)

output=np.dot(ne.evaluate('cos(X*Y)'),input)

This makes use of parallel computing and reduces the execution time to about 0.9 seconds. This is quite nice, but not sufficient for my purpose. So, my next try is to employ parallel Cython:

import numpy as np
from cython.parallel import prange

cimport numpy as np
cimport cython
from libc.math cimport cos

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def transform(double[:] x, double[:] y, double[:] input):

cdef unsigned int N = x.shape
cdef double[:] output = np.zeros(N)
cdef unsigned int row, col

for row in prange(N, nogil= True):
for col in range(N):
output[row] += cos(x[row]*y[col])*input[col]

return output

I compile this by executing

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules=[
Extension("cythontransform",
["cythontransform.pyx"],
libraries=["m"],
extra_compile_args = ["-O3", "-ffast-math", "-march=native", "-fopenmp" ],
)
]

setup(
name = "cythontransform",
cmdclass = {"build_ext": build_ext},
ext_modules = ext_modules
)

from the command line. Calling the transform via

import numpy as np
from cythontransform import transform

N=10000
input=np.random.random(N)

x=np.linspace(0,100,N)
y=np.linspace(0,30,N)

output=transform(x,y,input)

yields a rather weak improvement, giving roughly 0.7 seconds.

Is someone aware of the possibility of further improvements of the Cython code?

Or, alternatively, is there some other framework (PyOpenCL, Pythran, Numba, ...) that is better suited to this problem?

On my laptop, the following pythran version:

#pythran export transform(float64[], float64[], float64[])
import numpy as np

def transform(x, y, input):
N = x.shape
output = np.zeros(N)

#omp parallel for
for row in range(N):
for col in range(N):
output[row] += np.cos(x[row]*y[col])*input[col]
return output

Compiled with

pythran python -Ofast dd.py -fopenmp

Runs roughly two times faster than the cython version you propose. I did not investigate why this happens though...