I need to speed up the calculation of a linear transform which is roughly of the following form:

```
import numpy as np
N=10000
input=np.random.random(N)
x=np.linspace(0,100,N)
y=np.linspace(0,30,N)
X,Y=np.meshgrid(x,y,sparse=True)
output=np.dot(np.cos(X*Y),input)
```

That is, I evaluate the cosine on a regular grid and multiply my input by the resulting matrix. In reality, the kernel function (here, the cosine) is more complicated, in particular it is not periodic. Hence, no simplification of FFT-type is possible!

The above transform takes about 5 seconds on my multi-core machine. Now, I definitely need to speed this up. A simple first try is to use numexpr:

```
import numpy as np
import numexpr as ne
N=10000
input=np.random.random(N)
x=np.linspace(0,100,N)
y=np.linspace(0,30,N)
X,Y=np.meshgrid(x,y,sparse=True)
output=np.dot(ne.evaluate('cos(X*Y)'),input)
```

This makes use of parallel computing and reduces the execution time to about 0.9 seconds. This is quite nice, but not sufficient for my purpose. So, my next try is to employ parallel Cython:

```
import numpy as np
from cython.parallel import prange
cimport numpy as np
cimport cython
from libc.math cimport cos
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def transform(double[:] x, double[:] y, double[:] input):
cdef unsigned int N = x.shape[0]
cdef double[:] output = np.zeros(N)
cdef unsigned int row, col
for row in prange(N, nogil= True):
for col in range(N):
output[row] += cos(x[row]*y[col])*input[col]
return output
```

I compile this by executing

```
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
ext_modules=[
Extension("cythontransform",
["cythontransform.pyx"],
libraries=["m"],
extra_compile_args = ["-O3", "-ffast-math", "-march=native", "-fopenmp" ],
extra_link_args=['-fopenmp']
)
]
setup(
name = "cythontransform",
cmdclass = {"build_ext": build_ext},
ext_modules = ext_modules
)
```

from the command line. Calling the transform via

```
import numpy as np
from cythontransform import transform
N=10000
input=np.random.random(N)
x=np.linspace(0,100,N)
y=np.linspace(0,30,N)
output=transform(x,y,input)
```

yields a rather weak improvement, giving roughly 0.7 seconds.

Is someone aware of the possibility of further improvements of the Cython code?

Or, alternatively, is there some other framework (PyOpenCL, Pythran, Numba, ...) that is better suited to this problem?

On my laptop, the following pythran version:

```
#pythran export transform(float64[], float64[], float64[])
import numpy as np
def transform(x, y, input):
N = x.shape[0]
output = np.zeros(N)
#omp parallel for
for row in range(N):
for col in range(N):
output[row] += np.cos(x[row]*y[col])*input[col]
return output
```

Compiled with

```
pythran python -Ofast dd.py -fopenmp
```

Runs roughly two times faster than the cython version you propose. I did not investigate why this happens though...

Posted on by serge-sans-paille