How to speed up multiple broadcasts in Julia

This Julia function seems to be quite inefficient (an order of magnitude slower than the equivalent Pythran / C++ code, even after the Julia warmup)...

function my_multi_broadcast(a)
    10 * (2*a.^2 + 4*a.^3) + 2 ./ a
end

arr = ones(1000, 1000)

I guess I am just not writing it correctly... How can one speed up such "multi broadcasts" in Julia? I guess/hope I don't need to expand the loops...

Thank you! With my setup, the Pythran solutions (in place and out of place) are still 1.5 to 2 times faster (without OpenMP). Is there a way to activate SIMD instructions in Julia? Or another way to speed up such CPU computations?

The Python code:

from transonic import jit

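@jit
def broadcast(a):  # function name assumed; the original def line is missing
    return 10 * (2*a**2 + 4*a**3) + 2 / a

@jit
def broadcast_inplace(a):  # function name assumed; the original def line is missing
    a[:] = 10 * (2*a**2 + 4*a**3) + 2 / a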

Edit after the @simd suggestion

It seems that @simd does not work out of the box, i.e. just by adding it at the beginning of the line.

Stacktrace:
 compile(::Expr, ::Bool) at ./simdloop.jl:54
 @simd(::LineNumberNode, ::Module, ::Any) at ./simdloop.jl:126
 include at ./boot.jl:317 [inlined]
 include(::Module, ::String) at ./sysimg.jl:29
 exec_options(::Base.JLOptions) at ./client.jl:231
 _start() at ./client.jl:425

I guess that one would have to expand the for loops, but then the code (i) becomes much less readable and (ii) is no longer independent of the dimension.
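For what it's worth, the stack trace above points into simdloop.jl, and @simd is a loop annotation: it expects a for loop rather than a broadcast expression. A loop over eachindex would stay independent of the number of dimensions. A minimal sketch, where the name my_multi_broadcast_loop! and the preallocated out argument are assumptions made for illustration (not something from the original code):

```julia
# Hypothetical helper: explicit loop variant writing into a preallocated `out`.
function my_multi_broadcast_loop!(out, a)
    # eachindex(out, a) iterates over all elements whatever the number of
    # dimensions, and throws if the two arrays do not share the same indices.
    @inbounds @simd for i in eachindex(out, a)
        x = a[i]
        out[i] = 10 * (2*x^2 + 4*x^3) + 2 / x
    end
    return out
end

a = rand(4, 5, 6) .+ 1.0   # 3-D input, shifted away from zero because of 2 / x
out = similar(a)
my_multi_broadcast_loop!(out, a)
```

Whether this is actually faster than a fused broadcast depends on the machine and the Julia version, so it is worth benchmarking both.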

So this seems to be a case where simple Python/NumPy code accelerated with Pythran runs faster than the Julia equivalent, unless there is a way to accelerate this in Julia (or a future Julia version solves it). Interesting...

function my_multi_broadcast2(a)
    @. 10 * (2*a^2 + 4*a^3) + 2 / a
end
my_multi_broadcast2 (generic function with 1 method)

The difference is that in 10 * (2*a.^2 + 4*a.^3) + 2 ./ a you do not actually take advantage of broadcast fusion, because the * and + operations are not dotted (only .^ and ./ are), so each of them allocates a temporary array.

Writing @. 10 * (2*a^2 + 4*a^3) + 2 / a is equivalent to 10 .* (2 .* a.^2 .+ 4 .* a.^3) .+ 2 ./ a.

And here is the comparison of performance (the original version first, then the fused one):

58.146 ms (18 allocations: 61.04 MiB)

5.982 ms (4 allocations: 7.63 MiB)
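The allocation figures above look consistent with the unfused version materializing eight full-size temporaries (61.04 MiB is about 8 × 7.63 MiB) against a single output array for the fused one. The sketch below is one way to check the equivalence and the allocation gap using only Base (no benchmarking package); the function names unfused, fused_macro, and fused_dots are hypothetical and chosen for illustration:

```julia
# Hypothetical names, for illustration only.
unfused(a) = 10 * (2*a.^2 + 4*a.^3) + 2 ./ a               # only .^ and ./ are dotted
fused_macro(a) = @. 10 * (2*a^2 + 4*a^3) + 2 / a           # @. dots every call
fused_dots(a) = 10 .* (2 .* a.^2 .+ 4 .* a.^3) .+ 2 ./ a   # same thing, written out

a = rand(200, 200) .+ 1.0   # shifted away from zero because of 2 / a

# Warm up so that compilation is not counted in the measurements below.
unfused(a); fused_macro(a); fused_dots(a)

# All three variants should agree up to floating-point rounding.
@assert fused_macro(a) ≈ unfused(a)
@assert fused_dots(a) ≈ unfused(a)

# Compare the bytes allocated by one call of each version.
bytes_unfused = @allocated unfused(a)
bytes_fused = @allocated fused_macro(a)
println("unfused: ", bytes_unfused, " bytes, fused: ", bytes_fused, " bytes")
```

The exact byte counts vary with the Julia version, but the unfused call should allocate several times more than the fused one.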

How does it compare to Pythran / C++, as we get roughly 10x speedup?

Finally note that if you could mutate arr in place by writing: