Optimising Cython code

There was a request on the Cython mailing list on how to optimise Cython code. Here’s how I do it.

We have some Cython code that we want to benchmark:

x = 1
for i from 0 <= i < MAX:
    x = x + i

Ok, obviously this is stupid code, as this can be done much easier without a loop. But let’s say for the sake of argument that this is the best algorithm that we can come up with, and that we have a suspicion that it might not run as fast as we think it should.

First thing to do is to make that suspicion evidence by benchmarking. So I copy the code over to a Cython module and wrap it in a Python function:

# file: bench.pyx
def run(max):
    x = 1
    for i from 0 <= i < max:
        x = x + i
    return x

Now we compile the file:

# cython bench.pyx
# gcc -shared $CFLAGS -I/usr/include/python2.5 -o bench.so bench.c

And run it through Python’s great timeit module:

# python -m timeit -s 'from bench import run' 'run(100)'
1000000 loops, best of 3: 5.93 usec per loop

This looks exceedingly long-running to me ;)

Since I have no idea what to do better, I first look trough the generated C code. That’s not as hard as it sounds, as Cython copies the original Cython code into comments and marks the line that it generates code for. The loop code gets translated into this:

  /* ".../TEST/bench.pyx":3
 * def run(max):
 *     x = 1
 *     for i from 0 <= i < max:             # <<<<<<<<<<<<<<
 *         x = x + i
 *     return x
 */
  __pyx_2 = __pyx_PyInt_long(__pyx_v_max); if /* ... */
  for (__pyx_1 = 0; __pyx_1 < __pyx_2; __pyx_1++) {
    __pyx_3 = PyInt_FromLong(__pyx_1); if /* ... */
    Py_DECREF(__pyx_v_i);
    __pyx_v_i = __pyx_3;
    __pyx_3 = 0;

    /* ".../TEST/bench.pyx":4
 *     x = 1
 *     for i from 0 <= i < max:
 *         x = x + i             # <<<<<<<<<<<<<<
 *     return x
 */
    __pyx_3 = PyNumber_Add(__pyx_v_x, __pyx_v_i); if /* ... */
    Py_DECREF(__pyx_v_x);
    __pyx_v_x = __pyx_3;
    __pyx_3 = 0;
  }

The code I stripped (/* ... */) is error handling code. It’s emitted in one long line so that it’s easy to ignore - which is the best thing to do with it.

What you can see here is that Cython is smart enough to optimise the loop into a C loop with a C run variable (type long), but then the unsuspiciously looking operator ‘+’ uses Python API calls, so this is not what I had in mind when I wrote the code. I wanted it to be as fast and straight forward as it looks in Cython. However, Cython cannot know my intention here, as my code might as well depend on the semantics of Python’s ‘+’ operator (which is different from the ‘+’ operator in C).

Cython’s way of dealing with Python/C type ambiguity is explicit static type declarations through cdefs. By default, all variables are defined as if I had written cdef object variable, but in this case, I want them to be plain C integers. So here is the straight forward way to tell Cython that I want the variables x and i to have C semantics rather than Python semantics:

# file: bench.pyx
def run(max):
    cdef int i,x
    x = 1
    for i from 0 <= i < max:
        x = x + i
    return x

And the resulting C code shows me that Cython understood what I wanted:

  /* ".../TEST/bench.pyx":4
 *     cdef int i,x
 *     x = 1
 *     for i from 0 <= i < max:             # <<<<<<<<<<<<<<
 *         x = x + i
 *     return x
 */
  __pyx_1 = __pyx_PyInt_int(__pyx_v_max); if /* ... */
  for (__pyx_v_i = 0; __pyx_v_i < __pyx_1; __pyx_v_i++) {

    /* ".../TEST/bench.pyx":5
 *     x = 1
 *     for i from 0 <= i < max:
 *         x = x + i             # <<<<<<<<<<<<<<
 *     return x
 */
    __pyx_v_x = (__pyx_v_x + __pyx_v_i);
  }

Now timeit gives me something like this:

# python -m timeit -s 'from bench import run' 'run(100)'
1000000 loops, best of 3: 0.284 usec per loop

That’s about a factor of 20 compared to the original example. And that’s definitely enough for today.

2 Responses to “Optimising Cython code”

  1. robertwb Says:

    I bet if you did run(1000) or run(10000) you would get much more than a 20x speedup… Of your 0.284 microseconds per loop, 0.25+ of that is in making the python function call.

  2. Stefan Behnel Says:

    Sure, good point. Anyway, it’’s pretty far from a real-world example, and the speed-up that users get for their own code will vary heavily. This is just to show them how to get more speed out of Cython when they need it.

Leave a Reply