Cython C++ wrapping benchmarks

A couple of weeks ago, I found the C++ wrapping benchmarks on the PyBindGen homepage. The author posted a short blog intro about them. I wondered why Cython wasn't used as a comparison at the time, until I found out that the wrapper was rather tricky to write in Cython back then due to the lack of good C++ language support (especially for overloaded functions/methods).

Cython has improved its C++ support considerably since then, due to the work of Danilo Freitas and Robert Bradshaw, which was recently merged into mainline and is now scheduled for Cython 0.13. This allowed me to provide a simple and short implementation of the wrapper module used in the above benchmark. The timings are rather unsurprising: Cython beats them all. This is mainly due to the fact that Cython uses highly optimised argument handling code, which greatly reduces the call overhead of a wrapper.

I also like how readable the Cython wrapper code is, especially compared to the rather unwieldy PyBindGen implementation. Obviously, this comparison is a bit unfair because Cython is a programming language with an optimising compiler, whereas the other tools are simply glue code generators. But the benchmark results certainly speak volumes.

A faster Python implementation? Think Cython!

I noticed that there is a whole set of talks on alternative Python implementations at EuroPython this year. From a quick glance, there seems to be mainly one missing: Cython.

"What?", I hear you think, "Cython? Isn't that just a tool for extending CPython?". Well, yes, it is a tool for extending CPython. However, when you think about it the other way round, it actually is a Python implementation that only falls back to CPython for stuff that it doesn't want to do itself, or that it doesn't support yet. Everything else runs in plain C code and only uses parts of CPython that are not worth reimplementing, namely the object model and implementation, the fast container types, and the standard library. It will switch to CPython's eval loop only for Python modules that are not compiled.

Cython even has an on-the-fly compilation mode (pyximport) that can be used to compile Python modules (e.g. standard library modules or external dependencies) into fast C modules transparently on import. This is basically a JIT compiler that automatically falls back to CPython's byte code interpretation if the compilation fails for some reason.

The dependency on CPython (any version from 2.3 to 3.1) has many advantages for Python users. One is that you get 100% Python compatibility by definition, as CPython is always a part of Cython. This includes the complete standard library, all existing Python software, and all existing C extensions, with which you can sometimes even interact directly at the fast C level (e.g. NumPy, lxml.etree and others). Apart from CPython itself, there is no other Python implementation that currently achieves this.

On top of that, it's trivial to optimise pure Python code into type annotated Cython code (even in pure Python syntax) to speed up certain code sections by factors of several hundred (1000 times and more is not unheard of). Running cython -a will generate a highlighted HTML representation of your code that shows where type annotations may lead to a speedup. There is no need to change all your code to get that speedup, just concentrate on exactly those sections that need raw speed - usually inner loops and tight algorithms. Or just call into a C, C++ or Fortran library that does the job fast enough already, even if you are not an expert in that language.

And another really cool feature: using Cython will let your code benefit from enhancements and optimisations in both CPython and Cython. Whenever either of the two projects finds a way to make the built-in types or the generated C code faster, it's your code that will become faster. Whenever someone writes a new module or extension for CPython, you can just import it without fearing compatibility issues. Whenever the Python language or the Cython language adds a new syntax feature, you can start using it right away, without waiting for other implementations to catch up. And we do have tons of ideas about stunning features and optimisations that we want to add to the Cython compiler.

So, you can either sit and wait for your code to get optimised for you, or you can get your own hands dirty now and join a very dynamic, open and friendly project that constantly makes Cython faster, better and simpler to use.

lxml is Google's Top-5 for "elementtree"!

I just noticed that lxml reached rank 5 on Google when you search for "elementtree", just after two links for ElementTree itself and another two for the Python standard library, so it's more of a rank 3!

Yahoo sees us within the top 10.

MSN at least underlines the ElementTree compatibility of lxml in its top 5, although it doesn't consider our homepage good enough for rank 1 when you ask it for "lxml".

But then again, Microsoft isn't the first place to ask for open source software anyway...

Stupidity laid to rest

Munich is not getting a Transrapid. With that, this story of Stoiber's craving for recognition and of defrauding the taxpayer has hopefully been shelved for good.

One question remains: if even mini-routes like the one between Munich's main station and the airport are now being seriously considered, what actually speaks against a route Lisboa-Madrid-Bordeaux-Paris-Bruxelles-Köln-Hannover-Berlin-Poznań-Warszawa-Kaunas-Riga-Tallinn? Optionally with an extension to St. Petersburg. Wouldn't that be a route that a high-speed train could sensibly serve? In, let's say, twelve hours for the roughly 4400 kilometres? The only catch is that the AGV, as the fastest rail vehicle, would probably still overtake the Transrapid there in terms of speed (and presumably cost as well). But of course I wouldn't object to a corresponding AGV line either...

Roman Herzog's fear for the CSU

In the Süddeutsche Zeitung, Roman Herzog mourns the good old days of the four-party republic. That wonderful time when the world was still in order, governing majorities were still stable, and the CSU was still a world power. And when the evil Left was not yet snatching votes away from the big parties.

Well, Mr Herzog, instead of immediately crying out for a new electoral law, how about simply asking your own party to end the trench warfare and to trade its good-versus-evil camp mentality for issue-oriented politics? That would surely move this country further forward than artificially bringing about one majority or another through electoral reform.

Optimising Cython code

There was a request on the Cython mailing list on how to optimise Cython code. Here's how I do it.

We have some Cython code that we want to benchmark:

x = 1

for i from 0 

Ok, obviously this is stupid code, as this can be done much more easily without a loop. But let's say for the sake of argument that this is the best algorithm that we can come up with, and that we have a suspicion that it might not run as fast as we think it should.

The first thing to do is to turn that suspicion into evidence by benchmarking. So I copy the code over to a Cython module and wrap it in a Python function:
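To illustrate the "without a loop" remark, here is a small pure Python sketch. The loop body in the listing above is cut off, so this assumes (and it is only an assumption) that the loop simply adds every i below max to x. The function names run_loop and run_closed_form are made up for this illustration.

```python
def run_loop(max_value):
    # Assumed loop body: start at 1 and add every i in [0, max_value).
    x = 1
    for i in range(max_value):
        x = x + i
    return x


def run_closed_form(max_value):
    # Same result without a loop, using the Gauss sum
    # 0 + 1 + ... + (n - 1) == n * (n - 1) // 2.
    if max_value <= 0:
        # The loop above does not execute at all in this case.
        return 1
    return 1 + max_value * (max_value - 1) // 2


if __name__ == "__main__":
    print(run_loop(100), run_closed_form(100))
```

Both functions should agree for any integer argument; the loop version is the one worth benchmarking here, precisely because it is the slow one.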

# file: bench.pyx

def run(max):
    x = 1
    for i from 0 

Now we compile the file:

# cython bench.pyx

# gcc -shared $CFLAGS -I/usr/include/python2.5 -o bench.so bench.c

And run it through Python's great timeit module:

# python -m timeit -s 'from bench import run' 'run(100)'

1000000 loops, best of 3: 5.93 usec per loop

This looks exceedingly long-running to me ;)
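The same kind of measurement can also be scripted with the timeit module's Python API instead of the command line. The following is only a sketch: the run function is a pure Python stand-in (its body is an assumption, not the compiled module), and best_usec_per_call is a hypothetical helper name. With the compiled module at hand, you would replace the stand-in by "from bench import run".

```python
import timeit


def run(max_value):
    # Pure Python stand-in for the compiled bench.run (assumed behaviour).
    return 1 + sum(range(max_value))


def best_usec_per_call(func, arg, number=10000, repeat=3):
    # Like "python -m timeit": run `number` calls, `repeat` times,
    # and report the best (lowest) time per call in microseconds.
    timings = timeit.repeat(lambda: func(arg), number=number, repeat=repeat)
    return min(timings) / number * 1e6


if __name__ == "__main__":
    print("%.3f usec per loop" % best_usec_per_call(run, 100))
```

Taking the minimum over several repeats mirrors the "best of 3" line that the command line tool prints, and keeps noise from other processes out of the result. Note that the lambda adds a little call overhead of its own, so absolute numbers will be slightly higher than with the command line form.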

Since I have no idea what to do better, I first look through the generated C code. That's not as hard as it sounds, as Cython copies the original Cython code into comments and marks the line that it generates code for. The loop code gets translated into this:

  /* ".../TEST/bench.pyx":3
 * def run(max):
 *     x = 1
 *     for i from 0

The code I stripped (/* ... */) is error handling code. It's emitted in one long line so that it's easy to ignore - which is the best thing to do with it.

What you can see here is that Cython is smart enough to optimise the loop into a C loop with a C run variable (type long), but then the innocent-looking operator '+' uses Python API calls, so this is not what I had in mind when I wrote the code. I wanted it to be as fast and straightforward as it looks in Cython. However, Cython cannot know my intention here, as my code might as well depend on the semantics of Python's '+' operator (which is different from the '+' operator in C).
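One way to see the difference between the two '+' operators from plain Python is the following sketch. It uses ctypes.c_int32 as a stand-in for a 32-bit C int; ctypes does no overflow checking, so the value wraps around the way a C int typically does in practice (strictly speaking, signed overflow is undefined behaviour in C itself).

```python
import ctypes

INT_MAX = 2**31 - 1  # largest value of a 32-bit signed integer

# Python's '+' never overflows: the result just becomes a bigger integer.
python_sum = INT_MAX + 1

# Squeezing the same result into a 32-bit C integer wraps it around.
c_sum = ctypes.c_int32(INT_MAX + 1).value

if __name__ == "__main__":
    print(python_sum)  # 2147483648
    print(c_sum)       # -2147483648
```

This is why Cython has to stick to the Python semantics unless it is told otherwise: silently switching to C arithmetic could change the result of the program.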

Cython's way of dealing with Python/C type ambiguity is explicit static type declarations through cdefs. By default, all variables are defined as if I had written cdef object variable, but in this case, I want them to be plain C integers. So here is the straightforward way to tell Cython that I want the variables x and i to have C semantics rather than Python semantics:

# file: bench.pyx

def run(max):
    cdef int i,x
    x = 1
    for i from 0 

And the resulting C code shows me that Cython understood what I wanted:

  /* ".../TEST/bench.pyx":4
 *     cdef int i,x
 *     x = 1
 *     for i from 0 

Now timeit gives me something like this:

# python -m timeit -s 'from bench import run' 'run(100)'

1000000 loops, best of 3: 0.284 usec per loop

That's a speedup of about a factor of 20 compared to the original example. And that's definitely enough for today.
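For the record, the factor follows directly from the two timeit results quoted above. A tiny sketch of the arithmetic (the variable names are made up for this illustration):

```python
before_usec = 5.93   # untyped version, usec per call
after_usec = 0.284   # version with cdef int i,x, usec per call

speedup = before_usec / after_usec

if __name__ == "__main__":
    print("%.1fx" % speedup)  # 20.9x
```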