Writing C code is a premature optimisation

It seems I can't repeat this often enough. People who write Python wrappers for libraries in plain C "because it's faster" tend to overestimate their C-API skills and simply have no idea how costly maintenance is. It's like the old advice about optimisation: Don't do it! (and, if you're an expert: Don't do it yet!). If you write your wrapper code in C instead of Cython, it will be
  • slower
  • less portable
  • harder to maintain
  • harder to extend
  • harder to optimise
  • harder to debug and fix

It will cost you a lot of effort, both short term and long term, that is much better spent adding cool features and optimising the performance-critical parts of your code once you get it working. Say, is your time really so cheap that you want to waste it writing C code?
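To make the comparison concrete, here is what a complete Cython wrapper for a single C function can look like - a hypothetical sketch, wrapping sqrt() from the C math library. The equivalent hand-written C-API module needs argument parsing, error handling, reference counting and module initialisation on top of this:

```cython
# Hypothetical Cython wrapper for one C function (sqrt from math.h).
# A hand-written C-API extension needs PyArg_ParseTuple() calls,
# error checks and reference counting to achieve the same thing.

cdef extern from "math.h":
    double sqrt(double x)

def py_sqrt(double x):
    """Square root of x, callable from normal Python code."""
    return sqrt(x)
```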

Pull the plug

Munich should go without electricity more often. I can't remember the cars ever moving through the city in such a civilised manner before.

A static Python compiler? What's the point?

I've finally found the time to look through the talks of this year's EuroPython (which I didn't attend - I mean, Firenze? In the middle of summer? Seriously?). That made me stumble over a rather lengthy talk by Kay Hayen about his Nuitka compiler project. It took more than an hour, almost one and a half. I had to skip ahead through the video more than once. It certainly reminded me that it's a good idea to keep my own talks short.

Apparently, there was a mixed reception of that talk. Some people seemed to be heavily impressed, others didn't like it at all. According to the comments, Guido was more in the latter camp. I can understand that. The way Kay presented his project was not very convincing. The only "excuse" he had for its existence was basically "I do it in my spare time" and "I don't like the alternatives". In the stream of details that he presented, he completely failed to make the case for a static Python compiler at all. And Guido's little remark in his keynote that "some people still try to do this" showed once again that this case must still be made.

So, what's the problem with static Python compilers, compared to static compilers for other languages? Python can obviously be translated into static code, the mere fact that it can be interpreted shows that. Simply chaining all code that the interpreter executes will yield a static code representation. However, that doesn't answer the question whether it's worth doing. The interpreter in CPython is a much more compact piece of code than the result of such a translation would be, and it's also much simpler. The trace pruning that HotPy does, according to another talk at the same conference, is a very good example of the complexity involved. The fact that ShedSkin and PyPy's RPython explicitly do not try to implement the whole Python language speaks volumes. And the overhead of an additional compilation step is actually something that drives many people to use the Python interpreter in the first place. Static compilation is not a virtue. Thus, I would expect an excuse for writing a static translator from anyone who attempts it. The normal excuse that people bring forward is "because it's faster". Faster than interpretation.

Now, Python is a dynamic language, which already makes static translation difficult, but it's a dynamic language where side-effects are the normal case rather than an exception. That means that static analysis and optimisation can never be as effective as runtime analysis and optimisation, not with a reasonable effort. At least WPA (whole program analysis) would be required in order to make static optimisations as effective as runtime optimisations, but both ShedSkin and RPython make it clear that this can only be done for a limited subset of the language. And it obviously requires the whole program to be available at compile time, which is usually not the case, if only due to the excessive resource requirements of a WPA. PyPy is a great example: compiling its RPython sources takes tons of memory and a ridiculous amount of time.

That's why I don't think that "because it's faster" captures it, not as plain as that. The case for a static compiler must be that "it solves a problem". Cython does that. People don't use Cython because it has such a great Python code optimiser. Plain, unmodified Python code compiled by Cython, while usually faster than interpretation in CPython, will often be slower and sometimes several times slower than what PyPy's JIT driven optimiser gets out of it. No, people use Cython because it helps them solve a problem. Which is either that they want to connect to external non-Python libraries from Python code or that they want to be able to manually optimise their code, or both. It's manual code optimisation and tuning where static compilers are great. Runtime optimisers can't give you that and interpreters obviously won't give you that either. The whole selling point of Cython is not that it will make Python code magically run fast all by itself, but that it allows users to tremendously expand the range of manual optimisations that they can apply to their Python code, up to the point where it's no longer Python code but essentially C code in a Python-like syntax, or even plain C code that they interface with as if it were Python code. And this works completely seamlessly, without building new language barriers along the way.
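As an illustration of that optimisation gradient, here is a sketch (with made-up names) of the same function at two points of the spectrum: plain Python that Cython compiles as-is, and a typed variant whose loop runs as pure C:

```cython
def mean_py(values):
    # plain Python: Cython compiles this unchanged
    return sum(values) / len(values)

def mean_typed(double[:] values):
    # typed memoryview argument: the loop below compiles to a C loop
    cdef double total = 0.0
    cdef Py_ssize_t i
    for i in range(values.shape[0]):
        total += values[i]
    return total / values.shape[0]
```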

So, the point is not that Cython is a static Python compiler, the point is that it is more than a Python compiler. It solves a problem in addition to just being a compiler. People have been trying to write static compilers for Python over and over again, but all of them fail to provide that additional feature that can make them useful to a broad audience. I don't mind them doing that, having fun writing code is a perfectly valid reason to do it. But they shouldn't expect others to start raving about the result, unless they can provide more than just static compilation.

Apples, oranges and tomatoes

People keep asking how Cython and PyPy compare performance-wise, and which they should choose. This is my answer.

To ask which is faster, CPython, PyPy or Cython, outside of a very well defined and specific context of existing code and requirements, is basically comparing apples, oranges and tomatoes. Any of the three can win against the others for the right kind of applications (apple sauce on your pasta, anyone?). Here's a rule-of-thumb kind of comparison that may be way off for a given piece of code but should give you a general idea.

Note that we're only talking about CPU-bound code here. I/O-bound code will only show a difference in some very well selected cases (e.g. because Cython allows you to step down into low-level minimum-copy I/O using C, in which case it may not really have been I/O bound before).

PyPy is very fast for pure Python code that generally runs in loops for a while and makes heavy use of Python objects. It's great for computational code (and often way faster than CPython for it) but has its limits for numerics, huge data sets and other seriously performance critical code because it doesn't really allow you to fine-tune your code. Like any JIT compiler, it's a black box where you put something in and either you like the result or not. That equally applies to the integration with native code through the ctypes library, where you can be very lucky, or not. Although the platform situation keeps improving, the PyPy platform still lacks a wide range of external libraries that are available for the CPython platform, including many tools that people use to speed up their Python code.

CPython is usually quite a bit faster than PyPy for one-shot scripts (especially when including the startup time) and more generally for code that doesn't benefit from long-running loops. For example, I was surprised to see how much slower it is to run something as large as the Cython compiler inside of PyPy to compile code, despite it being written in pure Python. CPython is also very portable and extensible (especially using Cython) and has a much larger set of external (native) libraries available than the PyPy platform, including all of NumPy and SciPy, for example. However, its performance loses against PyPy for most pure Python applications that keep doing the same stuff for a while, without resorting to native code or optimised native libraries for the heavy lifting.

Cython is very fast for low-level computations, for (thread-)parallel code and for code that benefits from switching seamlessly between C/C++ and Python. The main feature is that it allows for very fine grained manual code tuning from pure Python to C-ish Python to C to external libraries. It is designed to extend a Python runtime, not to replace it. When used to extend CPython, it obviously inherits all advantages of that platform in terms of available code. It's usually way slower than PyPy for the kind of object-heavy pure Python code in which PyPy excels, including some kinds of computational code, even if you start optimising the code manually. Compared to CPython, however, Cython compiled pure Python code usually runs faster and it's easy to make it run much faster.
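For the (thread-)parallel case, here is a minimal sketch of what such code can look like in Cython, using prange from cython.parallel, which distributes the loop over several threads with the GIL released (function and variable names are made up for illustration):

```cython
from cython.parallel import prange

def scale(double[:] data, double factor):
    # each iteration is independent, so the loop can run on
    # multiple OpenMP threads in parallel, without the GIL
    cdef Py_ssize_t i
    for i in prange(data.shape[0], nogil=True):
        data[i] = data[i] * factor
```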

So, for an existing (mostly) pure Python application, PyPy is generally worth a try. It's usually faster than CPython and often fast enough all by itself. If it's not, well, then it's not and you can go and file a bug report with them. Or just drop it and happily ignore that it exists from that point on. Or just ignore it entirely in the first place, because your application runs fast enough anyway, so why change anything about it?

However, for most other, non-trivial applications, the simplistic question "which platform is faster" is much less important in real life. If an application has (existing or anticipated) non-trivial external dependencies that are not available or do not work reliably in a given platform, then the choice is obvious. And if you want to (or have to) optimise and tune the code yourself (where it makes sense to do that), the combination of CPython and Cython is often more rewarding, but requires more manual work than a quick test run in PyPy. For cases where most of the heavy lifting is being done in some C, C++, Fortran or other low-level library, either platform will do, often with a "there's already a binding for it" advantage for CPython and otherwise a usability and tunability advantage for Cython when the binding needs to be written from scratch. Apples, oranges and tomatoes, if you only ask which is faster.

Another thing to consider is that CPython and PyPy can happily communicate with each other from separate processes. So, there are ways to let applications benefit from both platforms at the same time when the need arises. Even heterogeneous MPI setups might be possible.
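A minimal sketch of that kind of setup: the parent process sends work as JSON to a second interpreter process and reads the result back. Here sys.executable merely stands in for the other interpreter; in a real setup you would point INTERPRETER at e.g. a PyPy binary. The worker code and function names are made up for illustration.

```python
import json
import subprocess
import sys

# stand-in for the second interpreter, e.g. "/usr/bin/pypy"
INTERPRETER = sys.executable

# the worker script that runs in the other interpreter; the hot loop
# inside it is what would benefit from PyPy's JIT in a real setup
WORKER = r"""
import json, sys
data = json.load(sys.stdin)
print(json.dumps(sum(x * x for x in data)))
"""

def run_in_other_interpreter(data):
    """Send 'data' as JSON to the worker process and return its result."""
    proc = subprocess.run(
        [INTERPRETER, "-c", WORKER],
        input=json.dumps(data), capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)
```

The same pattern extends naturally to long-running worker processes or message queues once the per-call process startup cost becomes noticeable.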

There is also work going on to improve the new integration of Cython with PyPy, which makes it possible to compile and run Cython code on the PyPy platform. The performance of that interface currently suffers from the lack of optimisation in PyPy's cpyext emulation layer, but that should get better over time. The main point for now is that the integration lifts the platform lock-in for both sides, which makes more native code available for both platforms.

XML parser performance in PyPy

I recently showed some benchmark results comparing the XML parser performance in CPython 3.3 to that in PyPy 1.7. Here's an update for PyPy 1.9 that also includes the current state of the lxml port to that platform, parsing a 3.4MB document-style XML file.

CPython 3.3pre:

Initial Memory usage: 11332
xml.etree.ElementTree.parse done in 0.041 seconds
Memory usage: 21468 (+10136)
xml.etree.cElementTree.parse done in 0.041 seconds
Memory usage: 21464 (+10132)
xml.etree.cElementTree.XMLParser.feed(): 25317 nodes read in 0.041 seconds
Memory usage: 21736 (+10404)
lxml.etree.parse done in 0.032 seconds
Memory usage: 28324 (+16992)
drop_whitespace.parse done in 0.030 seconds
Memory usage: 25172 (+13840)
lxml.etree.XMLParser.feed(): 25317 nodes read in 0.037 seconds
Memory usage: 30608 (+19276)
minidom tree read in 0.492 seconds
Memory usage: 29852 (+18520)

PyPy without JIT warming:

Initial Memory usage: 42156
xml.etree.ElementTree.parse done in 0.452 seconds
Memory usage: 44084 (+1928)
xml.etree.cElementTree.parse done in 0.450 seconds
Memory usage: 44080 (+1924)
xml.etree.cElementTree.XMLParser.feed(): 25317 nodes read in 0.457 seconds
Memory usage: 47920 (+5768)
lxml.etree.parse done in 0.033 seconds
Memory usage: 58688 (+16536)
drop_whitespace.parse done in 0.033 seconds
Memory usage: 55536 (+13384)
lxml.etree.XMLParser.feed(): 25317 nodes read in 0.055 seconds
Memory usage: 64724 (+22564)
minidom tree read in 0.541 seconds
Memory usage: 59456 (+17296)

PyPy with JIT warmup:

Initial Memory usage: 646824
xml.etree.ElementTree.parse done in 0.341 seconds
xml.etree.cElementTree.parse done in 0.345 seconds
xml.etree.cElementTree.XMLParser.feed(): 25317 nodes read in 0.342 seconds
lxml.etree.parse done in 0.026 seconds
drop_whitespace.parse done in 0.025 seconds
lxml.etree.XMLParser.feed(): 25317 nodes read in 0.039 seconds
minidom tree read in 0.383 seconds

What you can quickly see is that lxml performs equally well on both (actually slightly faster on PyPy) and beats the other libraries on PyPy by more than an order of magnitude. The absolute numbers are fairly low, though, way below a second for the 3.4MB file. It'll be interesting to see some more complete benchmarks at some point that also take some realistic processing into account.

Remark: cElementTree is just an alias for the plain Python ElementTree on PyPy and ElementTree uses cElementTree in the background in CPython 3.3, which is why both show the same performance. The memory sizes were measured in forked processes, whereas the PyPy JIT numbers were measured in a repeatedly running process in order to take advantage of the JIT compiler. Note the substantially higher memory load of PyPy here.

Update: I originally reported the forked memory size with the non-forked performance for PyPy. The above now shows both separately. A more real-world comparison would likely yield an even higher memory usage on PyPy than the numbers above, which were mostly meant to give an idea of the memory usage of the in-memory tree (i.e. the data impact).

XML parser performance in CPython 3.3 and PyPy 1.7

In a recent article, I compared the performance of MiniDOM and the three ElementTree implementations ElementTree, cElementTree and lxml.etree for parsing XML in CPython 3.3. Given the utterly poor performance of the pure Python library MiniDOM in this competition, I decided to give it another chance and tried the same in PyPy 1.7. Because lxml.etree and cElementTree are not available on this platform, I only ran the tests with plain ElementTree and MiniDOM. I also report the original benchmark results for CPython below for comparison.

Parser performance of XML libraries in CPython 3.3 and PyPy 1.7

While I also provide numbers regarding the memory usage of each library in this comparison, they are not directly comparable between PyPy and CPython because of the different memory management of both platforms and because the overall memory that PyPy uses right from the start is much larger than for CPython. So the relative increase in memory may or may not be an accurate way to tell what each runtime does with the memory. However, it appears that PyPy manages to kill at least the severe memory problems of MiniDOM, as the total amount of memory used for the larger files is several times smaller than that used by CPython.

Memory usage of XML trees in CPython 3.3 and PyPy 1.7

So, what do I take from this benchmark? If you have legacy MiniDOM code lying around, you want PyPy to run it. It exhibits several times better performance in terms of memory and runtime. It also performs substantially better for ElementTree than the plain Python ElementTree in CPython.

However, for fast XML processing in general, the better performance of PyPy even for plain Python ElementTree is not really all that interesting, because it is still several times slower than cElementTree or lxml.etree in CPython. That means that you will often be able to process multiple files in CPython in the time that you need for just one in PyPy, even if your actual application code that does the processing manages to get a substantial JIT speed-up in PyPy. Even worse, the GIL in PyPy will keep your code from getting the parallel speed-up that you usually get with multi-threaded processing in lxml and CPython, e.g. in a web server setting.
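The multi-threaded pattern in question looks roughly like this. It is sketched here with the stdlib parser, which holds the GIL and therefore stays serialised; with lxml.etree in CPython the parser releases the GIL while parsing, so the same pattern gets a real parallel speed-up. Function names are made up for illustration.

```python
import threading
from xml.etree import ElementTree as ET

def parse_many(documents):
    """Parse a list of XML strings, one thread per document."""
    results = [None] * len(documents)

    def worker(i, text):
        # with lxml.etree, this call would run without the GIL
        results[i] = ET.fromstring(text)

    threads = [threading.Thread(target=worker, args=(i, doc))
               for i, doc in enumerate(documents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```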

So, as always, the decision depends on what your actual application does and which library it uses. Do your own benchmarks.

XML parser performance in Python 3.3

For a recent bug ticket about MiniDOM, I collected some performance numbers that compare it to ElementTree, cElementTree and lxml.etree under a recent CPython 3.3 developer build, all properly compiled and optimised for Linux 64bit, using os.fork() and the resource module to get a clean measure of the memory usage for the in-memory tree. Here are the numbers:

Parsing hamlet.xml in English, 274KB:

Memory usage: 7284
xml.etree.ElementTree.parse done in 0.104 seconds
Memory usage: 14240 (+6956)
xml.etree.cElementTree.parse done in 0.022 seconds
Memory usage: 9736 (+2452)
lxml.etree.parse done in 0.014 seconds
Memory usage: 11028 (+3744)
minidom tree read in 0.152 seconds
Memory usage: 30360 (+23076)

Parsing the old testament in English (ot.xml, 3.4MB) into memory:

Memory usage: 20444
xml.etree.ElementTree.parse done in 0.385 seconds
Memory usage: 46088 (+25644)
xml.etree.cElementTree.parse done in 0.056 seconds
Memory usage: 32628 (+12184)
lxml.etree.parse done in 0.041 seconds
Memory usage: 37500 (+17056)
minidom tree read in 0.672 seconds
Memory usage: 110428 (+89984)

A 25MB XML file with Slavic Unicode text content:

Memory usage: 57368
xml.etree.ElementTree.parse done in 3.274 seconds
Memory usage: 223720 (+166352)
xml.etree.cElementTree.parse done in 0.459 seconds
Memory usage: 154012 (+96644)
lxml.etree.parse done in 0.454 seconds
Memory usage: 135720 (+78352)
minidom tree read in 6.193 seconds
Memory usage: 604860 (+547492)

And a contrived 4.5MB XML file with a lot more structure than data and no whitespace at all:

Memory usage: 11600
xml.etree.ElementTree.parse done in 3.374 seconds
Memory usage: 203420 (+191820)
xml.etree.cElementTree.parse done in 0.192 seconds
Memory usage: 36444 (+24844)
lxml.etree.parse done in 0.131 seconds
Memory usage: 62648 (+51048)
minidom tree read in 5.935 seconds
Memory usage: 527684 (+516084)

I also took the last file and pretty printed it, thus adding lots of indentation whitespace that increased the file size to 6.2MB. Here are the numbers for that:

Memory usage: 13308
xml.etree.ElementTree.parse done in 4.178 seconds
Memory usage: 222088 (+208780)
xml.etree.cElementTree.parse done in 0.478 seconds
Memory usage: 103056 (+89748)
lxml.etree.parse done in 0.199 seconds
Memory usage: 101860 (+88552)
minidom tree read in 8.705 seconds
Memory usage: 810964 (+797656)

Yes, 2MB of whitespace account for almost 300MB more memory in MiniDOM.

Here are the graphs:

XML tree memory usage in Python 3.3 for lxml, ElementTree, cElementTree and MiniDOM

XML parser performance in Python 3.3 for lxml, ElementTree, cElementTree and MiniDOM

I think it is pretty clear that minidom is basically off the scale, whereas cElementTree and lxml.etree are pretty close to each other. lxml tends to be a tad faster, and cElementTree tends to use a little less memory.
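The measurement method mentioned above (os.fork() plus the resource module) can be sketched like this - a simplified, hypothetical version that reports the child process's peak RSS growth while it holds the parsed tree in memory (in KB, assuming Linux, where ru_maxrss is reported in kilobytes):

```python
import os
import resource

def measure(build_tree):
    """Run build_tree() in a forked child; report its peak RSS growth in KB."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                      # child: build the tree, report, exit
        os.close(r)
        before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        tree = build_tree()           # keep a reference so the tree stays alive
        after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        os.write(w, b"%d" % (after - before))
        os._exit(0)
    os.close(w)                       # parent: read the child's number back
    delta = int(os.read(r, 64) or b"0")
    os.close(r)
    os.waitpid(pid, 0)
    return delta
```

Forking keeps each measurement isolated: whatever the parser allocates dies with the child, so one library's memory behaviour cannot distort the numbers of the next.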

PyCon-DE 2011

The first-ever PyCon-DE has come to an end. It was a huge success, both for me personally and judging by everything I heard from the other attendees. There were plenty of interesting talks from the most diverse areas, and lots of people whom I was either glad to meet again, had always wanted to meet, or had never had anything to do with before but now had some interesting discussions with. The whole organisation ran like clockwork, and even the food was as good as it was varied.

One of the most important outcomes of the conference was the founding of the Python Software Verband e.V. as the successor to the formerly Zope-specific DZUG. The new orientation will make it much easier to bring the German-speaking Python community under a common roof and to strengthen the Python lobby in Germany, Austria and Switzerland.

I myself gave two talks, on Cython and lxml, as well as a Cython tutorial. All were received with great interest (although I am still waiting for the concrete feedback on the tutorial) and gave rise to some interesting discussions. Cython and lxml remain two best-of-breed tools and big topics in the Python community. lxml in particular earned me quite a few pats on the back for having turned it into the one big XML tool for Python over the last few years. Paul Everitt, who gave a keynote and whom I had always wanted to meet (done, as of now), even put up a huge slide in the middle of his talk with just two names on it - Martijn Faassen (who started lxml) and me. So I am becoming famous in my old age after all ...

I spent some time talking with Kay Hayen, who is writing a static Python compiler called Nuitka. Unsurprisingly, he has run into quite a few of the problems that we also hit with Cython. He is right that I am not entirely happy that he started a separate project instead of helping us with Cython, but that's just how open source works. Everyone has the right to invent as many wheels as they enjoy. As far as I understand it, Kay is aiming with Nuitka at a subset of what we are making of Cython, but coming at it from the other side. Cython used to be just an extension language and is now additionally evolving into a fully fledged Python compiler, whereas Nuitka is meant to fill the Python compiler niche exclusively. But Kay has already shown quite a bit of chutzpah and staying power along the way, so there may yet be a few surprises in store ...

It was interesting to see some talks on topics that my employer also wrestles with - just with Python instead of Java. For example, an internal department at SAP is working on a web-based client infrastructure for SAP systems in Python, including an object-to-SAP mapper (similar to an ORM), offline caching mechanisms and so on. From the talk, it looked as though this could prove interesting for SAP clients in general, quite independently of web applications. And it might soon be open source ...

Another talk that immediately made me feel at home was about PyTAF, a graphical framework for application integration. It is developed internally at LBBW in Stuttgart and achieves more or less what we do in Java. On top of that, it has a GUI for graphically assembling data integration processes, and it is written in Python, which is a serious advantage for this kind of software. Incidentally, it uses lxml.objectify internally for data processing - a very good choice :)

Next year's PyCon-DE could well take place at the same venue again. Given how well this year's worked out, there is really no reason to switch. Although somewhere like Berlin is of course always worth a trip, too ...