PyCon-DE 2011 (en)

The first PyCon-DE ever is over. It was a huge success, both from my own POV and from what I heard from others. Quite a number of interesting talks from a very broad spectrum, loads of people that I either knew already, always wanted to meet, or had never heard of but found interesting to talk to. The organisation worked out impressively well, even the food was as good as it was diverse.

One of the major outcomes was the formation of the "Python Software Verband e.V." as a successor to the previous Zope-centered "DZUG e.V.". The new direction will make it much easier to gather the German-speaking Python community under a common umbrella, and to strengthen Python lobbying in Germany, Austria and Switzerland.

I gave two talks on Cython and lxml, as well as a tutorial on Cython. All of them were well received (although I'm still waiting for the final feedback on the tutorial) and led to interesting discussions. Both Cython and lxml continue to be best-of-breed tools and hot topics in the community, and I received a lot of backslapping for making lxml the one great XML tool for Python over the last few years. One of the keynote speakers, Paul Everitt, whom I had wanted to meet for a while and finally got the chance to, even put up a huge slide right in his talk with only two names on it: that of Martijn Faassen (the original author of lxml) and mine. I'm finally getting famous. ;)

I spent some time talking to Kay Hayen, who has written a static Python compiler called Nuitka. Not surprisingly, he has bumped into many of the same problems that we met with Cython. He's right that I'm not entirely happy that he started a completely separate project instead of helping with Cython, but that's Open Source. People are free to reinvent as many wheels as they like. From what I understand, Nuitka aims to become a subset of what Cython heads for, just coming from a different direction: Cython was originally an extension language and is now additionally evolving into a Python compiler, whereas Nuitka is plainly targeted at being a Python compiler. But I wouldn't mind being surprised at some point. So far, Kay has certainly invested a remarkable amount of work, and quite successfully.

It was nice to see in a couple of presentations that the kind of things that the company I currently work for is doing in Java is done in Python in other places. For example, an internal department at SAP is developing a web-based client infrastructure for SAP systems in Python, including a transparent object-to-SAP mapper (similar to ORMs), offline caching mechanisms, etc. From the presentation, it sounded very much like this could be useful for talking to SAP in general, not only for web clients. And it may become open source at some point.

Another talk in the same vein was about PyTAF, a graphical application integration framework for financial applications that is being developed in-house at LBBW in Stuttgart. It aims to do more or less the same as the code we write in Java, but has a graphical frontend for putting together integration flows. And it's Python, which is a serious advantage for this kind of software. It even uses lxml.objectify internally for data processing - best choice ever! :)

It may well be that next year's PyCon-DE will take place at the same location. It worked so well that there's no reason for a change. Although Berlin would also be a great location...

Fix URL display in Firefox 7

Firefox 7 comes with a very annoying "feature" that breaks copying from the URL bar by stripping away the protocol prefix from the URL. Here is how to fix it. The magic option in "about:config" is called "browser.urlbar.trimURLs". Switch it off and Firefox starts working again.
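If you prefer editing files over clicking through "about:config", the same preference can, as far as I know, also be set in the user.js file of your Firefox profile (create the file if it does not exist):

```javascript
// user.js in the Firefox profile directory:
// show the full URL (including the protocol prefix) in the location bar again
user_pref("browser.urlbar.trimURLs", false);
```

Firefox reads user.js at startup, so the setting survives preference resets.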

Cython close to hand-crafted C code for generators

I did a couple of experiments compiling itertools with the new generator support in Cython. In CPython, the itertools module is actually written in hand tuned C and does very little computation in its generators, so I knew it would be hard to reach with generated code. But Cython does a pretty good job.

Something as trivial as chain() is exactly as fast as in the C implementation, but compared to the more than 60 lines of C code, it is certainly a lot more readable in Cython:

def chain(*iterables):
    """Make an iterator that returns elements from the first iterable
    until it is exhausted, then proceeds to the next iterable, until
    all of the iterables are exhausted. Used for treating consecutive
    sequences as a single sequence.
    """
    for it in iterables:
        for element in it:
            yield element
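Just to illustrate the semantics (using the stdlib itertools.chain here, which the generator above reimplements):

```python
from itertools import chain

# chain() lazily concatenates its argument iterables into one stream
print(list(chain([1, 2], "ab", (3,))))  # [1, 2, 'a', 'b', 3]
```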

Other functions, like islice(), are faster in C, partly because CPython actually takes a couple of shortcuts, e.g. by only looking up the iterator slot method once. You cannot do that in Python code, and I wanted to keep the implementation compatible with regular Python. Specifically, the C speed advantage for islice() is currently about 30-50% in general, although the Cython implementation can also be up to 10% faster for some cases, e.g. when extracting only a couple of items from the middle of a longer sequence. The C implementation is about 90 lines, here is the Cython implementation:

import sys

import cython

# Python 2/3 compatibility

_max_size = cython.declare(cython.Py_ssize_t,
        getattr(sys, "maxsize", getattr(sys, "maxint", None)))

@cython.locals(i=cython.Py_ssize_t, nexti=cython.Py_ssize_t,
               start=cython.Py_ssize_t, stop=cython.Py_ssize_t, step=cython.Py_ssize_t)
def islice(iterable, *args):
    """Make an iterator that returns selected elements from the
    iterable. If start is non-zero, then elements from the iterable
    are skipped until start is reached. Afterward, elements are
    returned consecutively unless step is set higher than one which
    results in items being skipped. If stop is None, then iteration
    continues until the iterator is exhausted, if at all; otherwise,
    it stops at the specified position. Unlike regular slicing,
    islice() does not support negative values for start, stop, or
    step. Can be used to extract related fields from data where the
    internal structure has been flattened (for example, a multi-line
    report may list a name field on every third line).
    """
    s = slice(*args)
    start = s.start or 0
    stop = s.stop or _max_size
    step = s.step or 1
    if start < stop:
        nexti = start
        for i, element in enumerate(iterable):
            if i == nexti:
                yield element
                nexti += step
                if nexti >= stop or nexti < 0:  # 'nexti' may wrap around as a Py_ssize_t
                    break
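For reference, these are the calling conventions being reproduced, shown with the stdlib itertools.islice:

```python
from itertools import islice

# stop only
print(list(islice(range(10), 3)))        # [0, 1, 2]
# start, stop, step
print(list(islice(range(10), 2, 8, 3)))  # [2, 5]
# no stop: run until the iterable is exhausted
print(list(islice(range(10), 5, None)))  # [5, 6, 7, 8, 9]
```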

Here is one that is conceptually quite simple: count(). I had to optimise it quite a bit, because the iteration code in the C code is extremely tight. Even the tuned version below runs about 10% slower than the hand tuned C version, which is about 230 lines long.


@cython.locals(i=cython.Py_ssize_t)
def count(n=0):
    """Make an iterator that returns consecutive integers starting
    with n. If not specified n defaults to zero. Often used as an
    argument to imap() to generate consecutive data points. Also,
    used with zip() to add sequence numbers.
    """
    try:
        i = n
    except OverflowError:
        i = _max_size  # n does not fit into a Py_ssize_t => skip i-loop
    else:
        n = _max_size  # first value after i-loop
    while i < _max_size:
        yield i
        i += 1
    while True:
        yield n
        n += 1
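Again for reference, the behaviour being mimicked, shown with the stdlib itertools functions:

```python
from itertools import count, islice

# count() is infinite, so islice() is used to cut off the output
print(list(islice(count(10), 5)))  # [10, 11, 12, 13, 14]

# values outside the C integer range take the slow path but still work
big = 2**66
assert next(count(big)) == big
```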

Note that all of the above generators execute in the order of microseconds, so even a slow-down of 50% will likely not be measurable in real world code.

So far, I did not try any of the more fancy functions in itertools (those that actually do something). The Cython project has announced a Google Summer of Code project with exactly this intent: rewriting some of CPython's C stdlib modules in pure Python code with Cython compiler hints. So I leave this exercise to interested readers for now.

A faster Python difflib

As part of a push for making it easier to develop faster and more readable core/stdlib code for the CPython runtime, I have written a short patch against the difflib module in the Python standard library to make it a) compile with Cython and b) run faster as a compiled binary module. The net result is that it runs more than 50% faster with only the minor code modifications provided in the patch, while still running about as fast as before in the normal CPython interpreter.
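For a feel of what gets compiled here: difflib is pure Python, and its hot spots are the plain loops inside SequenceMatcher, the workhorse behind the module's diff functions. For illustration:

```python
import difflib

# SequenceMatcher drives ndiff(), unified_diff() and friends internally
sm = difflib.SequenceMatcher(None, "abcd", "bcde")
print(sm.ratio())  # 0.75
```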

no mod_deflate? use mod_rewrite!

Sadly, the shared hosting service for my web site does not support mod_deflate in its Apache installation. There are various resources on the web that deal with this in one way or another, but they all talk about the drawbacks of shutting out certain clients or presenting error pages to some of them. Well, if you cannot get on-the-fly compression, you at least face the drawback of having to compress all your pages statically beforehand, thus providing a duplicate file for each. But that is actually a good thing: the server delivers the content even faster that way, because it does not have to compress anything while serving it. So you put in a little more work on site updates and trade space for speed where it matters.

Here is a simple way to configure mod_rewrite to serve the compressed pages even if you do not have mod_deflate available.

First, compress all your HTML pages, e.g. using

find . -name "*.html" | while read file; do gzip -9c < "$file" > "$file.gz"; done
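To make sure nothing gets garbled on the way, a quick round-trip check helps (the file name here is just an example):

```shell
# create a throw-away test page, compress it, and verify that the
# compressed copy decompresses back to the original byte for byte
printf '<html>hello</html>\n' > page.html
gzip -9c < page.html > page.html.gz
gzip -dc page.html.gz | cmp - page.html && echo identical
```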

Since I am using make to handle my web site, here is an extract that helps me keep all compressed files up to date when I upload the pages (also in parallel, when I pass e.g. "-j 5"):

TEXT_FILES=$(shell find -name "*.html" -o -name "*.css" -o -name "*.js")

.PHONY: copy

copy: $(addsuffix .gz, $(TEXT_FILES))

%.gz: %
    gzip -9c $< > $@

Now we can configure mod_rewrite to prefer these compressed files for clients that support it. To do this, I put the following into my .htaccess file:

RewriteCond %{REQUEST_FILENAME}.gz    -f
RewriteCond %{HTTP:Accept-Encoding}   .*gzip.*
RewriteRule ^(.*[.])(html|js|css)$    $1$2.gz    [L]
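One caveat I would expect with this setup: depending on the server's mod_mime defaults, the ".gz" files may get served as "application/x-gzip" downloads rather than as gzip-encoded HTML. A possible fix for the same .htaccess looks like this (a sketch; whether these mod_mime directives are allowed depends on the hoster's configuration):

```apache
# stop ".gz" from forcing the gzip media type, so page.html.gz keeps text/html
RemoveType .gz
# ...and declare the content as gzip-encoded instead
AddEncoding gzip .gz
```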

To spell this out:

  1. check if there really is a compressed version of the requested file available ("-f" tests for the path being a file).
  2. check if the client tells us that it accepts "gzip" compressed content
  3. if both conditions hold, internally rewrite the request to the compressed version of the file.

Given that the compressed files are often 5x smaller than the plain HTML version, this saves lots of bandwidth from my web site with really little effort.

exporting mbox archives from pipermail

I just stumbled over this, and I find it totally worth writing up: you can directly export a pipermail archive in mbox format. This means that changing your mailing list host is no longer a major problem. You can just grab the archives and have the new host import them, so that the complete history stays in one place. (OK, there often are public archives elsewhere as well, but they are not yours; the pipermail one usually is, and it should be!)

The magic URL is: http://[server]/mailman/private/[listname].mbox/[listname].mbox

You go to that URL, adapted for your mailing list, log in, then go back to that URL (the login usually redirects you somewhere else), and there you go: the download starts. I'm totally happy about this feature.

Fun with Xalan 2.7.1

When I make an XSL stylesheet abort via <xsl:message> (whatever that has to do with a message, but anyway...), Xalan throws this exception:

javax.xml.transform.TransformerException: Formatvorlage hat die Beendigung übertragen.

(The German message translates back to something like "the stylesheet has transmitted the termination".)

Once again, someone really put some thought into that translation.
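For context, aborting a transformation is what the terminate attribute of <xsl:message> is for; a minimal example of the construct in question (the message text is made up):

```xml
<!-- stops the transformation and reports the message as an error -->
<xsl:message terminate="yes">Required element is missing.</xsl:message>
```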