Archive for the ‘Software’ Category

A faster Python difflib

Monday, March 21st, 2011

As part of a push to make it easier to develop faster and more readable core/stdlib code for the CPython runtime, I have written a short patch against the difflib module in the Python standard library to make it a) compile with Cython and b) run faster as a compiled binary module. The net result is that it runs more than 50% faster with only the minor code modifications provided in the patch, and the patched module still runs about as fast as before in the normal CPython interpreter.

no mod_deflate? use mod_rewrite!

Tuesday, February 22nd, 2011

Sadly, the shared hosting service for my web site does not support mod_deflate in its Apache installation. There are various resources on the web that deal with this in one way or another, but they all talk about drawbacks such as shutting out certain clients or presenting error pages to some of them. Well, if you cannot get on-the-fly compression, the one remaining drawback is that you have to compress all your pages statically beforehand, thus keeping a duplicate file for each. But that is actually a good thing: the server responds even faster that way because it does not have to compress anything while it is serving the content. So you put in a little more work on site updates and trade space for speed where it matters.

Here is a simple way to configure mod_rewrite to serve the compressed pages even if you do not have mod_deflate available.

First, compress all your HTML pages, e.g. using

find . -name "*.html" | while read -r file; do gzip -9c < "$file" > "$file.gz"; done

Since I am using make to handle my web site, here is an extract that helps me keep all compressed files up to date when I upload the pages (also in parallel, when I pass e.g. "-j 5"):

TEXT_FILES=$(shell find -name "*.html" -o -name "*.css" -o -name "*.js")

.PHONY: copy

copy: $(addsuffix .gz, $(TEXT_FILES))
	copy_website.sh

%.gz: %
	gzip -9c $< > $@
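For readers who do not use make, the same "recompress only what changed" idea can be sketched in Python. This is a minimal sketch, not part of my actual setup: the function name `compress_stale` and the `TEXT_SUFFIXES` set are made up for illustration, and the timestamp comparison only approximates what make does with the `%.gz: %` rule.

```python
import gzip
import shutil
from pathlib import Path

# Assumption: the same three file types as in the Makefile extract.
TEXT_SUFFIXES = {".html", ".css", ".js"}


def compress_stale(root):
    """Gzip every text file under root whose .gz copy is missing or older."""
    updated = []
    for source in sorted(Path(root).rglob("*")):
        if not source.is_file() or source.suffix not in TEXT_SUFFIXES:
            continue
        target = source.with_name(source.name + ".gz")
        if target.exists() and target.stat().st_mtime >= source.stat().st_mtime:
            continue  # compressed copy is up to date, nothing to do
        # Equivalent of "gzip -9c source > target"
        with source.open("rb") as src, gzip.open(target, "wb", compresslevel=9) as dst:
            shutil.copyfileobj(src, dst)
        updated.append(target)
    return updated
```

Calling `compress_stale(".")` from the site directory before an upload returns the list of `.gz` files it had to (re)create.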

Now we can configure mod_rewrite to prefer these compressed files for clients that support it. To do this, I put the following into my .htaccess file:

RewriteCond %{REQUEST_FILENAME}.gz -f
RewriteCond %{HTTP:Accept-Encoding}   .*gzip.*
RewriteRule ^(.*[.])(html|js|css)$        $1$2.gz      [L]

To spell this out:

  1. check if there really is a compressed version of the requested file available ("-f" tests for the path being a file).
  2. check if the client tells us that it accepts "gzip" compressed content.
  3. if both conditions hold, internally rewrite the request to the compressed file version (there is no redirect flag, so the client never sees a different URL).
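The decision the three directives make can be modelled in a few lines of Python. This is only an illustration of the logic, not how Apache implements it: the function name `rewrite_target` and the `existing_files` set (standing in for the file system check) are assumptions made for the example.

```python
import re

# Same pattern as in the RewriteRule above.
REWRITE_PATTERN = re.compile(r"^(.*[.])(html|js|css)$")


def rewrite_target(path, accept_encoding, existing_files):
    """Return the path that would be served for a request."""
    # RewriteCond %{REQUEST_FILENAME}.gz -f
    has_compressed_copy = (path + ".gz") in existing_files
    # RewriteCond %{HTTP:Accept-Encoding} .*gzip.*
    client_accepts_gzip = "gzip" in accept_encoding
    if has_compressed_copy and client_accepts_gzip and REWRITE_PATTERN.match(path):
        return path + ".gz"
    return path  # no rewrite: serve the uncompressed file
```

For example, `rewrite_target("index.html", "gzip, deflate", {"index.html", "index.html.gz"})` gives `"index.html.gz"`, while the same request from a client that sends no gzip support gets `"index.html"`.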

Given that the compressed files are often 5x smaller than the plain HTML version, this saves lots of bandwidth from my web site with very little effort.

exporting mbox archives from pipermail

Tuesday, February 15th, 2011

I just stumbled over this, and I find it totally worth writing up. You can directly export a pipermail archive in mbox format. This means that it's no longer a major problem to change mailing list hosts: you can just grab the archives and have the new host import them, so that the complete history remains in one place (ok, there often are public archives as well, but they are not yours - the pipermail one usually is, and it should be!).

The magic URL is:

http://www.example.com/mailman/private/[listname].mbox/[listname].mbox

You go to that URL as adapted for your mailing list, log in, then go back to that URL (the login usually redirects you somewhere else), and there you go, the download starts. I'm totally happy about this feature.
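As a small sketch of what you can do with this: the first function below just fills in the URL pattern from above, and the second one uses the `mailbox` module from the Python standard library to check that a downloaded archive is readable. The function names are made up for this example, and the download itself is left out because it needs your list login.

```python
import mailbox


def mbox_export_url(base_url, listname):
    """Fill in the URL pattern shown above for one mailing list."""
    return f"{base_url.rstrip('/')}/mailman/private/{listname}.mbox/{listname}.mbox"


def list_subjects(mbox_path):
    """Return the Subject header of every message in a downloaded mbox file."""
    archive = mailbox.mbox(mbox_path)
    try:
        return [message["subject"] for message in archive]
    finally:
        archive.close()
```

If `list_subjects()` returns roughly as many entries as the archive has postings, the export is probably complete enough to hand over to the new host.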

Fun with Xalan 2.7.1

Wednesday, November 10th, 2010

When I have an XSL stylesheet terminate itself via <xsl:message> (whatever that has to do with a message, but fine...), Xalan throws this exception in its German localisation:

javax.xml.transform.TransformerException: Formatvorlage hat die Beendigung übertragen.

Translated back literally, that reads "Style template has transmitted the termination". Once again, someone really put a lot of thought into the translation.

Seriously, there is a function for that

Friday, October 29th, 2010

I keep running into code like this:

tree = lxml.etree.parse( StringIO(bytes_data) )

The docs are actually very clear about this. There is a function called etree.fromstring(data) that is meant to parse from a string. It is the same as in ElementTree. Obviously, no-one reads documentation. But it’s there, really.
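To make the difference concrete, here is a small comparison. It uses ElementTree from the standard library so that it runs without lxml installed; since the function is the same in both, the lxml spelling would be `lxml.etree.fromstring(bytes_data)`. The sample document is made up for the example.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

bytes_data = b"<root><item>text</item></root>"

# The roundabout way: wrap the data in a file-like object just to parse it.
tree = ET.parse(BytesIO(bytes_data))
root_via_parse = tree.getroot()

# The direct way: fromstring() takes the data and returns the root element.
root = ET.fromstring(bytes_data)
```

Note that `fromstring()` gives you the root element directly rather than a tree object, which is usually what the calling code wanted anyway.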

BeautifulSoup vs. lxml.html parser performance

Friday, October 29th, 2010

Here is yet another little performance comparison between BeautifulSoup and lxml.html. The comparison graph is especially fun to see.

Dive into Python 3 presents ElementTree and lxml.etree

Friday, October 29th, 2010

It's worth mentioning that the Python 3 edition of "Dive into Python" has a lot of rewritten and updated content. The thing that I like best about it is that it finally has an up-to-date chapter on XML that is entirely based on ElementTree and lxml.etree, the major XML libraries for Python. So, even for those who want to continue using Python 2 for a while, it's worth reading the new edition instead of the outdated Python 2 edition.

Locally loading DTDs from XML catalogs with lxml

Thursday, October 28th, 2010

It seems that it is not obvious to all lxml users how DTDs and external entities are loaded by an XML processor. Specifically, if the system is misconfigured, it can happen that lxml fails to parse a document that needs a DTD or that it tries to load the DTD from the network repeatedly, when the no_network parser option is set to False (obviously, network access is blocked by default).

I commented on this on the lxml mailing list in 2008 when there was a discussion about high web traffic at the W3C due to excessive DTD loading, which was also attributed to parts of the Python standard library.

The right way to handle this (in general, but especially for lxml) is to configure the XML catalogs on the local system. The libxml2 site has some documentation on how to do this. The advantage of using catalogs is that most XML tools will use them when available, so it is a system wide fix for the problem. Most Linux installations come with readily configured XML catalogs, but other systems may have to get fixed up.
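For systems that need fixing up, a catalog is just a small XML file that maps public identifiers to local copies of the DTDs. The sketch below generates a minimal one with the standard library. The function name `build_catalog` is made up, and the local DTD path in the usage note is purely hypothetical, so adapt it to wherever the DTD really lives on your machine.

```python
import xml.etree.ElementTree as ET

# Namespace of the OASIS XML catalog format that libxml2 reads.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def build_catalog(entries):
    """Build a catalog document mapping public IDs to local DTD URIs."""
    ET.register_namespace("", CATALOG_NS)  # serialise without a prefix
    catalog = ET.Element(f"{{{CATALOG_NS}}}catalog")
    for public_id, local_uri in entries:
        ET.SubElement(
            catalog,
            f"{{{CATALOG_NS}}}public",
            {"publicId": public_id, "uri": local_uri},
        )
    return ET.tostring(catalog, encoding="unicode")
```

For example, `build_catalog([("-//W3C//DTD XHTML 1.0 Strict//EN", "file:///path/to/xhtml1-strict.dtd")])` returns the text of a one-entry catalog. libxml2 looks for catalogs in /etc/xml/catalog by default and also honours the XML_CATALOG_FILES environment variable, so saving the result to a file and pointing that variable at it should be enough for a quick test.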