Finally some I/O benchmarks on lxml

Stefan Behnel

2006-05-08 16:53

In case you don't know, lxml is one of the most feature-rich XML libraries for Python. It closely follows the ElementTree API and extends it with support for XSLT, XPath, RelaxNG and loads of other XML candies, all driven by the marvelous libxml2/libxslt libraries. I'm one of the authors of lxml, so be careful, I may be biased.

The benchmarks I ran during the development of version 0.9 were mainly geared towards comparisons of the API. Some of them were even chosen explicitly to show where lxml's performance is low enough to merit some work. Now I have finally come up with some simple benchmarks for the I/O part, and those show lxml in a completely different light.

Imagine you wanted to serialize a large XML tree to UTF-8 (which is the internal encoding used in lxml and arguably the most common serialization for XML):


lxe: tostring_utf8             (U- T1     )   21.6062 msec/pass
ET : tostring_utf8             (U- T1     )  658.4980 msec/pass
cET: tostring_utf8             (U- T1     )  618.3270 msec/pass

or to UTF-16 (which is not the internal encoding):


lxe: tostring_utf16            (S- T1     )   24.6755 msec/pass
ET : tostring_utf16            (S- T1     )  668.2270 msec/pass
cET: tostring_utf16            (S- T1     )  629.3236 msec/pass
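
For readers who want to try something similar, here is a minimal sketch of timing tostring() for both encodings. It is not the benchmark script that produced the numbers above: the build_tree() helper, the tree shape and the number of passes are assumptions made up for illustration, and the import falls back to the standard library's ElementTree if lxml is not installed.

```python
import timeit

try:
    from lxml import etree  # the library discussed in this post
except ImportError:
    # Fallback with the same ElementTree-style API (assumption: stdlib is acceptable)
    import xml.etree.ElementTree as etree


def build_tree(children=1000):
    """Build a flat, hypothetical test tree: <root><item id="0">text 0</item>...</root>."""
    root = etree.Element("root")
    for i in range(children):
        item = etree.SubElement(root, "item", id=str(i))
        item.text = "text %d" % i
    return root


root = build_tree()

# Serialize once to each encoding; both calls return byte strings.
utf8_bytes = etree.tostring(root, encoding="utf-8")
utf16_bytes = etree.tostring(root, encoding="utf-16")

# Time a few passes of each serialization and report msec/pass.
passes = 10
utf8_seconds = timeit.timeit(lambda: etree.tostring(root, encoding="utf-8"), number=passes)
utf16_seconds = timeit.timeit(lambda: etree.tostring(root, encoding="utf-16"), number=passes)

print("tostring_utf8 : %10.4f msec/pass" % (utf8_seconds / passes * 1000))
print("tostring_utf16: %10.4f msec/pass" % (utf16_seconds / passes * 1000))
```

Absolute timings will depend on the machine, the library versions and the tree, so only the relative comparison between libraries on the same setup is meaningful.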

And how about serializing a tree to UTF-8, writing it into a StringIO object and then parsing it back into an element tree:


lxe: write_utf8_parse_stringIO (S- T1     )  188.3298 msec/pass
ET : write_utf8_parse_stringIO (S- T1     ) 1143.8117 msec/pass
cET: write_utf8_parse_stringIO (S- T1     )  810.7611 msec/pass
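
The write-and-reparse round trip can be sketched roughly as follows. This is an illustration under assumptions, not the benchmark code itself: it uses io.BytesIO as the present-day in-memory stand-in for a StringIO object holding UTF-8 bytes, and the small test tree is invented.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Fallback with the same ElementTree-style API
    import xml.etree.ElementTree as etree

# Build a small, hypothetical tree to round-trip.
root = etree.Element("root")
for i in range(100):
    item = etree.SubElement(root, "item", id=str(i))
    item.text = "text %d" % i
tree = etree.ElementTree(root)

# Serialize the tree as UTF-8 into an in-memory buffer ...
buffer = io.BytesIO()
tree.write(buffer, encoding="utf-8")

# ... then rewind the buffer and parse it back into a new element tree.
buffer.seek(0)
reparsed_root = etree.parse(buffer).getroot()

print(reparsed_root.tag, len(reparsed_root))
```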

or with an intermediate unicode() conversion after serializing the tree to a UTF-8 string and before passing the result back into XML():


lxe: tostring_utf8_unicode_XML (U- T2     )  209.5674 msec/pass
ET : tostring_utf8_unicode_XML (U- T2     ) 1022.0318 msec/pass
cET: tostring_utf8_unicode_XML (U- T2     )  678.8596 msec/pass
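
A rough sketch of that serialize, decode and reparse sequence is shown below. It assumes a current Python, where decoding the byte string with .decode("utf-8") plays the role of the unicode() call, and it uses an invented test tree rather than the one from the benchmark.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback with the same ElementTree-style API
    import xml.etree.ElementTree as etree

# Build a small, hypothetical tree with some non-ASCII text in it.
root = etree.Element("root")
for i in range(100):
    item = etree.SubElement(root, "item", id=str(i))
    item.text = "caf\u00e9 %d" % i

# Step 1: serialize to a UTF-8 encoded byte string.
utf8_bytes = etree.tostring(root, encoding="utf-8")

# Step 2: decode the bytes into a unicode string (the unicode() step).
text = utf8_bytes.decode("utf-8")

# Step 3: parse the unicode string back into elements with XML().
reparsed = etree.XML(text)

print(reparsed.tag, len(reparsed), reparsed[0].text)
```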

Interesting, isn't it? The reason for this is that lxml's parser (libxml2) runs completely in C and does not instantiate any Python representations of elements while parsing. What this tells us is that lxml is extremely fast on I/O, much faster than (c)ElementTree, as long as you only touch a few elements in the tree. But that's the common case, right? You parse an XML file, touch a few nodes in it, maybe run some XPaths or XSLTs on them, change the ordering here, remove some nodes there and then serialize it. Well, that's what lxml excels at!
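
To make that workflow concrete, here is a small sketch of parsing a document, touching a few nodes and serializing the result. The catalog document and its element names are made up for illustration. The sketch sticks to the ElementTree path syntax of find() so that it also runs on the standard library fallback; with lxml itself, the xpath() method of an element could be used for full XPath expressions.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback with the same ElementTree-style API
    import xml.etree.ElementTree as etree

# A tiny, hypothetical input document.
source = (
    b"<catalog>"
    b"<book id='1'><title>A</title></book>"
    b"<book id='2'><title>B</title></book>"
    b"<book id='3'><title>C</title></book>"
    b"</catalog>"
)

# Parse the document.
root = etree.fromstring(source)

# Touch a single node: look up one book and change its title.
second = root.find("book[@id='2']")
second.find("title").text = "B (revised)"

# Remove a node there ...
root.remove(root.find("book[@id='3']"))

# ... change the ordering here: move the first book to the end.
first = root[0]
root.remove(first)
root.append(first)

# Serialize the modified tree again.
result = etree.tostring(root, encoding="utf-8")
print(result.decode("utf-8"))
```

Only three elements are ever looked up from Python here; the rest of the tree is just parsed and written out again.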

I should mention that I also wrote the benchmark script, but I did not write the numbers above myself! :-) The complete benchmark results are also available.

Any new users attracted?