XML parser performance in Python 3.3

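For a recent bug ticket about MiniDOM, I collected some performance numbers that compare it to ElementTree, cElementTree and lxml.etree under a recent CPython 3.3 developer build. Everything was properly compiled and optimised for 64-bit Linux. I used os.fork() and the resource module to get a clean measure of the memory used by each in-memory tree. In the output below, the first "Memory usage" line of each run is the baseline, and the line after each parser's timing shows the usage after that parse, with the increase over the baseline in parentheses. Judging by the totals, the values appear to be in KB. Here are the numbers: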

Parsing hamlet.xml in English, 274KB:

Memory usage: 7284

xml.etree.ElementTree.parse done in 0.104 seconds

Memory usage: 14240 (+6956)

xml.etree.cElementTree.parse done in 0.022 seconds

Memory usage: 9736 (+2452)

lxml.etree.parse done in 0.014 seconds

Memory usage: 11028 (+3744)

minidom tree read in 0.152 seconds

Memory usage: 30360 (+23076)

Parsing the Old Testament in English (ot.xml, 3.4MB) into memory:

Memory usage: 20444

xml.etree.ElementTree.parse done in 0.385 seconds

Memory usage: 46088 (+25644)

xml.etree.cElementTree.parse done in 0.056 seconds

Memory usage: 32628 (+12184)

lxml.etree.parse done in 0.041 seconds

Memory usage: 37500 (+17056)

minidom tree read in 0.672 seconds

Memory usage: 110428 (+89984)

A 25MB XML file with Slavic Unicode text content:

Memory usage: 57368

xml.etree.ElementTree.parse done in 3.274 seconds

Memory usage: 223720 (+166352)

xml.etree.cElementTree.parse done in 0.459 seconds

Memory usage: 154012 (+96644)

lxml.etree.parse done in 0.454 seconds

Memory usage: 135720 (+78352)

minidom tree read in 6.193 seconds

Memory usage: 604860 (+547492)

And a contrived 4.5MB XML file with a lot more structure than data and no whitespace at all:

Memory usage: 11600

xml.etree.ElementTree.parse done in 3.374 seconds

Memory usage: 203420 (+191820)

xml.etree.cElementTree.parse done in 0.192 seconds

Memory usage: 36444 (+24844)

lxml.etree.parse done in 0.131 seconds

Memory usage: 62648 (+51048)

minidom tree read in 5.935 seconds

Memory usage: 527684 (+516084)

I also took the last file and pretty printed it, thus adding lots of indentation whitespace that increased the file size to 6.2MB. Here are the numbers for that:

Memory usage: 13308

xml.etree.ElementTree.parse done in 4.178 seconds

Memory usage: 222088 (+208780)

xml.etree.cElementTree.parse done in 0.478 seconds

Memory usage: 103056 (+89748)

lxml.etree.parse done in 0.199 seconds

Memory usage: 101860 (+88552)

minidom tree read in 8.705 seconds

Memory usage: 810964 (+797656)

Yes, about 2MB of indentation whitespace accounts for almost 300MB more memory in MiniDOM.
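The post does not say why whitespace is so expensive, but one likely contributor is easy to demonstrate: minidom represents every run of character data between tags as a separate Text node object, while ElementTree keeps it as plain strings in the .text and .tail attributes of existing elements. The following is a minimal sketch with made-up sample documents (COMPACT and PRETTY are hypothetical, not the benchmark files) that counts the objects in each tree.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical sample documents: same structure, with and without indentation.
COMPACT = "<root><a><b/></a><a><b/></a></root>"
PRETTY = "<root>\n  <a>\n    <b/>\n  </a>\n  <a>\n    <b/>\n  </a>\n</root>"


def count_minidom_nodes(xml_text):
    """Count all DOM nodes below and including the document element."""
    doc = minidom.parseString(xml_text)

    def walk(node):
        # Element nodes and whitespace-only Text nodes both count as one node.
        return 1 + sum(walk(child) for child in node.childNodes)

    return walk(doc.documentElement)


def count_etree_elements(xml_text):
    """Count Element objects; whitespace lives in .text/.tail strings instead."""
    return sum(1 for _ in ET.fromstring(xml_text).iter())


for label, text in (("compact", COMPACT), ("pretty", PRETTY)):
    print(label, count_minidom_nodes(text), count_etree_elements(text))
```

The element count in ElementTree stays the same for both inputs, whereas the minidom node count more than doubles once indentation is added. This only illustrates the object count, not the per-object memory cost measured above.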

Here are the graphs:

XML tree memory usage in Python 3.3 for lxml, ElementTree, cElementTree and MiniDOM

XML parser performance in Python 3.3 for lxml, ElementTree, cElementTree and MiniDOM

I think it is pretty clear that minidom has basically left the scale, whereas cElementTree and lxml.etree are pretty close to each other. lxml tends to be a tad faster, and cElementTree tends to use a little less memory.
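For anyone who wants to reproduce this kind of measurement, here is a rough sketch of the setup described at the top: each parser runs in its own forked child process, so memory left over from a previous parse cannot distort the next reading. The original benchmark script is not shown in this post, so the function names (peak_rss, measure) and the choice of ru_maxrss as the memory metric are my assumptions, not the actual code behind the numbers.

```python
import os
import resource
import time
import xml.etree.ElementTree as ET
from xml.dom import minidom


def peak_rss():
    # Assumption: use the peak resident set size as "memory usage".
    # Linux reports ru_maxrss in kilobytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def measure(parse, path):
    """Run parse(path) in a forked child; return (seconds, rss_before, rss_after)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: parse, report through the pipe, then exit immediately.
        status = 1
        try:
            os.close(read_fd)
            before = peak_rss()
            start = time.time()
            tree = parse(path)  # keep a reference so the tree stays alive
            elapsed = time.time() - start
            after = peak_rss()
            report = "%r %d %d" % (elapsed, before, after)
            os.write(write_fd, report.encode("ascii"))
            status = 0
        finally:
            os._exit(status)
    # Parent process: read the child's report and reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        report = pipe.read()
    os.waitpid(pid, 0)
    if not report:
        raise RuntimeError("child process failed while parsing %s" % path)
    elapsed, before, after = report.split()
    return float(elapsed), int(before), int(after)
```

Calling measure(ET.parse, "hamlet.xml") and measure(minidom.parse, "hamlet.xml") would then give one timing and one before/after memory pair per parser. lxml.etree.parse can be passed in the same way if lxml is installed.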