XML parser performance in Python 3.3
For a recent bug ticket about MiniDOM, I collected some performance numbers that compare it to ElementTree, cElementTree and lxml.etree under a recent CPython 3.3 developer build, all properly compiled and optimised for Linux 64bit, using os.fork() and the resource module to get a clean measure of the memory usage for the in-memory tree. Here are the numbers:
Parsing hamlet.xml in English, 274KB:
Memory usage: 7284
xml.etree.ElementTree.parse done in 0.104 seconds
Memory usage: 14240 (+6956)
xml.etree.cElementTree.parse done in 0.022 seconds
Memory usage: 9736 (+2452)
lxml.etree.parse done in 0.014 seconds
Memory usage: 11028 (+3744)
minidom tree read in 0.152 seconds
Memory usage: 30360 (+23076)
Parsing the old testament in English (ot.xml, 3.4MB) into memory:
Memory usage: 20444 xml.etree.ElementTree.parse done in 0.385 seconds Memory usage: 46088 (+25644) xml.etree.cElementTree.parse done in 0.056 seconds Memory usage: 32628 (+12184) lxml.etree.parse done in 0.041 seconds Memory usage: 37500 (+17056) minidom tree read in 0.672 seconds Memory usage: 110428 (+89984)
A 25MB XML file with Slavic Unicode text content:
Memory usage: 57368 xml.etree.ElementTree.parse done in 3.274 seconds Memory usage: 223720 (+166352) xml.etree.cElementTree.parse done in 0.459 seconds Memory usage: 154012 (+96644) lxml.etree.parse done in 0.454 seconds Memory usage: 135720 (+78352) minidom tree read in 6.193 seconds Memory usage: 604860 (+547492)
And a contrived 4.5MB XML file with a lot more structure than data and no whitespace at all:
Memory usage: 11600 xml.etree.ElementTree.parse done in 3.374 seconds Memory usage: 203420 (+191820) xml.etree.cElementTree.parse done in 0.192 seconds Memory usage: 36444 (+24844) lxml.etree.parse done in 0.131 seconds Memory usage: 62648 (+51048) minidom tree read in 5.935 seconds Memory usage: 527684 (+516084)
I also took the last file and pretty printed it, thus adding lots of indentation whitespace that increased the file size to 6.2MB. Here are the numbers for that:
Memory usage: 13308 xml.etree.ElementTree.parse done in 4.178 seconds Memory usage: 222088 (+208780) xml.etree.cElementTree.parse done in 0.478 seconds Memory usage: 103056 (+89748) lxml.etree.parse done in 0.199 seconds Memory usage: 101860 (+88552) minidom tree read in 8.705 seconds Memory usage: 810964 (+797656)
Yes, 2MB of whitespace account for almost 300MB more memory in MiniDOM.
Here are the graphs:
I think it is pretty clear that minidom has basically left the scale, whereas cElementTree and lxml.etree are pretty close to each other. lxml tends to be a tad faster, and cElementTree tends to use a little less memory.