XML parser performance in PyPy
I recently showed some benchmark results comparing XML parser performance in CPython 3.3 to that in PyPy 1.7. Here's an update for PyPy 1.9 that also includes the current state of the lxml port to that platform, parsing a 3.4MB document-style XML file.
CPython 3.3pre:
Initial Memory usage: 11332
xml.etree.ElementTree.parse done in 0.041 seconds
Memory usage: 21468 (+10136)
xml.etree.cElementTree.parse done in 0.041 seconds
Memory usage: 21464 (+10132)
xml.etree.cElementTree.XMLParser.feed(): 25317 nodes read in 0.041 seconds
Memory usage: 21736 (+10404)
lxml.etree.parse done in 0.032 seconds
Memory usage: 28324 (+16992)
drop_whitespace.parse done in 0.030 seconds
Memory usage: 25172 (+13840)
lxml.etree.XMLParser.feed(): 25317 nodes read in 0.037 seconds
Memory usage: 30608 (+19276)
minidom tree read in 0.492 seconds
Memory usage: 29852 (+18520)
PyPy without JIT warming:
Initial Memory usage: 42156
xml.etree.ElementTree.parse done in 0.452 seconds
Memory usage: 44084 (+1928)
xml.etree.cElementTree.parse done in 0.450 seconds
Memory usage: 44080 (+1924)
xml.etree.cElementTree.XMLParser.feed(): 25317 nodes read in 0.457 seconds
Memory usage: 47920 (+5768)
lxml.etree.parse done in 0.033 seconds
Memory usage: 58688 (+16536)
drop_whitespace.parse done in 0.033 seconds
Memory usage: 55536 (+13384)
lxml.etree.XMLParser.feed(): 25317 nodes read in 0.055 seconds
Memory usage: 64724 (+22564)
minidom tree read in 0.541 seconds
Memory usage: 59456 (+17296)
PyPy with JIT warmup:
Initial Memory usage: 646824
xml.etree.ElementTree.parse done in 0.341 seconds
xml.etree.cElementTree.parse done in 0.345 seconds
xml.etree.cElementTree.XMLParser.feed(): 25317 nodes read in 0.342 seconds
lxml.etree.parse done in 0.026 seconds
drop_whitespace.parse done in 0.025 seconds
lxml.etree.XMLParser.feed(): 25317 nodes read in 0.039 seconds
minidom tree read in 0.383 seconds
What you can quickly see is that lxml performs about equally well on both runtimes (actually slightly faster on PyPy once the JIT is warm) and beats the other libraries on PyPy by more than an order of magnitude. The absolute numbers are fairly low, though, way below a second for the 3.4MB file. It'll be interesting to see some more complete benchmarks at some point that also take some realistic processing into account.
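For readers who want to try something similar, here is a minimal sketch of the two kinds of measurement reported above: a one-shot parse() call and an incremental XMLParser.feed() run that counts the nodes read. It is not the benchmark script used for the numbers in this post. The helper names (build_sample, time_parse, time_feed), the synthetic document and the chunk size are assumptions made for illustration, and only the standard library ElementTree is used.

```python
import io
import time
import xml.etree.ElementTree as ET


def build_sample(n_items=1000):
    """Build a small synthetic XML document (a stand-in for the real 3.4MB file)."""
    root = ET.Element("root")
    for i in range(n_items):
        item = ET.SubElement(root, "item", id=str(i))
        item.text = "text %d" % i
    return ET.tostring(root)  # bytes


def time_parse(data):
    """Parse the whole document in one call; return (tree, elapsed seconds)."""
    start = time.perf_counter()
    tree = ET.parse(io.BytesIO(data))
    return tree, time.perf_counter() - start


def time_feed(data, chunk_size=32768):
    """Parse incrementally via XMLParser.feed(); return (node count, elapsed seconds)."""
    parser = ET.XMLParser()
    start = time.perf_counter()
    for pos in range(0, len(data), chunk_size):
        parser.feed(data[pos:pos + chunk_size])
    root = parser.close()
    elapsed = time.perf_counter() - start
    # Count every element in the tree, including the root itself.
    return sum(1 for _ in root.iter()), elapsed


if __name__ == "__main__":
    data = build_sample()
    _, seconds = time_parse(data)
    print("ElementTree.parse done in %.3f seconds" % seconds)
    nodes, seconds = time_feed(data)
    print("XMLParser.feed(): %d nodes read in %.3f seconds" % (nodes, seconds))
```

Swapping in lxml.etree for the import should work for these particular calls, since lxml follows the ElementTree API here, but that is untested in this sketch.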
Remark: cElementTree is just an alias for the plain Python ElementTree on PyPy and ElementTree uses cElementTree in the background in CPython 3.3, which is why both show the same performance. The memory sizes were measured in forked processes, whereas the PyPy JIT numbers were measured in a repeatedly running process in order to take advantage of the JIT compiler. Note the substantially higher memory load of PyPy here.
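The forked measurement mentioned above can be sketched roughly as follows. This is an illustration and not the code behind the numbers in this post: measure_in_fork and peak_rss are hypothetical helper names, the sketch is Unix-only because it relies on os.fork(), and it reports the growth in peak resident set size via resource.getrusage(), which may differ from the memory figure the original benchmark printed.

```python
import io
import os
import resource
import xml.etree.ElementTree as ET


def peak_rss():
    """Peak resident set size of this process (kilobytes on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def measure_in_fork(func):
    """Run func() in a forked child and return the child's growth in peak RSS."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: measure, report through the pipe, then exit
        # without ever returning into the parent's code path.
        status = 1
        try:
            os.close(read_fd)
            before = peak_rss()
            result = func()  # keep a reference so the tree stays alive
            delta = peak_rss() - before
            os.write(write_fd, str(delta).encode("ascii"))
            status = 0
        finally:
            os._exit(status)
    # Parent process: read the child's report and reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        payload = pipe.read()
    os.waitpid(pid, 0)
    if not payload:
        raise RuntimeError("child process failed before reporting")
    return int(payload.decode("ascii"))


if __name__ == "__main__":
    data = b"<root>" + b"<item>text</item>" * 10000 + b"</root>"
    growth = measure_in_fork(lambda: ET.parse(io.BytesIO(data)))
    print("Memory usage growth in forked child: +%d" % growth)
```

Because every measurement runs in a fresh child, one parser's allocations cannot inflate the figure for the next. The flip side is that a JIT never gets warm in such a short-lived process, which is why the timing runs were done separately in a long-running one.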
Update: I originally reported the forked memory size with the non-forked performance for PyPy. The above now shows both separately. A more real-world comparison would likely yield an even higher memory usage on PyPy than the numbers above, which were mostly meant to give an idea of the memory usage of the in-memory tree (i.e. the data impact).
June 26th, 2012 at 05:31
Woo! I'm very happy pypy is doing well with xml, and that lxml is on pypy now! Rejoice.
Can you publish results without warming up the jit? This is a more reasonable case for some of my tools: command-line tools that parse xml, do something, and quit the process. I think your tests would be a good guide to performance for long-running python web servers that parse xml, though.
June 26th, 2012 at 10:33
I updated the numbers to more correctly reflect the PyPy behaviour with and without JIT warming. Note the huge memory usage of PyPy after JIT warming.
June 26th, 2012 at 22:36
Hi Stefan
I think the pypy memory usage is not really acceptable. I would be happy to look into how to improve the situation (maybe you cannot, but who knows). It's definitely not just "jitted code".
Feel free to post a request to pypy-dev or even a bug (but no promises when I'll have time to look at it)
Cheers,
fijal
June 28th, 2012 at 17:36
Hi Maciej, I agree, the memory usage might hint at a leak - couldn't look into it yet. If you want to give it a try, I put the code and data files here:
http://lxml.de/etbench.tar.bz2
For lxml, you'll need the latest github source versions from both lxml and Cython as well as the dev packages of libxml2 and libxslt.
July 17th, 2012 at 11:58
For lxml, you'll need the latest github source versions from both lxml and Cython as well as the dev packages of libxml2 and libxslt.
0.16 (released 2012-04-21) will not work?
July 17th, 2012 at 12:04
Right, PyPy support is a feature that is still under development.