XML parser performance in PyPy

I recently showed some benchmark results comparing the XML parser performance in CPython 3.3 to that in PyPy 1.7. Here’s an update for PyPy 1.9 that also includes the current state of the lxml port to that platform, parsing a 3.4MB document-style XML file.

CPython 3.3pre:

Initial Memory usage: 11332
xml.etree.ElementTree.parse done in 0.041 seconds
Memory usage: 21468 (+10136)
xml.etree.cElementTree.parse done in 0.041 seconds
Memory usage: 21464 (+10132)
xml.etree.cElementTree.XMLParser.feed(): 25317 nodes read in 0.041 seconds
Memory usage: 21736 (+10404)
lxml.etree.parse done in 0.032 seconds
Memory usage: 28324 (+16992)
drop_whitespace.parse done in 0.030 seconds
Memory usage: 25172 (+13840)
lxml.etree.XMLParser.feed(): 25317 nodes read in 0.037 seconds
Memory usage: 30608 (+19276)
minidom tree read in 0.492 seconds
Memory usage: 29852 (+18520)

PyPy without JIT warming:

Initial Memory usage: 42156
xml.etree.ElementTree.parse done in 0.452 seconds
Memory usage: 44084 (+1928)
xml.etree.cElementTree.parse done in 0.450 seconds
Memory usage: 44080 (+1924)
xml.etree.cElementTree.XMLParser.feed(): 25317 nodes read in 0.457 seconds
Memory usage: 47920 (+5768)
lxml.etree.parse done in 0.033 seconds
Memory usage: 58688 (+16536)
drop_whitespace.parse done in 0.033 seconds
Memory usage: 55536 (+13384)
lxml.etree.XMLParser.feed(): 25317 nodes read in 0.055 seconds
Memory usage: 64724 (+22564)
minidom tree read in 0.541 seconds
Memory usage: 59456 (+17296)

PyPy with JIT warmup:

Initial Memory usage: 646824
xml.etree.ElementTree.parse done in 0.341 seconds
xml.etree.cElementTree.parse done in 0.345 seconds
xml.etree.cElementTree.XMLParser.feed(): 25317 nodes read in 0.342 seconds
lxml.etree.parse done in 0.026 seconds
drop_whitespace.parse done in 0.025 seconds
lxml.etree.XMLParser.feed(): 25317 nodes read in 0.039 seconds
minidom tree read in 0.383 seconds

What you can quickly see is that lxml performs equally well on both (actually slightly faster on PyPy) and beats the other libraries on PyPy by more than an order of magnitude. The absolute numbers are fairly low, though, way below a second for the 3.4MB file. It’ll be interesting to see some more complete benchmarks at some point that also take some realistic processing into account.
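For anyone who wants to reproduce this kind of timing on their own data, here is a minimal sketch of the two measurements that appear in the output above: a one-shot parse() and an incremental XMLParser.feed() loop with a node count. It is not the actual benchmark script. The helper names (make_document, time_parse, time_feed, count_nodes), the chunk size and the small synthetic document are assumptions made for illustration, and only the standard library ElementTree is used.

```python
import io
import time
import xml.etree.ElementTree as ET


def make_document(n_items=1000):
    """Build a small synthetic XML document (a stand-in, not the 3.4MB file)."""
    items = "".join(
        "<item id='%d'>text %d</item>\n" % (i, i) for i in range(n_items)
    )
    return ("<root>\n%s</root>" % items).encode("utf-8")


def time_parse(data):
    """Time a one-shot parse() of the whole document."""
    start = time.perf_counter()
    tree = ET.parse(io.BytesIO(data))
    elapsed = time.perf_counter() - start
    return tree.getroot(), elapsed


def time_feed(data, chunk_size=32768):
    """Time incremental parsing through XMLParser.feed() (chunk size is arbitrary)."""
    parser = ET.XMLParser()
    start = time.perf_counter()
    for pos in range(0, len(data), chunk_size):
        parser.feed(data[pos:pos + chunk_size])
    root = parser.close()
    elapsed = time.perf_counter() - start
    return root, elapsed


def count_nodes(root):
    """Count all elements in the tree, including the root element."""
    return sum(1 for _ in root.iter())


if __name__ == "__main__":
    data = make_document()
    root, elapsed = time_parse(data)
    print("ElementTree.parse done in %.3f seconds" % elapsed)
    root, elapsed = time_feed(data)
    print("XMLParser.feed(): %d nodes read in %.3f seconds"
          % (count_nodes(root), elapsed))
```

Swapping the import for lxml.etree should give the corresponding lxml numbers, since lxml offers the same parse() and feed() interface, but I have not checked this sketch against it.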

Remark: cElementTree is just an alias for the plain Python ElementTree on PyPy and ElementTree uses cElementTree in the background in CPython 3.3, which is why both show the same performance. The memory sizes were measured in forked processes, whereas the PyPy JIT numbers were measured in a repeatedly running process in order to take advantage of the JIT compiler. Note the substantially higher memory load of PyPy here.
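To illustrate what measuring "in forked processes" can look like, here is a rough sketch that runs one parse in a forked child and reports the child's peak memory before and after. This is an assumption about the general approach, not the code behind the numbers above: the helper names (peak_rss, measure_in_fork) are made up, it uses the peak resident set size from resource.getrusage() rather than whatever the benchmark actually read, and it only works on Unix-like systems.

```python
import io
import os
import resource
import xml.etree.ElementTree as ET


def peak_rss():
    """Peak resident set size of this process (kilobytes on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def measure_in_fork(func):
    """Run func() in a forked child and return its (peak before, peak after)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: measure, report through the pipe, then exit without cleanup.
        os.close(read_fd)
        status = 1
        try:
            before = peak_rss()
            result = func()  # keep a reference so the tree stays alive
            after = peak_rss()
            os.write(write_fd, ("%d %d" % (before, after)).encode("ascii"))
            status = 0
        finally:
            os._exit(status)
    # Parent: read the child's report and wait for it to finish.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        payload = pipe.read()
    os.waitpid(pid, 0)
    if not payload:
        raise RuntimeError("child process failed before reporting")
    before, after = (int(value) for value in payload.split())
    return before, after


if __name__ == "__main__":
    data = b"<root>" + b"<item>text</item>" * 20000 + b"</root>"
    before, after = measure_in_fork(lambda: ET.parse(io.BytesIO(data)))
    print("Initial Memory usage: %d" % before)
    print("Memory usage: %d (+%d)" % (after, after - before))
```

Forking a fresh child per parser keeps one library's tree from inflating the numbers of the next, which is also why this setup cannot benefit from a warmed-up JIT.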

Update: I originally reported the forked memory size with the non-forked performance for PyPy. The above now shows both separately. A more real-world comparison would likely yield an even higher memory usage on PyPy than the numbers above, which were mostly meant to give an idea of the memory usage of the in-memory tree (i.e. the data impact).

6 Responses to “XML parser performance in PyPy”

  1. Rene Dudfield Says:

    Woo! I’m very happy pypy is doing well with xml, and that lxml is on pypy now! Rejoice.

    Can you publish results without warming up the JIT? That is a more reasonable case for some of my tools: command-line tools that parse XML, do something, and quit the process. I think your tests would be a good guide to performance for long-running Python web servers that parse XML, though.

  2. Stefan Behnel Says:

    I updated the numbers to more correctly reflect the PyPy behaviour with and without JIT warming. Note the huge memory usage of PyPy after JIT warming.

  3. fijal Says:

    Hi Stefan

    I think the pypy memory usage is not really acceptable. I would be happy to look at how to improve the situation (maybe you cannot, but who knows). It’s definitely not just “jitted code”.

    Feel free to post a request to pypy-dev or even a bug (but no promises when I’ll have time to look at it)

    Cheers,
    fijal

  4. Stefan Behnel Says:

    Hi Maciej, I agree, the memory usage might hint at a leak - couldn’t look into it yet. If you want to give it a try, I put the code and data files here:

    http://lxml.de/etbench.tar.bz2

    For lxml, you’ll need the latest github source versions from both lxml and Cython as well as the dev packages of libxml2 and libxslt.

  5. Valery Says:

    “For lxml, you’ll need the latest github source versions from both lxml and Cython as well as the dev packages of libxml2 and libxslt.”

    0.16 (released 2012-04-21) will not work?

  6. Stefan Behnel Says:

    Right, PyPy support is a feature that is still under development.
