I removed lxml from Flattr. The returns were simply too low to matter. The Flattr revenue for the project over the last few years was a fairly constant 6 EUR per month. Minus taxes. And Flattr had already kept its 10% share, which is what bugs me most, I guess. That leaves PayPal as the current (and certainly better) alternative if you want to support the development behind lxml.
Archive for the ‘lxml’ Category
I recently showed some benchmark results comparing XML parser performance in CPython 3.3 to that in PyPy 1.7. Here’s an update for PyPy 1.9 that also includes the current state of the lxml port to that platform, parsing a 3.4MB document-style XML file.
CPython 3.3:

Initial Memory usage: 11332
xml.etree.ElementTree.parse done in 0.041 seconds
Memory usage: 21468 (+10136)
xml.etree.cElementTree.parse done in 0.041 seconds
Memory usage: 21464 (+10132)
xml.etree.cElementTree.XMLParser.feed(): 25317 nodes read in 0.041 seconds
Memory usage: 21736 (+10404)
lxml.etree.parse done in 0.032 seconds
Memory usage: 28324 (+16992)
drop_whitespace.parse done in 0.030 seconds
Memory usage: 25172 (+13840)
lxml.etree.XMLParser.feed(): 25317 nodes read in 0.037 seconds
Memory usage: 30608 (+19276)
minidom tree read in 0.492 seconds
Memory usage: 29852 (+18520)
PyPy without JIT warming:
Initial Memory usage: 42156
xml.etree.ElementTree.parse done in 0.452 seconds
Memory usage: 44084 (+1928)
xml.etree.cElementTree.parse done in 0.450 seconds
Memory usage: 44080 (+1924)
xml.etree.cElementTree.XMLParser.feed(): 25317 nodes read in 0.457 seconds
Memory usage: 47920 (+5768)
lxml.etree.parse done in 0.033 seconds
Memory usage: 58688 (+16536)
drop_whitespace.parse done in 0.033 seconds
Memory usage: 55536 (+13384)
lxml.etree.XMLParser.feed(): 25317 nodes read in 0.055 seconds
Memory usage: 64724 (+22564)
minidom tree read in 0.541 seconds
Memory usage: 59456 (+17296)
PyPy with JIT warmup:
Initial Memory usage: 646824
xml.etree.ElementTree.parse done in 0.341 seconds
xml.etree.cElementTree.parse done in 0.345 seconds
xml.etree.cElementTree.XMLParser.feed(): 25317 nodes read in 0.342 seconds
lxml.etree.parse done in 0.026 seconds
drop_whitespace.parse done in 0.025 seconds
lxml.etree.XMLParser.feed(): 25317 nodes read in 0.039 seconds
minidom tree read in 0.383 seconds
What you can quickly see is that lxml performs equally well on both (actually slightly faster on PyPy) and beats the other libraries on PyPy by more than an order of magnitude. The absolute numbers are fairly low, though, way below a second for the 3.4MB file. It’ll be interesting to see some more complete benchmarks at some point that also take some realistic processing into account.
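The XMLParser.feed() rows above refer to the incremental parsing interface. As a minimal sketch, using the stdlib parser here for self-containment (lxml.etree.XMLParser offers the same feed()/close() methods):

```python
import xml.etree.ElementTree as ET

def parse_in_chunks(chunks):
    # Feed the document to the parser piece by piece, e.g. as it
    # arrives from a socket; close() finishes parsing and returns
    # the root element.
    parser = ET.XMLParser()
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()

root = parse_in_chunks([b"<doc><a/>", b"<b/></doc>"])
```

The benchmark's feed() figures count the nodes of the tree built this way, so they measure the same work as parse(), just through the incremental API.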
Remark: cElementTree is just an alias for the plain Python ElementTree on PyPy and ElementTree uses cElementTree in the background in CPython 3.3, which is why both show the same performance. The memory sizes were measured in forked processes, whereas the PyPy JIT numbers were measured in a repeatedly running process in order to take advantage of the JIT compiler. Note the substantially higher memory load of PyPy here.
Update: I originally reported the forked memory size with the non-forked performance for PyPy. The above now shows both separately. A more real-world comparison would likely yield an even higher memory usage on PyPy than the numbers above, which were mostly meant to give an idea of the memory usage of the in-memory tree (i.e. the data impact).
In a recent article, I compared the performance of MiniDOM and the three ElementTree implementations ElementTree, cElementTree and lxml.etree for parsing XML in CPython 3.3. Given the utterly poor performance of the pure Python library MiniDOM in this competition, I decided to give it another chance and tried the same in PyPy 1.7. Because lxml.etree and cElementTree are not available on this platform, I only ran the tests with plain ElementTree and MiniDOM. I also report the original benchmark results for CPython below for comparison.
While I also provide numbers regarding the memory usage of each library in this comparison, they are not directly comparable between PyPy and CPython because of the different memory management of both platforms and because the overall memory that PyPy uses right from the start is much larger than for CPython. So the relative increase in memory may or may not be an accurate way to tell what each runtime does with the memory. However, it appears that PyPy manages to kill at least the severe memory problems of MiniDOM, as the total amount of memory used for the larger files is several times smaller than that used by CPython.
So, what do I take from this benchmark? If you have legacy MiniDOM code lying around, you want PyPy to run it. It exhibits several times better performance in terms of memory and runtime. It also performs substantially better for ElementTree than the plain Python ElementTree in CPython.
However, for fast XML processing in general, the better performance of PyPy even for plain Python ElementTree is not really all that interesting, because it is still several times slower than cElementTree or lxml.etree in CPython. That means that you will often be able to process multiple files in CPython in the time that you need for just one in PyPy, even if your actual application code that does the processing manages to get a substantial JIT speed-up in PyPy. Even worse, the GIL in PyPy will keep your code from getting a parallel speedup that you usually get with multi-threaded processing in lxml and CPython, e.g. in a web server setting.
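For illustration, such a multi-threaded setup might look like the sketch below. lxml releases the GIL while libxml2 does the C-level parsing work, so on CPython the worker threads can actually run in parallel; the stdlib fallback used here when lxml is unavailable parses serially under the GIL.

```python
from concurrent.futures import ThreadPoolExecutor

try:
    from lxml import etree  # releases the GIL during C-level parsing
except ImportError:
    import xml.etree.ElementTree as etree  # fallback: stays serial under the GIL

def parse_all(paths, workers=4):
    # Parse several files concurrently; with lxml on CPython the
    # parser threads make progress in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(etree.parse, paths))
```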
So, as always, the decision depends on what your actual application does and which library it uses. Do your own benchmarks.
For a recent bug ticket about MiniDOM, I collected some performance numbers that compare it to ElementTree, cElementTree and lxml.etree under a recent CPython 3.3 developer build, all properly compiled and optimised for Linux 64bit, using os.fork() and the resource module to get a clean measure of the memory usage for the in-memory tree. Here are the numbers:
Parsing hamlet.xml in English, 274KB:
Memory usage: 7284
xml.etree.ElementTree.parse done in 0.104 seconds
Memory usage: 14240 (+6956)
xml.etree.cElementTree.parse done in 0.022 seconds
Memory usage: 9736 (+2452)
lxml.etree.parse done in 0.014 seconds
Memory usage: 11028 (+3744)
minidom tree read in 0.152 seconds
Memory usage: 30360 (+23076)
Parsing the old testament in English (ot.xml, 3.4MB) into memory:
Memory usage: 20444
xml.etree.ElementTree.parse done in 0.385 seconds
Memory usage: 46088 (+25644)
xml.etree.cElementTree.parse done in 0.056 seconds
Memory usage: 32628 (+12184)
lxml.etree.parse done in 0.041 seconds
Memory usage: 37500 (+17056)
minidom tree read in 0.672 seconds
Memory usage: 110428 (+89984)
A 25MB XML file with Slavic Unicode text content:
Memory usage: 57368
xml.etree.ElementTree.parse done in 3.274 seconds
Memory usage: 223720 (+166352)
xml.etree.cElementTree.parse done in 0.459 seconds
Memory usage: 154012 (+96644)
lxml.etree.parse done in 0.454 seconds
Memory usage: 135720 (+78352)
minidom tree read in 6.193 seconds
Memory usage: 604860 (+547492)
And a contrived 4.5MB XML file with a lot more structure than data and no whitespace at all:
Memory usage: 11600
xml.etree.ElementTree.parse done in 3.374 seconds
Memory usage: 203420 (+191820)
xml.etree.cElementTree.parse done in 0.192 seconds
Memory usage: 36444 (+24844)
lxml.etree.parse done in 0.131 seconds
Memory usage: 62648 (+51048)
minidom tree read in 5.935 seconds
Memory usage: 527684 (+516084)
I also took the last file and pretty printed it, thus adding lots of indentation whitespace that increased the file size to 6.2MB. Here are the numbers for that:
Memory usage: 13308
xml.etree.ElementTree.parse done in 4.178 seconds
Memory usage: 222088 (+208780)
xml.etree.cElementTree.parse done in 0.478 seconds
Memory usage: 103056 (+89748)
lxml.etree.parse done in 0.199 seconds
Memory usage: 101860 (+88552)
minidom tree read in 8.705 seconds
Memory usage: 810964 (+797656)
Yes, 2MB of whitespace accounts for almost 300MB of additional memory in MiniDOM.
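The drop_whitespace entries in the PyPy benchmarks above came from an lxml parser configured to discard such ignorable whitespace (lxml's XMLParser(remove_blank_text=True) option). The plain ElementTree API has no such switch, but stripping whitespace-only text nodes by hand is a short recursive pass; a sketch:

```python
import xml.etree.ElementTree as ET

def drop_whitespace(elem):
    # Remove text/tail content that consists of whitespace only,
    # so indentation does not end up as text nodes in the tree.
    if elem.text and not elem.text.strip():
        elem.text = None
    for child in elem:
        if child.tail and not child.tail.strip():
            child.tail = None
        drop_whitespace(child)
    return elem

root = drop_whitespace(ET.fromstring("<a>\n  <b>x</b>\n</a>"))
```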
Here are the graphs:
I think it is pretty clear that minidom has basically left the scale, whereas cElementTree and lxml.etree are pretty close to each other. lxml tends to be a tad faster, and cElementTree tends to use a little less memory.
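The fork-based measurement mentioned above can be sketched roughly as follows. This is POSIX-only, the function name is mine, and on Linux ru_maxrss is reported in KB:

```python
import os
import resource
import xml.etree.ElementTree as ET

def parse_memory_kb(xml_bytes):
    """Parse in a forked child so the parent's heap stays clean, and
    report the child's peak RSS before/after building the tree."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: parse, report over the pipe, exit immediately
        os.close(read_fd)
        before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        tree = ET.fromstring(xml_bytes)  # keep a reference while measuring
        after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        os.write(write_fd, b"%d %d" % (before, after))
        os._exit(0)
    os.close(write_fd)
    data = os.read(read_fd, 64)
    os.close(read_fd)
    os.waitpid(pid, 0)
    before, after = map(int, data.split())
    return before, after
```

Because the child exits right after reporting, each parser gets measured in a fresh process and the numbers do not pollute each other.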
I keep running into code like this:
tree = lxml.etree.parse(StringIO(bytes_data))
The docs are actually very clear about this. There is a function called etree.fromstring(data) that is meant to parse from a string. It is the same as in ElementTree. Obviously, no one reads documentation. But it’s there, really.
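The correct calls, shown here with the stdlib ElementTree for a self-contained example (lxml.etree has the same fromstring()/parse() interface), look like this:

```python
import io
import xml.etree.ElementTree as ET  # same calls exist in lxml.etree

data = b"<root><child/></root>"

# Parsing from an in-memory string or bytes object: use fromstring(),
# no file-like wrapper needed.
root = ET.fromstring(data)

# parse() is meant for files and file-like objects; if you really do
# have bytes in memory, wrap them in BytesIO (not StringIO).
tree = ET.parse(io.BytesIO(data))
```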
It’s worth mentioning that the Python 3 edition of “Dive into Python” has a lot of rewritten and updated content. The thing that I like best about it is that it finally has an up-to-date chapter on XML that is entirely based on ElementTree and lxml.etree, the major XML libraries for Python. So, even for those who want to continue using Python 2 for a while, it’s worth reading the new edition instead of the outdated Python 2 edition.