Faster XML stream processing in Python

It's been a while since I last wrote something about processing XML, specifically about finding something in XML. Recently, I read a blog post by Eli Bendersky about faster XML processing in Go, in which he compared it to iterparse() in Python's ElementTree and lxml. Basically, all he said about lxml is that it performs more or less like ElementTree, so he concentrated on the latter (and on C and Go). That's not wrong, but it also doesn't help much. lxml provides much finer-grained tools for processing XML, so here's a reply.

I didn't have the exact same XML input file that Eli used, but I used the same (deterministic, IIUC) tool for generating one, running xmlgen -f2 -o bench.xml. That resulted in a 223MiB XML file of the same structure that Eli used, thus probably almost the same as his.

Let's start with the original implementation:

import sys
import xml.etree.ElementTree as ET

count = 0
for event, elem in ET.iterparse(sys.argv[1], events=("end",)):
    if event == "end":
        if elem.tag == 'location' and elem.text and 'Africa' in elem.text:
            count += 1
        elem.clear()

print('count =', count)

The code parses the XML file, searches for location tags, and counts those that contain the word Africa.

Running this under time with ElementTree in CPython 3.6.8 (Ubuntu 18.04) shows:

count = 92
4.79user 0.08system 0:04.88elapsed 99%CPU (0avgtext+0avgdata 14828maxresident)k

We can switch to lxml (4.3.4) by changing a single line: the import.
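Everything else in the script stays exactly as above; only the import changes:

import lxml.etree as ET   # instead of: import xml.etree.ElementTree as ET

Running the same script again shows: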

count = 92
4.58user 0.08system 0:04.67elapsed 99%CPU (0avgtext+0avgdata 23060maxresident)k

You can see that it uses somewhat more memory overall (~23MiB), but runs just a little faster, not even 5%. Both are roughly comparable.

For comparison, the baseline memory usage of doing nothing but importing ElementTree versus lxml is:

$ time python3.6 -c 'import xml.etree.ElementTree'
0.08user 0.01system 0:00.09elapsed 96%CPU (0avgtext+0avgdata 9892maxresident)k
0inputs+0outputs (0major+1202minor)pagefaults 0swaps

$ time python3.6 -c 'import lxml.etree'
0.07user 0.01system 0:00.09elapsed 96%CPU (0avgtext+0avgdata 15264maxresident)k
0inputs+0outputs (0major+1742minor)pagefaults 0swaps

Back to our task at hand. As you may know, global variables in Python are more costly than local variables, and as you certainly know, code at module level is hard to test. So, let's start with something obvious that we would always do in Python: write a function.

import sys
import lxml.etree as ET

def count_locations(file_path, match):
    count = 0
    for event, elem in ET.iterparse(file_path, events=("end",)):
        if event == "end":
            if elem.tag == 'location' and elem.text and match in elem.text:
                count += 1
            elem.clear()
    return count

count = count_locations(sys.argv[1], 'Africa')
print('count =', count)
count = 92
4.39user 0.06system 0:04.46elapsed 99%CPU (0avgtext+0avgdata 23264maxresident)k

Another thing we can see is that we're explicitly asking for only end events, and then checking whether the event we got is an end event. That's redundant, so we can drop the check and unpack just the element.
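Here's a sketch of the same function without the redundant check:

def count_locations(file_path, match):
    count = 0
    # Only "end" events were requested, so every event we see is an end event.
    for _, elem in ET.iterparse(file_path, events=("end",)):
        if elem.tag == 'location' and elem.text and match in elem.text:
            count += 1
        elem.clear()
    return count

Running this yields: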

count = 92
4.24user 0.06system 0:04.31elapsed 99%CPU (0avgtext+0avgdata 23264maxresident)k

Ok, another tiny improvement. We gained a couple of percent, though hardly worth mentioning. Now let's see what lxml's API can do for us.

First, let's look at the structure of the XML file. Nicely, the xmlgen tool has a mode for generating an indented version of the same file, which makes it easier to investigate. Here's the start of the indented version of the file (note that we are always parsing the smaller version of the file, which contains newlines but no indentation):

<?xml version="1.0" standalone="yes"?>
<site>
  <regions>
    <africa>
      <item id="item0">
        <location>United States</location>
        <quantity>1</quantity>
        <name>duteous nine eighteen </name>
        <payment>Creditcard</payment>
        <description>
          <parlist>
            <listitem>
              <text>

The root tag is site, which contains the regions (apparently one element per continent), which in turn hold a series of item elements containing the location. In a real data file, it would probably be enough to only look at the africa region when looking for Africa as a location, but a) this is (pseudo-)randomly generated data, b) even "real" data isn't always clean, and c) a location of "Africa" actually seems weird when the region is already africa.

Anyway. Let's assume we have to look through all regions to get a correct count. But given the structure of the item tag, we can simply select the location elements and do the following in lxml:

def count_locations(file_path, match):
    count = 0
    for event, elem in ET.iterparse(file_path, events=("end",), tag='location'):
        if elem.text and match in elem.text:
            count += 1
        elem.clear()
    return count
count = 92
3.06user 0.62system 0:03.68elapsed 99%CPU (0avgtext+0avgdata 1529292maxresident)k

That's a lot faster. But what happened to the memory? 1.5 GB? We used to be able to process the whole file with only 23 MiB peak!

The reason is that the loop now only runs for location elements, and everything else is only handled internally by the parser – and the parser builds an in-memory XML tree for us. The elem.clear() call, which we previously used for deleting already processed parts of that tree, is now only executed for the location, a pure text tag, and thus cleans up almost nothing. We need to take care to clean up more again, so let's intercept on the item and look for the location from there.

def count_locations(file_path, match):
    count = 0
    for _, elem in ET.iterparse(file_path, events=("end",), tag='item'):
        text = elem.findtext('location')
        if text and match in text:
            count += 1
        elem.clear()
    return count
count = 92
3.11user 0.37system 0:03.50elapsed 99%CPU (0avgtext+0avgdata 994280maxresident)k

Ok, almost as fast, but still – 1 GB of memory? Why doesn't the cleanup work? Let's look at the file structure some more.

$ egrep -n '^(  )?<' bench_pp.xml
1:<?xml version="1.0" standalone="yes"?>
2:<site>
3:  <regions>
2753228:  </regions>
2753229:  <categories>
2822179:  </categories>
2822180:  <catgraph>
2824181:  </catgraph>
2824182:  <people>
3614042:  </people>
3614043:  <open_auctions>
5520437:  </open_auctions>
5520438:  <closed_auctions>
6401794:  </closed_auctions>
6401795:</site>

Ah, so there is actually much more data in there that is completely irrelevant for our task! All we really need to look at is the first ~2.7 million lines that contain the regions data. The entire second half of the file is useless, and simply generates heaps of data that our cleanup code does not handle. Let's make use of that learning in our code. We can intercept on both the item and the regions tags, and stop as soon as the regions data section ends.

def count_locations(file_path, match):
    count = 0
    for _, elem in ET.iterparse(file_path, events=("end",), tag=('item', 'regions')):
        if elem.tag == 'regions':
            break
        text = elem.findtext('location')
        if text and match in text:
            count += 1
        elem.clear()
    return count
count = 92
1.22user 0.04system 0:01.27elapsed 99%CPU (0avgtext+0avgdata 22048maxresident)k

That's great! We're actually using less memory than in the beginning now, and managed to cut down the runtime from 4.6 seconds to 1.2 seconds. That's almost a factor of 4!

Let's try one more thing. We are already intercepting on two tag names, and then searching for a third one. Why not intercept on all three directly?

def count_locations(file_path, match):
    count = 0
    for _, elem in ET.iterparse(file_path, events=("end",),
                                tag=('item', 'location', 'regions')):
        if elem.tag == 'location':
            text = elem.text
            if text and match in text:
                count += 1
        elif elem.tag == 'regions':
            break
        else:
            elem.clear()
    return count
count = 92
1.10user 0.03system 0:01.13elapsed 99%CPU (0avgtext+0avgdata 21912maxresident)k

Nice. Another bit faster, and another bit less memory used.

Anything else we can do? Yes. We can tune the parser a little more. Since we're only interested in the non-empty text content inside of tags, we can ignore all newlines that appear in our input file between the tags. lxml's parser has an option for removing such blank text, which avoids creating an in-memory representation for it.

def count_locations(file_path, match):
    count = 0
    for _, elem in ET.iterparse(file_path, events=("end",),
                                tag=('item', 'location', 'regions'),
                                remove_blank_text=True):
        if elem.tag == 'location':
            text = elem.text
            if text and match in text:
                count += 1
        elif elem.tag == 'regions':
            break
        else:
            elem.clear()
    return count
count = 92
0.97user 0.02system 0:01.00elapsed 99%CPU (0avgtext+0avgdata 21928maxresident)k

While the overall memory usage didn't change, the time saved by not creating the useless text nodes (and then cleaning them up again) is quite visible.

Overall, algorithmically improving our code and making better use of lxml's features gave us a speedup from initially 4.6 seconds down to one second. And we paid for that improvement with 4 additional lines of code inside our function. That's only half the code that Eli's SAX-based Go implementation needs (which, mind you, does not build an in-memory tree for you at all). And the Go code is only slightly faster than the initial Python implementations that we started from. Way to go! ;-)

Speaking of SAX, lxml also has a SAX interface. So let's compare how that performs.

import sys
import lxml.etree as ET

class Done(Exception):
    pass

class SaxCounter:
    def __init__(self, match):
        self.count = 0
        self.match = match
        self.text = []
        # The parser target's data() callback collects character data.
        self.data = self.text.append

    def start(self, tag, attribs):
        # Discard any text collected before this element started.
        del self.text[:]

    def end(self, tag):
        if tag == 'location':
            if self.text and self.match in ''.join(self.text):
                self.count += 1
        elif tag == 'regions':
            raise Done()

    def close(self):
        pass

def count_locations(file_path, match):
    target = SaxCounter(match)
    parser = ET.XMLParser(target=target)
    try:
        ET.parse(file_path, parser=parser)
    except Done:
        pass
    return target.count

count = count_locations(sys.argv[1], 'Africa')
print('count =', count)
count = 92
1.23user 0.02system 0:01.25elapsed 99%CPU (0avgtext+0avgdata 16060maxresident)k

And the exact same code works in ElementTree if you change the import again.
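ElementTree's XMLParser also accepts a target object with start(), end(), data() and close() methods, so only the import line needs to change:

import xml.etree.ElementTree as ET   # the SaxCounter code above stays unchanged

Running the SAX version with ElementTree then shows: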

count = 92
1.83user 0.02system 0:01.85elapsed 99%CPU (0avgtext+0avgdata 10280maxresident)k

Also, we can remove the regions check from the end() SAX method above, so that the entire file gets read.
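The end() method then reduces to something like this (a sketch; without the check, the Done exception is never raised and parsing runs to the end of the file):

    def end(self, tag):
        if tag == 'location':
            if self.text and self.match in ''.join(self.text):
                self.count += 1

For lxml, that yields: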

count = 92
3.22user 0.04system 0:03.27elapsed 99%CPU (0avgtext+0avgdata 15932maxresident)k

and this for ElementTree:

count = 92
4.72user 0.07system 0:04.79elapsed 99%CPU (0avgtext+0avgdata 10300maxresident)k

Seeing these numbers in comparison to iterparse(), the SAX interface does not seem worth the complexity, unless memory usage is really, really pressing.

A final note: here's the improved ElementTree iterparse() implementation that also avoids parsing useless data.

import sys
import xml.etree.ElementTree as ET

def count_locations(file_path, match):
    count = 0
    for event, elem in ET.iterparse(file_path, events=("end",)):
        if elem.tag == 'location':
            if elem.text and match in elem.text:
                count += 1
        elif elem.tag == 'regions':
            break
        elem.clear()
    return count

count = count_locations(sys.argv[1], 'Africa')
print('count =', count)
count = 92
1.71user 0.02system 0:01.74elapsed 99%CPU (0avgtext+0avgdata 11876maxresident)k

And while not as fast as the lxml version, it still runs considerably faster than the original implementation. And uses less memory.

Learnings to take away:

  • Say what you want.
  • Stop when you have it.