Finding elements in XML

Stefan Behnel

2014-04-27 11:09

With more than half a million PyPI downloads every month, the lxml toolkit is the most widely used external XML library for Python. It's even one of the most often downloaded packages overall, with apparently millions of users world wide. If you add to that the fact that it's mostly compatible with the standard library ElementTree package, it shows that its API is highly predominant in the Python world for anything that's XML processing.

However, I keep seeing people ask about the best way to find stuff in their XML documents. I guess that's because there are a couple of ways to do that: by (recursively) traversing the tree, by iterating over the tree, or by using the find() methods, XPath and even CSS selectors (which now rely on the external cssselect package). There are many ways to do this, simply because there are so many use cases. XPath is obviously the does-it-all tool here, and many people would quickly recommend an XPath expression when you give them a search problem, but it's also a lot like regular expressions for XML: now you have two problems.

Often, tree iteration is actually a much better way to do this than XPath. It's simpler and also faster in most cases. When looking for a specific tag, you can simply say:

for paragraph in tree.iter('p'):
    ...  # do something with the <p> element

Note that this iterates over the tree and processes elements as they are found, whereas the XPath engine always evaluates the expression exhaustively and only then returns with the complete list of results that it found. Expressing in XPath that only one element is wanted, i.e. making it short-circuit, can be quite a challenge.

Even when looking for a set of different tags, tree iteration will work just beautifully:

for block in tree.iter('p', 'div', 'pre'):
    if block.tag == 'p': ...
    elif block.tag == 'pre': ...
    else: ...

For the more involved tasks, however, such as finding an element in a specific subtree below sme parent, or selecting elements by attributes or text, you will quickly end up wanting to use a path language anyway. Lxml comes with two implementations here: ElementPath (implemented in Python) and XPath (implemented in C by libxml2). At this point, you might think that XPath, being implemented in C, should generally be a better choice, but that's actually not true. In fact, the ElementPath implementation in lxml makes heavy use of lxml's optimised tree iterators and tends to be faster than the generic XPath engine. Especially when only looking for one specific element, something like this is usually much faster than using the same expression in XPath:

target = root.find('.//parent/child//target')

One really nice feature of ElementPath is that it uses fully qualified tag names rather than resorting to externally defined prefixes. This means that a single expression in a self-contained string is enough to find what you are looking for:

target = root.find('.//{http://some/namespace}parent/*/{http://other/namespace}target')

On the downside, the predicate support in ElementPath is substantially more limited than in real XPath. Functions do not exist at all, for example. However, for simple things like testing attribute values, you don't have to sacrifice the performance of ElementPath.

for link in root.iterfind('.//a[@href]'):
    print link.get('href')

This example only tests for the existence of the attribute, which may still have an empty value. But you can also look for a specific value:

for link in root.iterfind('.//a[@href = "http://lxml.de/"]'):
    print("link to lxml's homepage found")

So, while XPath may appear like the one thing that helps with all your search needs, it's often better to use a simpler tool for the job.