Locally loading DTDs from XML catalogs with lxml

Not all lxml users find it obvious how an XML processor loads DTDs and external entities. In particular, on a misconfigured system, lxml may fail to parse a document that needs a DTD, or it may fetch the DTD from the network repeatedly when the no_network parser option is set to False (network access is blocked by default).
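To make the parser options concrete, here is a minimal sketch, assuming lxml is installed. It builds a parser that is asked to load the DTD (load_dtd=True) while network access stays blocked (no_network=True, spelled out even though it is the default). The file names greeting.dtd and doc.xml and the helper names are made up for illustration.

```python
import os
import tempfile

from lxml import etree


def parse_with_local_dtd(xml_path):
    """Parse xml_path, loading its DTD without touching the network."""
    # load_dtd=True asks the parser to load the DTD named in the DOCTYPE.
    # no_network=True forbids network access, so the DTD has to come from
    # a local file or from an XML catalog entry.
    parser = etree.XMLParser(load_dtd=True, no_network=True)
    return etree.parse(xml_path, parser)


def demo():
    """Write a hypothetical DTD and document to a temp dir and parse them."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "greeting.dtd"), "w", encoding="utf-8") as f:
            f.write("<!ELEMENT greeting (#PCDATA)>\n")
        xml_path = os.path.join(tmp, "doc.xml")
        with open(xml_path, "w", encoding="utf-8") as f:
            # The relative system ID is resolved against the document location.
            f.write('<!DOCTYPE greeting SYSTEM "greeting.dtd">\n')
            f.write("<greeting>Hello</greeting>\n")
        return parse_with_local_dtd(xml_path)


if __name__ == "__main__":
    tree = demo()
    print(tree.getroot().tag, tree.docinfo.system_url)
```

Here the DTD sits next to the document, so no catalog is involved. When the DOCTYPE names a remote URL instead, the same parser depends on a catalog entry to find a local copy.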

I commented on this on the lxml mailing list in 2008 when there was a discussion about high web traffic at the W3C due to excessive DTD loading, which was also attributed to parts of the Python standard library.

The right way to handle this (in general, but especially for lxml) is to configure the XML catalogs on the local system. The libxml2 site has some documentation on how to do this. The advantage of using catalogs is that most XML tools will use them when available, so it is a system-wide fix for the problem. Most Linux installations come with preconfigured XML catalogs, but other systems may need to be set up by hand.
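As a rough illustration of what such a catalog looks like, the following sketch writes a minimal OASIS XML catalog that maps one DTD's public and system identifiers to a local copy. The helper name write_catalog and the paths /opt/dtds/xhtml1-strict.dtd and /opt/dtds/catalog.xml are assumptions made for the example. The identifiers are those of the XHTML 1.0 Strict DTD. libxml2 consults the XML_CATALOG_FILES environment variable for additional catalog files. Because it reads the catalogs when it first initializes them, it is probably safest to set the variable in the shell before starting Python rather than from inside the running process.

```python
import pathlib
import xml.etree.ElementTree as ET

# Namespace of the OASIS XML Catalogs format.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def write_catalog(catalog_path, public_id, system_id, local_dtd_path):
    """Write a catalog mapping one DTD's identifiers to a local file."""
    # Serialize the catalog namespace as the default namespace (no prefix).
    ET.register_namespace("", CATALOG_NS)
    catalog = ET.Element(f"{{{CATALOG_NS}}}catalog")
    # Catalog entries point at URIs, so turn the local path into file://...
    dtd_uri = pathlib.Path(local_dtd_path).resolve().as_uri()
    ET.SubElement(
        catalog, f"{{{CATALOG_NS}}}public",
        {"publicId": public_id, "uri": dtd_uri},
    )
    ET.SubElement(
        catalog, f"{{{CATALOG_NS}}}system",
        {"systemId": system_id, "uri": dtd_uri},
    )
    ET.ElementTree(catalog).write(
        catalog_path, encoding="utf-8", xml_declaration=True
    )
    return catalog_path


if __name__ == "__main__":
    # Hypothetical locations; adjust to wherever the DTD copy actually lives.
    path = write_catalog(
        "/opt/dtds/catalog.xml",
        "-//W3C//DTD XHTML 1.0 Strict//EN",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
        "/opt/dtds/xhtml1-strict.dtd",
    )
    print("export XML_CATALOG_FILES=" + path)
```

On systems that already ship a catalog (commonly /etc/xml/catalog on Linux), adding entries to the existing catalog is usually preferable to pointing XML_CATALOG_FILES at a separate file, since every catalog-aware tool then benefits.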