Running coverty scan on lxml

Hearing a talk about static analysis at EuroPython 2014 and meeting Christian Heimes there (CPython core dev and member of the security response team) got us talking about running Coverty Scan on Cython generated code. They provide a free service for Open Source projects, most likely because there is a clear benefit in terms of marketing visibility and distributed filtering work on a large amount of code.

The problem with a source code generator is that you can only run the analyser on the generated code, so you need a real world project that uses the generator. The obvious choice for us was lxml, as it has a rather large code base with more than 230000 lines of C code, generated from some 20000 lines of Cython code. The first run against the latest lxml release got us about 1200 findings, but a quick glance through them showed that the bulk of them were false positives for the way Cython generates code for some Python constructs. There was also a large set of "dead code" findings that I had already worked on in Cython a couple of months ago. It now generates substantially less dead code. So I gave it another run against the current developer versions of both lxml and Cython.

The net result is that the number of findings went down to 428. A large subset of those relates to constant macros in conditions, which is what I use in lxml to avoid a need for C level #ifdefs. The C compiler is happy to discard this code, so Coverty's dead code finding is ok but not relevant. Other large sets of "dead code" findings are due to Cython generating generic error handling code in cases where an underlying C macro actually cannot fail, e.g. when converting a C boolean value to Python's constant True/False objects. So that's ok, too.

It's a bit funny that the tool complains about a single "{" being dead code, although it's followed immediately by a (used) label. That's not really an amount of code that I'd consider relevant for reporting.

On the upside, the tool found another couple of cases in the try-except implementation where Cython was generating dead code, so I was able to eliminate them. The advantage here is that a goto statement can be eliminated, which may leave the target label unused and can thus eliminate further code under that label that would be generated later but now can also be suppressed. Well, and generating less code is generally a good thing anyway.

Overall, the results make a really convincing case for Cython. Nothing of importance was found, and the few minor issues where Cython still generated more code than necessary could easily be eliminated, so that all projects that use the next version can just benefit. Compare that to manually written C extension code, where reference counting is a large source of errors and the verbose C-API of CPython makes the code substantially harder to get right and to maintain than the straight forward Python syntax and semantics of Cython. When run against the CPython code base for the first time, Coverty Scan found several actual bugs and even security issues. This also nicely matches the findings by David Malcolm and his GCC based analysis tool, who ended up using Cython generated code for eliminating false positives, rather than finding actual bugs in it.

Gesellschaft zum Wegwerfen

Frühstück im Hotel. Um mich herum alles voll mit mollig geformten Männern mittleren Alters, die stolz den selbst am Buffet gejagten Schinken auf einem See von Rührei über die ganze weite Strecke bis zu ihrem Platz schleppen. Frühsport kann eben nicht anstrengend genug sein, das dürfen andere ruhig sehen.

Neben mir zwei Frauen, vielleicht aus Südamerika, vielleicht arabisch. Unter der zentimeterdicken Schminkschicht nicht mehr zu unterscheiden. Vielleicht eher arabisch, Sprache klingt nicht nach Spanisch oder Portugiesisch, auch wenn das vielleicht nichts heißt. Vorher schon beiden beim Brottisch begegnet. Eine schaut verächtlich auf die ausgelegten Brote und Brötchen, greift dann in die Plastikware, nimmt je zwei eingeschweiste Päckchen von allem: Brandt-Zwieback, Pumpernickel, Knäckebrot. Pro Packung zwei Scheiben. Der Teller dazu wird mit Eiern, Quark und Gurke gefüllt.

Dann, am Platz, ein Probieren des Rühreis, offensichtlicher Mangel an Geschmackserlebnis. Mit viel Quark drauf werden ein paar Bissen Zwieback mit einem Löffelchen Rührei runtergewürgt, dann geht der ganze Teller per Bedienung in den Müll. Oben drauf schnell noch ein intaktes gekochtes Ei der zweiten Frau. Das war wohl nur unabsichtlich auf deren Teller gelandet, denn jetzt geht sie los und kommt mit einem fertig gepellten und geschnittenen Ei wieder. Klar, wäre ja auch viel zu viel Aufwand gewesen, das selbst zu öffnen, wozu gibt es schließlich Bedienstete.

Ich bin fertig mit diesem Frühstück.

Neulich bei Aldi

Ich muss es leider zugeben, ich gehe manchmal zu Aldi. Ungern, aber ich tue es. Noch.

Neulich finde ich ein halbes Kilo Tomaten in den Gemüseregalen. Packung eingerissen, aber Tomaten OK. Schließlich an der Kasse, sitzt dort eine Azubine. Sie zieht meine Einkäufe durch den Scanner, sieht dann die Verpackung der Tomaten, schaut mich mitleidig an und meint: „Wollen Sie sich neue holen?“. Wie immer war ich nicht schlagfertig genug, sondern habe nur „Nein“ gesagt. Nachher: geärgert. Klar.

Leute, ich will euren Müll doch nicht. Ich will die Tomaten. Wenn ihr die besser transportieren könnt, weil Plastik außen drum ist, dann ist das vielleicht für euch ganz nett, aber ich will das nicht. Und ganz besonders will ich nicht, dass ihr wertvolle Lebensmittel zu Müll macht, nur weil der wirkliche Müll außen rum nicht mehr eurer Vorstellung von schönem Müll entspricht. Wie krank kann man eigentlich sein?

Finding elements in XML

With more than half a million PyPI downloads every month, the lxml toolkit is the most widely used external XML library for Python. It's even one of the most often downloaded packages overall, with apparently millions of users world wide. If you add to that the fact that it's mostly compatible with the standard library ElementTree package, it shows that its API is highly predominant in the Python world for anything that's XML processing.

However, I keep seeing people ask about the best way to find stuff in their XML documents. I guess that's because there are a couple of ways to do that: by (recursively) traversing the tree, by iterating over the tree, or by using the find() methods, XPath and even CSS selectors (which now rely on the external cssselect package). There are many ways to do this, simply because there are so many use cases. XPath is obviously the does-it-all tool here, and many people would quickly recommend an XPath expression when you give them a search problem, but it's also a lot like regular expressions for XML: now you have two problems.

Often, tree iteration is actually a much better way to do this than XPath. It's simpler and also faster in most cases. When looking for a specific tag, you can simply say:

for paragraph in tree.iter('p'):
    ...  # do something with the <p> element

Note that this iterates over the tree and processes elements as they are found, whereas the XPath engine always evaluates the expression exhaustively and only then returns with the complete list of results that it found. Expressing in XPath that only one element is wanted, i.e. making it short-circuit, can be quite a challenge.

Even when looking for a set of different tags, tree iteration will work just beautifully:

for block in tree.iter('p', 'div', 'pre'):
    if block.tag == 'p': ...
    elif block.tag == 'pre': ...
    else: ...

For the more involved tasks, however, such as finding an element in a specific subtree below sme parent, or selecting elements by attributes or text, you will quickly end up wanting to use a path language anyway. Lxml comes with two implementations here: ElementPath (implemented in Python) and XPath (implemented in C by libxml2). At this point, you might think that XPath, being implemented in C, should generally be a better choice, but that's actually not true. In fact, the ElementPath implementation in lxml makes heavy use of lxml's optimised tree iterators and tends to be faster than the generic XPath engine. Especially when only looking for one specific element, something like this is usually much faster than using the same expression in XPath:

target = root.find('.//parent/child//target')

One really nice feature of ElementPath is that it uses fully qualified tag names rather than resorting to externally defined prefixes. This means that a single expression in a self-contained string is enough to find what you are looking for:

target = root.find('.//{http://some/namespace}parent/*/{http://other/namespace}target')

On the downside, the predicate support in ElementPath is substantially more limited than in real XPath. Functions do not exist at all, for example. However, for simple things like testing attribute values, you don't have to sacrifice the performance of ElementPath.

for link in root.iterfind('.//a[@href]'):
    print link.get('href')

This example only tests for the existence of the attribute, which may still have an empty value. But you can also look for a specific value:

for link in root.iterfind('.//a[@href = ""]'):
    print("link to lxml's homepage found")

So, while XPath may appear like the one thing that helps with all your search needs, it's often better to use a simpler tool for the job.

discontinuing Flattr for lxml

I removed lxml from Flattr. The outcome is just way below ground. The Flattr revenue for the project over the last years was a pretty constant 6 EUR per month. Minus taxes. And Flattr already kept their 10% share, which is what bugs me most, I guess. That leaves Paypal as the current (and certainly better) alternative if you want to support the development behind lxml.

Cython 0.20 will make everything better :)

It`s been a while since the last major Cython release, so here`s a little update on what`s going to be new in 0.20. We`ve managed to improve the compiler in several important areas.

I`ve been working on making Cython a serious string processing tool for a couple of releases now, and the next release will add another piece to the puzzle. Cython has four string types now: bytes, str, unicode and, newly added, basestring. The special types str and basestring help with writing portable code for both Python 2 and 3. The str type expands to bytes in Py2 and to the Unicode string type in Py3, i.e. to whatever the platform calls "str" natively. This is really nice, because it allows you to get the native string type by simply saying "str" or by writing an unprefixed string literal. However, the problem was that when you type a variable or function argument as str, you couldn't assign a Unicode string to it in Py2, thus unnecessarily rejecting some text input. So I added the basestring type, which accepts both str and unicode in Py2, but only Unicode strings in Py3. Anything else will raise a TypeError on assignment.

Another corner where string processing was improved is the bytearray type. Cython recognises it as a known type now and allows it to coerce from and to C's char* values for much faster interaction with C/C++ code.

Cython`s own internal types, such as the function type or generators, are shared between modules now. Basically, the first imported Cython module in a running interpreter registers a shared (versioned) module that all later Cython modules use for initialisation. This allows multiple installed modules to recognise each others types, which allows for much faster interoperability.

Type inference has also gained new capabilities. It`s based on control flow analysis now (instead of function-global type analysis), which allows for more fine-grained typing. It`s also used in some cases to remember assignments to local constants, which allows Cython to see the current value of that variable and to optimise code based on it. For example, assignments of bound methods (like "append = some_list.append") can be traced back to the underlying object now, which improves the performance of some untyped Python code out there.

As for usability, the compiler has gained a new frontend script "cythonize" that mimics the capabilities of the cythonize() function used in scripts. To compile a module, or even an entire package down to extension modules, you can simply say "cythonize -i some/package", where "-i" requests to compile and build the extensions "in place", i.e. next to their sources. Oh, and while we`re at it, compiling packages (i.e. "" files) actually works in Python 3.3 now.

Extension types behave a bit more smartly at garbage collection time. When an attribute is declared with a known builtin type that cannot produce reference cycles (e.g. strings), then it is automatically excluded from the garbage collection visitor. This reduces the traversal overhead when searching for unreachable objects. It can even exclude a type completely from garbage collection traversal if none of its object attributes requires it. And for the previously tricky case that object attributes had to be accessed at finalisation time, there is a new type decorator "@cython.no_gc_clear" that prevents them from being cleared ahead of time when they participate in reference cycles. Certainly way better than a crash.

And finally, the try-finally statement has been optimised to handle each exit case separately, which allows the C compiler to generate much faster code for the "normal" cases.

Since 0.20 isn`t out yet, this list might still grow a bit, but that`s what is there now. Not bad, if you ask me.

No Cython talk at PyCon-US 2014

I recently got the rejection notice for the Cython talk I had submitted to next year`s PyCon-"US" (in Montréal). Since none of the Cython core developers had ever managed to go to that conference before, and thus Cython had never had a well founded talk there (and only one talk during the last years that at least mentioned it in its title), I had hoped to catch up with that a bit and give a general overview of what Cython is and when and how to use it for speeding up Python code or for talking to external libraries.

It`s sad to get such a talk rejected, but what`s really worse was their reasoning. It was rejected due to potential lack of interest, completely misjudging it as an "internals" talk for some reason. Seriously? Lack of interest in the most widely used static Python compiler? In the only Python AoT compiler that has a substantial user base? And that loads and loads of people in the Scientific Python community use for their high performance calculations?

Instead, the list of accepted talks includes major Python topics like Lisp, JavaScript and Garbage Collectors. I`m sure those will attract a much broader audience at a Python conference than a niche topic like, you know, speeding up Python code.

Faster stdlib logging

I keep reiterating that it would be nice if a couple of Python stdlib modules got compiled by Cython during Python's installation, simply because it makes them faster right away, without sacrificing compatibility. Besides the difflib module, another nice example I found is the logging module. The logging benchmarks in Python's benchmark suite run between 20% and 50% faster when the module is compiled. Especially the silent logging case is interesting with its ~50%, because a lot of log messages really never end up in a log somewhere, so you can leave more log statements in your code to activate them at need. And I'm sure there's still a bit more to gain here by adding a couple of static type declarations in the right spots of that module. Feel free to give it a try.

Cython vs. binding generators

This year's PyCon-US included a talk with the rather lurid title "Cython vs. SWIG, Fight!". I'm sure many of the attendees expected something different than they actually got, which was mostly a featurewise comparison at the "this is how some basic examples might look" level. In fact, I think the difference between Cython and SWIG, or actually Cython and any of those binding generators or wrapping tools, can be summarised in two words. They are binding generators, whereas Cython is a programming language, with all its expressiveness.

Cython's main selling point is that it completely blurs the border between Python and C. A binding generator, or a foreign language interface tool like ctypes or cffi, will only ever give you the choice of two extremes:

  1. write your code in Python and let it talk to wrapped C, or
  2. write your code in C and wrap it.

Either easy or hard, nothing in between, and it's not always your choice. They give you no help at all with the parts where it needs to be hard for some reason, and some do not even reduce that unfriendly feeling of working with truly foreign code. Even worse, these tools usually have a very specific idea about how the wrapping should look like, so if you come to a point where you're not happy with the way they work, you'll have to start working against the tool.

Cython is different. With Cython, you can

  1. write your code in Python and let it talk to wrapped C
  2. compile your Python code
  3. add static type declarations to your Python code to specialise and speed up the result of the compilation
  4. change your Python code file extension to ".pyx" to start using elements of the Cython language that let your code talk to C directly (note that this makes it Cython code)
  5. write your code in the Cython language and freely mix elements of Python and C, talking to both sides natively
  6. write your code in C and wrap it

So it opens up that entire huge grey area between those two extremes that you get from other tools. It lets you freely choose the tradeoff between simplicity and speed, and makes it very easy to move gradually between one and the other.

Want to reduce complexity and use fast, high-level tools? Use Python constructs. Compiling them in Cython makes them even faster by statically analysing and optimising your code and inferring types for you. And it allows you to give the compiler explicit static type hints that reduce overhead and specialise your code even further.

Want low-level speed? Move closer to C constructs, in exactly those areas in your code that need it. Any point along that gradient is right at your finger tips. Need to talk to C, C++, Fortran, maybe other languages? Without having to go through a level of indirection that bites right into your performance benefit? You can. Cython makes this easy, actually "normal".

And, we're always interested in improving the "static type declarations in Python code" kind of usage (which we call "pure Python mode"), so if you want to help extending the expressiveness of Cython's type declarations for pure Python code to further blur the gradients for users, you should talk to us about it. We have a mailing list.

Bugs in CPython extension modules

A while back, I wrote up my opinion on writing CPython extensions by hand, using the C-API. Most of the replies I got were asking for a proof, but the article was more of a summary of my prior experience than anything really new.

Now, David Malcolm, author of the GCC Python plugin, has given a talk at this year's PyCon-US where he used a static analysis tool chain that he's been working on based on his GCC plugin to find bugs in CPython extension modules. Being a Fedora developer, he ran it against the wealth of binary Python packages in that distribution and ended up finding a lot of bugs. Very unsurprisingly to me, most of them were refcount bugs, mainly memory leaks, especially in error handling cases, but also lots of other issues with reference handling, e.g. missing NULL/error tests etc. At the end of the talk, he was asked what bugs his tools found not only in manually written code but in generated code, specifically C code generated by Cython. He answered that it was rather the other way round: he had used Cython generated code to prune false positives from his analysis tool, because it was quite obvious that the code that Cython generated was actually correct.

I think that nicely supports what I wrote in my last post.