Running Coverity Scan on lxml

Hearing a talk about static analysis at EuroPython 2014 and meeting Christian Heimes there (CPython core developer and member of the security response team) got us talking about running Coverity Scan on Cython-generated code. Coverity provides a free service for Open Source projects, most likely because there is a clear benefit for them in terms of marketing visibility and distributed filtering work on a large amount of code.

The problem with a source code generator is that you can only run the analyser on the generated code, so you need a real-world project that uses the generator. The obvious choice for us was lxml, as it has a rather large code base with more than 230,000 lines of C code, generated from some 20,000 lines of Cython code. The first run against the latest lxml release produced about 1200 findings, but a quick glance through them showed that the bulk were false positives triggered by the way Cython generates code for some Python constructs. There was also a large set of "dead code" findings that I had already worked on in Cython a couple of months ago; it now generates substantially less dead code. So I gave it another run against the current developer versions of both lxml and Cython.

The net result is that the number of findings went down to 428. A large subset of those relates to constant macros in conditions, which I use in lxml to avoid the need for C-level #ifdefs. The C compiler is happy to discard this code, so Coverity's dead code finding is correct but not relevant. Other large sets of "dead code" findings are due to Cython generating generic error handling code in cases where an underlying C macro actually cannot fail, e.g. when converting a C boolean value to Python's constant True/False objects. So that's ok, too.
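
To illustrate the first pattern, here is a minimal, hypothetical sketch (the macro name is made up, not taken from lxml): a compile-time constant used in a plain C condition instead of an #ifdef. The compiler simply throws the unused branch away, but a static analyser still sees it and reports it as dead code.

    #include <stdio.h>

    #define HAVE_FANCY_FEATURE 0   /* hypothetical configuration macro */

    static void do_work(void)
    {
        if (HAVE_FANCY_FEATURE) {
            /* discarded by the C compiler when the macro is 0,
               but reported as dead code by the analyser */
            printf("fancy code path\n");
        } else {
            printf("portable fallback\n");
        }
    }

    int main(void)
    {
        do_work();
        return 0;
    }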

It's a bit funny that the tool complains about a single "{" being dead code, although it's followed immediately by a (used) label. That's not really an amount of code that I'd consider relevant for reporting.
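For the curious, here is a simplified, hypothetical reconstruction of what such a spot looks like in the generated C code: the opening brace cannot be reached by falling through, but the label right behind it is clearly jumped to.

    int example(int fail)
    {
        if (fail)
            goto cleanup;   /* the label below is clearly used ... */
        return 0;
        {                   /* ... yet this "{" gets flagged as dead code */
    cleanup:
            return -1;
        }
    }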

On the upside, the tool found another couple of cases in the try-except implementation where Cython was generating dead code, so I was able to eliminate them. The advantage is that eliminating a goto statement may leave its target label unused, which in turn allows suppressing further code under that label that would otherwise be generated later. Well, and generating less code is generally a good thing anyway.
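
Roughly, and again only as a hypothetical sketch of the shape of the generated code, the cascade looks like this: once the dead goto is removed, the label loses its last user, and the error handling code under it no longer needs to be generated either.

    int convert(int value)
    {
        int result = value + 1;   /* an operation that cannot actually fail */
        if (0)
            goto bad;             /* dead goto: eliminating it ... */
        return result;
    bad:                          /* ... leaves this label unused ... */
        result = -1;              /* ... so this error handling code can be
                                     suppressed as well */
        return result;
    }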

Overall, the results make a really convincing case for Cython. Nothing of importance was found, and the few minor cases where Cython still generated more code than necessary could easily be eliminated, so that all projects that use the next version will simply benefit. Compare that to manually written C extension code, where reference counting is a large source of errors and the verbose C-API of CPython makes the code substantially harder to get right and to maintain than the straightforward Python syntax and semantics of Cython. When run against the CPython code base for the first time, Coverity Scan found several actual bugs and even security issues. This also nicely matches the experience of David Malcolm and his GCC-based analysis tool: he ended up using Cython-generated code for eliminating false positives rather than for finding actual bugs in it.
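
To make the comparison with hand-written extension code a bit more concrete, here is a hedged illustration (not lxml code, just a made-up minimal module) of what a trivial function looks like when written directly against the C-API: every call that can fail needs its own cleanup path, and every owned reference has to be released on each of them.

    #include <Python.h>

    /* Return seq[0] + seq[1]. */
    static PyObject *
    sum_first_two(PyObject *self, PyObject *args)
    {
        PyObject *seq, *first = NULL, *second = NULL, *result = NULL;

        if (!PyArg_ParseTuple(args, "O", &seq))
            return NULL;

        first = PySequence_GetItem(seq, 0);    /* new reference */
        if (first == NULL)
            goto done;
        second = PySequence_GetItem(seq, 1);   /* new reference */
        if (second == NULL)
            goto done;
        result = PyNumber_Add(first, second);  /* new reference or NULL */

    done:
        Py_XDECREF(first);    /* forgetting either of these leaks silently */
        Py_XDECREF(second);
        return result;
    }

    static PyMethodDef methods[] = {
        {"sum_first_two", sum_first_two, METH_VARARGS, "Add the first two items."},
        {NULL, NULL, 0, NULL}
    };

    static struct PyModuleDef moduledef = {
        PyModuleDef_HEAD_INIT, "handwritten", NULL, -1, methods
    };

    PyMODINIT_FUNC
    PyInit_handwritten(void)
    {
        return PyModule_Create(&moduledef);
    }

In Cython (or plain Python), the same thing is a one-liner, def sum_first_two(seq): return seq[0] + seq[1], with all of the error handling and reference counting generated automatically.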