Faster Python calls in Cython 0.21

I spent some time during the last two weeks reducing the call overhead for Python functions and methods in Cython. It was already quite low compared to CPython before, about 30-40% faster, but profiling then made me stumble over the fact that method calls in CPython really just do one thing: they repack the argument tuple and prepend the 'self' object to it. However, that is done right after Cython has carefully packed up exactly that argument tuple in the first place, so by simply inlining what PyMethodObject does, we can avoid packing tuples twice.

Avoiding to create a PyMethodObject at all may also appear as an interesting goal, but doing that is totally not easy (it happens during attribute lookup) and it's also most likely not worth it as method objects are created from a freelist, which makes their instantiation very fast. Method objects also hold actual state that the caller must receive: the underlying function and the self object. So getting rid of them will severly complicate things without a major gain to expect.

Another obvious optimisation, however, is that Python code calls into C implemented functions quite often, and if those are implemented as specialised functions that take exactly one or no argument (METH_O/METH_NOARGS), then the tuple packing and unpacking can be avoided completely. Together with the method call optimisation, this means that Cython can now call very simple methods without creating an argument tuple, and less simple ones without redundantly creating a second argument tuple.

I implemented these optimisations and they immediately blew up the method call micro benchmarks in Python's benchmark suite from about 1/3 to 2-3 times faster than CPython 3.5 (pre). Those are only simple micro benchmarks, so any real world code will benefit substantially less overall. However, it turned out that a couple of benchmarks in the suite that are based on real production code ended up loosing 5-15% of their total runtime. That's quite remarkable, given that the code they call actually does something (much) more heavy weight than the call overhead itself. I'm still tuning it a bit, but so far am really happy with this result.