Cython, pybind11, cffi – which tool should you choose?

In and after the conference talks that I give about Cython, I often get the question of how it compares to other tools like pybind11 and cffi. There are others, but these are definitely the three that are widely used and "modern" in the sense that they provide an efficient user experience for today's real-world problems. And as with all tools from the same problem space, there are overlaps and differences. First of all, pybind11 and cffi are pure wrapping tools, whereas Cython is a Python compiler and a complete programming language that is used to implement actual functionality, not just to bind to it. So let's focus on the area where the three tools compete: extending Python with native code and libraries.

Using native code from the Python runtime (specifically CPython) has been at the heart of the Python ecosystem since the very early days. Python is a great language for all sorts of programming needs, but it would not have the success and would not be where it is today without its great ecosystem, which is heavily based on fast, low-level, native code. And the world of computing is full of such code, often decades old, heavily tuned, well tested and production proven. Indicators like the TIOBE Index even suggest that low-level languages like C and C++ have been gaining importance again in recent years, decades after their creation.

Today, no-one would attempt to design a (serious/practical) programming language anymore that does not come out of the box with a complete and easy to use way to access all that native code. This ability is often referred to as an FFI, a foreign function interface. Rust is an excellent example of a modern language that was designed with that ability in mind. LuaJIT's FFI is a great example of a fast and easy to use FFI design for the Lua language. Even Java and its JVM, which are certainly not known for their ease of code reuse, have provided the JNI (Java Native Interface) from the early days. CPython, being written in C, has made it very easy to interface with C code right from the start, and above all others, the whole scientific computing and big data community has made great use of that over the past 25 years.

Over time, many tools have aimed to simplify the wrapping of external code. The venerable SWIG with its long list of supported target languages is clearly worth mentioning here. Partially a successor to SWIG (and sip), shiboken is a C++ bindings generator used by the PySide project to auto-create wrapper code for the large Qt C++ API.

A general shortcoming of all wrapper generators is that many users eventually reach the limits of their capabilities, be it in terms of performance, feature support, or language integration on one side or the other. From that point on, users start fighting the tool in order to make it support their use case at hand, and it is not unheard of that projects start over from scratch with a different tool. Therefore, most projects are better off starting directly with a manually written wrapper, at least when the part of the native API that they need to wrap is not prohibitively vast.

The lxml XML toolkit is an excellent example of that. It wraps libxml2 and libxslt with their huge low-level C-APIs. But if the project had used a generator to create the Python wrapper, the direct mapping of this C-API to Python would have made the language integration of the Python-level API close to unusable. In fact, the whole project started because generated Python bindings for both libraries already existed that were like "the thrilling embrace of an exotic stranger" (Mark Pilgrim). And beautifying the API at the Python level by adding another Python wrapper layer would have countered the advantages of a generated wrapper and also severely limited its performance. Despite the sheer vastness of the C-API that it wraps, the decision for manual wrapping and against a wrapper generator was the foundation of a very fast and highly pythonic tool.

Nowadays, three modern tools are widely used in the Python community that support manual wrapping: Cython, cffi and pybind11. These three tools serve three different sides of the need to extend (C)Python with native code.

  • Cython is Python with native C/C++ data types.

    Cython is a static Python compiler. For people coming from a Python background, it is much easier to express their coding needs in Python and then optimise and tune the code than to rewrite it in a foreign language. Cython allows them to do that by automatically translating their Python code to C, which often avoids the need for an implementation in a low-level language.

    Cython uses C type declarations to mix C/C++ operations into Python code freely, be it the usage of C/C++ data types and containers, or of C/C++ functions and objects defined in external libraries. There is a very concise Cython syntax that uses special additional keywords (cdef) outside of Python syntax, as well as ways to declare C types in pure Python syntax. The latter allows writing type annotated Python code that gets optimised into fast C code when compiled by Cython, but that remains entirely pure Python code that can be run, analysed and debugged with the usual Python tools. (A short sketch of both syntax variants follows after this list.)

    When it comes to wrapping native libraries, Cython has strong support for designing a Python API for them. Being Python, it really keeps the developer focussed on the usage from the Python side and on solving the problem at hand, and it takes care of most of the boilerplate code through automatic type conversions and low-level code generation. Using it essentially feels like writing C code without having to write C code, while remaining in the wonderful world of the Python language.

  • pybind11 is modern C++ with Python integration.

    pybind11 is the exact opposite of Cython. Coming from C++, and targeting C++ developers, it provides a C++ API that wraps native functions and classes into Python representations. For that, it makes good use of the compile time introspection features that were added to C++11 (hence the name). Thus, it keeps the user focussed on the C++ side of things and takes care of the boilerplate code for mapping it to a Python API.

    For everyone who is comfortable with programming in C++ and wants to make direct use of all C++ features, pybind11 is the easiest way to make the C++ code available to Python.

  • CFFI is Python with a dynamic runtime interface to native code.

    cffi, then, is the dynamic way to load and bind to external shared libraries from regular Python code. It is similar to the ctypes module in the Python standard library, but generally faster and easier to use. Also, it has very good support for the PyPy Python runtime, still better than what Cython and pybind11 can offer. However, the runtime overhead prevents it from coming anywhere close in performance to the statically compiled code that Cython and pybind11 generate for CPython. And the dependency on a well-defined ABI (binary interface) means that C++ support is mostly lacking.

    As long as there is a clear API-to-ABI mapping of a shared library, cffi can directly load and use the library file at runtime, given a header file description of the API. In the more complex cases (e.g. when macros are involved), cffi uses a C compiler to generate a native stub wrapper from the description and uses that to communicate with the library. That raises the runtime dependency bar quite a bit compared to ctypes (and both Cython and pybind11 only need a C compiler at build time, not at runtime), but on the other hand it also enables wrapping library APIs that are difficult to use with ctypes. (A small sketch of the direct ABI mode follows after this list.)
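
To make the two Cython syntax variants mentioned above concrete, here is a minimal sketch of the same made-up function in both forms. First, the concise Cython syntax with the cdef keyword:

def count_below(values, double limit):
    # "double" and "cdef long" are C type declarations mixed into Python code.
    cdef long count = 0
    for value in values:
        if value < limit:
            count += 1
    return count

The same function in pure Python syntax stays runnable and debuggable as plain Python code, while compiling to the same C code:

import cython

def count_below(values, limit: cython.double):
    # PEP-526 style annotations that Cython maps to C variables.
    count: cython.long = 0
    for value in values:
        if value < limit:
            count += 1
    return count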
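
For comparison, here is a minimal sketch of cffi's direct ABI mode, close to the classic example from the cffi documentation (POSIX only, since it loads the standard C library at runtime):

from cffi import FFI

ffi = FFI()
ffi.cdef("""
    int printf(const char *format, ...);   // header-style API description
""")
C = ffi.dlopen(None)    # load the standard C library (POSIX)

arg = ffi.new("char[]", b"world")
C.printf(b"hi there, %s.\n", arg)    # prints "hi there, world."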

The list above shows the clear tradeoffs of the three tools. If performance is not important, if dynamic runtime access to libraries is an advantage, and if users prefer writing their wrapping code in Python, then cffi (or even ctypes) will do the job, nicely and easily. Otherwise, users with a strong C++ background will probably prefer pybind11, since it allows them to write functionality and wrapper code in C++ without switching between languages. For users with a Python background (or at least without a preference for C/C++), Cython will be very easy to learn and use, since the code remains Python, but gains the ability to do efficient native C/C++ operations at any point.

What CPython could use Cython for

There has been a recent discussion about using Cython for CPython development. I think this is a great opportunity for the CPython project to make more efficient use of its scarcest resource: developer time of its spare time contributors and maintainers.

The entry level for new contributors to the CPython project is often perceived to be quite high. While many tasks are actually beginner friendly, such as helping with the documentation or adding features to the Python modules in the stdlib, such important tasks as fixing bugs in the core interpreter, working on data structures, optimising language constructs, or improving the test coverage of the C-API require a solid understanding of C and the CPython C-API.

Since a large part of CPython is implemented in C, and since it exposes a large C-API to extensions and applications, C level testing is key to providing a correct and reliable native API. There were a couple of cases in the past years where new CPython releases actually broke certain parts of the C-API, and no-one noticed until people complained that their applications broke when trying out the new release. This is because the test coverage of the C-API is much lower than that of the well tested Python level of the runtime and its standard library. The main reason for this is that it is much more difficult to write tests in C than in Python, so people have a high incentive to avoid it if they can. Since the C-API is used internally inside of the runtime, it is often assumed to be implicitly tested by the Python tests anyway, which raises the bar for an explicit C test even further. But this implicit coverage is not always given, and it also does not reduce the need for regression tests. Cython could help here by making it easier to write C level tests that integrate nicely with the existing Python unit test framework that the CPython project uses.

Basically, writing a C level test in Cython means writing a Python unittest function and then doing an explicit C operation in it that represents the actual test code. Here is an example of testing the PyList_Append C-API function:

from cpython.object cimport PyObject
from cpython.list cimport PyList_Append

def test_PyList_Append_on_empty_list():
    # setup code
    l = []
    assert len(l) == 0
    value = "abc"
    pyobj_value = <PyObject*> value
    refcount_before = pyobj_value.ob_refcnt

    # conservative test call, translates to the expected C code,
    # although with automatic exception propagation if it returns -1:
    errcode = PyList_Append(l, value)

    # validation
    assert errcode == 0
    assert len(l) == 1
    assert l[0] is value
    assert pyobj_value.ob_refcnt == refcount_before + 1
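
Such a test module can be compiled and run in several ways; here is a minimal sketch using Cython's pyximport, assuming the code above was saved as c_api_tests.pyx (a made-up file name for this example):

import pyximport
pyximport.install()    # compile .pyx modules transparently on import

import c_api_tests     # hypothetical module containing the test above
c_api_tests.test_PyList_Append_on_empty_list()
print("ok")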

In the Cython project itself, what we actually do is write doctests. The functions and classes in a test module are compiled with Cython, and the doctests are then executed by Python and call into the compiled Cython implementations. This provides a very nice and easy way to compare the results of Cython operations with those of Python, and it also trivially supports data driven tests, e.g. by calling a function multiple times from a doctest:

from cpython.number cimport PyNumber_Add

def test_PyNumber_Add(a, b):
    """
    >>> test_PyNumber_Add('abc', 'def')
    'abcdef'
    >>> test_PyNumber_Add('abc', '')
    'abc'
    >>> test_PyNumber_Add(2, 5)
    7
    >>> -2 + 5
    3
    >>> test_PyNumber_Add(-2, 5)
    3
    """
    # The following is equivalent to writing "return a + b" in Python or Cython.
    return PyNumber_Add(a, b)

This could even trivially be combined with hypothesis and other data driven testing tools.
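
As a minimal sketch of such a combination, assuming the doctest module above was compiled into an importable module named c_api_tests (a made-up name), hypothesis could drive the compiled test function with generated data:

from hypothesis import given, strategies as st

from c_api_tests import test_PyNumber_Add   # hypothetical compiled module

@given(st.integers(), st.integers())
def test_add_matches_python(a, b):
    # PyNumber_Add() must agree with Python's "+" operator for any integers.
    assert test_PyNumber_Add(a, b) == a + b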

But Cython's use cases are not limited to testing. Maintenance and feature development would probably benefit even more from a reduced entry level.

Many language optimisations are applied in the AST optimiser these days, and that is implemented in C. However, these tree operations can be fairly complex, which makes them non-trivial to implement in C. Python code would be much easier to write and maintain, but since this code is part of the Python compilation process, there's a chicken-and-egg problem here in addition to the performance problem. Cython could solve both problems, and could allow for more far-reaching optimisations by keeping the necessary transformation code readable.

Performance is also an issue in other parts of CPython, namely the standard library. Several stdlib modules are compute intensive. Many of them have two implementations: one in Python and a faster one in C, a so-called accelerator module. Adding a feature to these modules therefore requires duplicate effort, proficiency in both Python and C, and a solid understanding of the C-API, reference counting, garbage collection, and so on. On the other hand, many modules that could certainly benefit from native performance lack such an accelerator, e.g. difflib, textwrap, fractions, statistics, argparse, email, urllib.parse and many, many more. The asyncio module is becoming more and more important these days, but its native accelerator only covers a very small part of its large functionality, and it also does not expose a native API that performance hungry async tools could hook into. And even though the native accelerator of the ElementTree module is an almost complete replacement, the somewhat complex serialisation code is still implemented entirely in Python, which shows in comparison to the native serialisation in lxml.

Compiling these modules with Cython would speed them up, probably quite visibly. For this use case, it is possible to keep the code entirely in Python and just add enough type declarations to make it fast when compiled. The typing syntax that PEP-484 and PEP-526 added to Python 3.6 makes this really easy and straightforward. A manually written accelerator module could thus be avoided, and with it a lot of duplicated functionality and maintenance overhead.
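
As a rough sketch of what that could look like (the function is made up, not actual stdlib code), the following is plain Python that Cython can also compile into fast C code:

import cython

def mean(values: list) -> cython.double:
    # PEP-526 annotations let Cython use fast C variables in the loop,
    # while the code keeps running unchanged under plain CPython.
    total: cython.double = 0.0
    count: cython.long = 0
    for value in values:
        total += value
        count += 1
    return total / count if count else 0.0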

Feature development would also be substantially simplified, especially for new contributors. Since Cython compiles Python code, it would allow people to contribute a Python implementation of a new feature that compiles down to C. And we all know that developing new functionality is much easier in Python than in C. The remaining task is then only to optimise it, not to rewrite it in a different language.

My feeling is that replacing some parts of the CPython C development with Cython has the potential to bring a visible boost for the contributions to the CPython project.

Update 2018-09-12: Jeroen Demeyer reminded me that I should also mention the ease of wrapping external native libraries. While this is not a big priority for the standard library anymore, it is certainly true that modules like sqlite3 (which wraps the SQLite library), ssl (OpenSSL), expat or even the io module (which wraps system I/O capabilities) would have been easier to write and maintain in Cython than in C. I/O related code in particular tends to be heavy on error handling, which is much nicer to write with raise and f-strings than with error code passing in C.

A really fast Python web server with Cython

Shortly after I wrote about speeding up Python web frameworks with Cython, Nexedi posted an article about their attempt to build a fast multicore web server for Python that can compete with the performance of compiled coroutines in the Go language.

Their goal is to use Cython to build a web framework around a fast native web server, and to use Cython's concurrency and coroutine support to gain native performance also in the application code, without sacrificing the readability that Python provides.

Their experiments look very promising so far. They managed to process 10K requests per second concurrently, requests which actually do real processing work. That is worth noting, because many web server benchmarks out there content themselves with the blank response time for a "hello world", thus ignoring any concurrency overhead and the like. For the simple static "Hello world!", they even got 400K requests per second, which mostly shows how unrealistic such a benchmark is. Under load, their system seems to scale pretty linearly with the number of threads, which is also not a given among web frameworks.

I might personally get involved in further improving Cython for this kind of concurrent, async application. Stay tuned.

Cython for web frameworks

I'm excited to see the Python web community pick up Cython more and more to speed up their web frameworks.

uvloop, a fast drop-in replacement for asyncio, has been around for a while now, and it's mostly written in Cython as a wrapper around libuv. The Falcon web framework optionally compiles itself with Cython, while maintaining PyPy support by also working as a plain Python package. New projects like Vibora show that it pays off to design a framework for both (Flask-like) simplicity and (native) speed from the ground up, leveraging Cython for the critical parts. Quote of the day:

"300.000 req/sec is a number comparable to Go's built-in web server (I'm saying this based on a rough test I made some years ago). Given that Go is designed to do exactly that, this is really impressive. My kudos to your choice to use Cython." – Reddit user 'beertown'.

Alex Orlov gave a talk at PyCon US 2017 about using Cython for more efficient code, in which he mentioned the possibility of speeding up the Django URL dispatcher by 3x, simply by compiling the module as it is.

Especially in async frameworks, minimising the time spent in processing (i.e. outside of the I/O-Loop) is critical for the overall responsiveness and performance. Anton Caceres and I presented fast async code with Cython at EuroPython 2016, showing how to speed up async coroutines by compiling and optimising them.

In order to minimise the processing time on the server, many template engines use native accelerators in one way or another, and writing those in Cython (instead of C/C++) is a huge boost in terms of maintenance (and probably also speed). But several engines also generate Python code from a templating language, and those templates tend to be way more static than not (they are rarely runtime generated themselves). Therefore, compiling the generated template code, or even better, directly targeting Cython with the code generation instead of just plain Python has the potential to speed up the template processing a lot. For example, Cython has very fast support for PEP-498 f-strings and even transforms some '%'-formatting patterns into them to speed them up (also in code that requires backwards compatibility with older Python versions). That can easily make a difference, but also the faster function and method calls or looping code that it generates.

I'm sure there's way more to come and I'm happily looking forward to all those cool developments in the web area that we are only just starting to see appear.

Update 2018-07-15: Nexedi posted an article about their attempts to build a fast web server using Cython, both for the framework layer and the processing at the application layer. Worth keeping an eye on.