What CPython could use Cython for

Stefan Behnel

2018-09-09 22:42

There has been a recent discussion about using Cython for CPython development. I think this is a great opportunity for the CPython project to make more efficient use of its scarcest resource: developer time of its spare time contributors and maintainers.

The entry level for new contributors to the CPython project is often perceived to be quite high. While many tasks are actually beginner friendly, such as helping with the documentation or adding features to the Python modules in the stdlib, such important tasks as fixing bugs in the core interpreter, working on data structures, optimising language constructs, or improving the test coverage of the C-API require a solid understanding of C and the CPython C-API.

Since a large part of CPython is implemented in C, and since it exposes a large C-API to extensions and applications, C level testing is key to providing a correct and reliable native API. There were a couple of cases in the past years where new CPython releases actually broke certain parts of the C-API, and it was not noticed until people complained that their applications broke when trying out the new release. This is because the test coverage of the C-API is much lower than the well tested Python level and standard library tests of the runtime. And the main reason for this is that it is much more difficult to write tests in C than in Python, so people have a high incentive to get around it if they can. Since the C-API is used internally inside of the runtime, it is often assumed to be implicitly tested by the Python tests anyway, which raises the bar for an explicit C test even further. But this implicit coverage is not always given, and it also does not reduce the need for regression tests. Cython could help here by making it easier to write C level tests that integrate nicely with the existing Python unit test framework that the CPython project uses.

Basically, writing a C level test in Cython means writing a Python unittest function and then doing an explicit C operation in it that represents the actual test code. Here is an example for testing the PyList_Append C-API function:

from cpython.object cimport PyObject
from cpython.list cimport PyList_Append

def test_PyList_Append_on_empty_list():
    # setup code
    l = []
    assert len(l) == 0
    value = "abc"
    pyobj_value = <PyObject*> value
    refcount_before = pyobj_value.ob_refcnt

    # conservative test call, translates to the expected C code,
    # although with automatic exception propagation if it returns -1:
    errcode = PyList_Append(l, value)

    # validation
    assert errcode == 0
    assert len(l) == 1
    assert l[0] is value
    assert pyobj_value.ob_refcnt == refcount_before + 1

In the Cython project itself, what we actually do is to write doctests. The functions and classes in a test module are compiled with Cython, and the doctests are then executed in Python, and call the Cython implementations. This provides a very nice and easy way to compare the results of Cython operations with those of Python, and also trivially supports data driven tests, by calling a function multiple times from a doctest, for example:

from cpython.number cimport PyNumber_Add

def test_PyNumber_Add(a, b):
    """
    >>> test_PyNumber_Add('abc', 'def')
    'abcdef'
    >>> test_PyNumber_Add('abc', '')
    'abc'
    >>> test_PyNumber_Add(2, 5)
    7
    >>> -2 + 5
    3
    >>> test_PyNumber_Add(-2, 5)
    3
    """
    # The following is equivalent to writing "return a + b" in Python or Cython.
    return PyNumber_Add(a, b)

This could even trivially be combined with hypothesis and other data driven testing tools.

But Cython's use cases are not limited to testing. Maintenance and feature development would probably benefit even more from a reduced entry level.

Many language optimisations are applied in the AST optimiser these days, and that is implemented in C. However, these tree operations can be fairly complex and are thus non-trivial to implement. Doing that in Python rather than C would be much easier to write and maintain, but since this code is a part of the Python compilation process, there's a chicken-and-egg problem here in addition to the performance problem. Cython could solve both problems and allow for more far-reaching optimisations by keeping the necessary transformation code readable.

Performance is also an issue in other parts of CPython, namely the standard library. Several stdlib modules are compute intensive. Many of them have two implementations: one in Python and a faster one in C, a so-called accelerator module. This means that adding a feature to these modules requires duplicate effort, the proficiency in both Python and C, and a solid understanding of the C-API, reference counting, garbage collection, and what not. On the other hand, many modules that could certainly benefit from native performance lack such an accelerator, e.g. difflib, textwrap, fractions, statistics, argparse, email, urllib.parse and many, many more. The asyncio module is becoming more and more important these days, but its native accelerator only covers a very small part of its large functionality, and it also does not expose a native API that performance hungry async tools could hook into. And even though the native accelerator of the ElementTree module is an almost complete replacement, the somewhat complex serialisation code is still implemented completely in Python, which shows in comparison to the native serialisation in lxml.

Compiling these modules with Cython would speed them up, probably quite visibly. For this use case, it is possible to keep the code entirely in Python, and just add enough type declarations to make it fast when compiled. The typing syntax that PEP-484 and PEP-526 added to Python 3.6 makes this really easy and straight forward. A manually written accelerator module could thus be avoided, and therefore a lot of duplicated functionality and maintenance overhead.

Feature development would also be substantially simplified, especially for new contributors. Since Cython compiles Python code, it would allow people to contribute a Python implementation of a new feature that compiles down to C. And we all know that developing new functionality is much easier in Python than in C. The remaining task is then only to optimise it and not to rewrite it in a different language.

My feeling is that replacing some parts of the CPython C development with Cython has the potential to bring a visible boost for the contributions to the CPython project.

Update 2018-09-12: Jeroen Demeyer reminded me that I should also mention the ease of wrapping external native libraries. While this is not something that is a big priority for the standard library anymore, it is certainly true that modules like sqlite (which wraps sqlite3), ssl (OpenSSL), expat or even the io module (which wraps system I/O capabilities) would have been easier to write and maintain in Cython than in C. Especially I/O related code is often intensive in error handling, which is nicer to do with raise and f-strings than error code passing in C.