Cython is 20!

Today, Cython celebrates its 20th anniversary!

On April 4th, 2002, Greg Ewing published the first release of Pyrex 0.1.

Already at the time, it was designed as a compiler that extended the Python language with C data types to build extension modules for CPython. That design has survived the last 20 years and made Pyrex, and then Cython, a major cornerstone of the Python data ecosystem. And far beyond it.

Now, on April 4th, 2022, its heir Cython is still very much alive and easily serves hundreds of thousands of developers worldwide, day in and day out.

I'm very grateful for Greg's ingenious invention at the time. Let's look back at how we got to where we are today.

I came to use Python around the time when Pyrex was first written. Python had already been around for a dozen years; version 2.2 was the latest and had brought us new-style classes, which represented a major redesign of the type system, as well as the iterator protocol and even generators, which you had to enable with from __future__ import generators because of potential conflicts with the new keyword yield. There weren't many times in Python's history when that was necessary. It remains one of the greatest Python releases of all time.

Python 2.3 was in the making, about to bring us another bunch of cool new features like sum(), enumerate(), or sets as a data type in the standard library.

CPython already had a reputation for providing a large standard library, and a whole set of third-party packages added another heap of functionality – although Perl's CPAN still easily dwarfed all of it. There was no pip install anything at the time (not even easy_install), no virtualenvs, no wheels (and no eggs); there were some forms of binary packages, but mostly you'd python setup.py build your own software, especially on Linux. Those were the days.

Many of the binary third-party packages at the time were hand-written using the bare C-API. There was SWIG for generating C wrappers for lots of languages, including Python. It worked, and it was actually great to be able to generate multiple wrappers from a single source. But they all looked mostly the same, and making them work and feel right for each of the language environments was hard, if not impossible. And few people really needed wrappers for more than the one language they used. So, lots of people used the C-API, and CPython had a reputation of being easily extensible in C – assuming you knew C well enough. And more and more Python users didn't.

In came Greg's idea of writing extension modules in Python. Or in something that looked a lot like Python, with a few C-ish extensions.

I don't know how he came up with the name Pyrex, which is a brand name for thermal-resistant glass (originally invented here in Germany in 1893). But Pyrex clearly hit a nerve at the time and grew very quickly. Within weeks and months, there was support not only for functions but for extension types, and for a growing number of Python language features.

By the time I came to use Pyrex, it was already in a very attractive and feature-rich state. From the start, its unique selling point was to allow Python developers to write C code, without having to write C code. And that had made it the basis for large projects like Sage, for which it provided a critical piece of software development infrastructure: the glue code between heaps of C/C++ math libraries and Python.

In 2004, Martijn Faassen took on a project (and he's good at taking on projects) of making XML in Python actually usable. There was support before, there was minidom, there was PyXML with an extended feature set. But many XML features of the time were missing, the tools were memory-hungry and slow, and the DOM API was far from anything Python users would fall in love with.

There was also a Python interface for libxml2, a C library that covered a large part of the important XML technologies at the time. The caveat was that it mostly mapped the C API to Python 1:1, and thus felt excessively C-ish and exotic to Python users, while also making it easy to trigger hard crashes.

There was another alternative, though: ElementTree, designed by the recently deceased Fredrik Lundh (thanks for all the fish, Fredrik). It was not in the standard library at the time; it only got adopted there in Python 2.5 (together with SQLite), one and a half years later. It was an external package based on the pyexpat parser, and it provided a lovely pythonic API for XML processing. But with even less features than minidom.

So, Martijn decided to bring it all together: the bunch of XML features from libxml2, exposed in the pythonic, and already well established, interface of ElementTree. And being a Python developer, wanting to design the interface from the Python point of view, he chose Pyrex to implement that wrapper, and called it lxml.

I found the lxml project almost a year later, while looking for something that I could use as an extensible XML API. I implemented some features for lxml to turn it into that, and, along the way, also made enhancements to Pyrex. Through the Pyrex mailing list, I got in touch with other developers who had their own more or less enhanced versions of Pyrex, including Robert Bradshaw, one of the developers in the Sage project. Eventually, in 2007, we decided to follow the example of the Apache web server and bring the scattered bunch of existing Pyrex patches together into a new project. William Stein from the Sage project came up with a good name, and with the infrastructure to maintain it – github.com hadn't launched yet, and we used the Mercurial DVCS. Thus, the Cython project was born.

It was the beginning of a second, long success story.

In the early years, William Stein was able to provide funding to the Cython project from Sage's resources, given how important Cython was for the development of the Sage Math package. Cython was an integral part of the Sage development sprints called Sage Days. We participated in Google Summer of Code events, which brought us in touch with Dag Sverre Seljebotn and Mark Florisson, both of whom moved Cython's integration with NumPy and data processing forward in large steps. The Sage project also sponsored a workshop in München (where I was living at the time), so that we were able to sit together in person, for the first time, discussing, designing and building many great features in Cython, as well as major advances in the coverage of (then) more recent Python language features, in which Vitja Makarov played an important role.

Over time, the list of contributors to Cython grew longer and longer, from large feature additions to small bug fixes and helpful documentation improvements. In 2008, Lisandro Dalcín and I implemented support for Python 3.0 before it was even released, just as Pyrex and Cython have followed CPython's development ever since they existed, allowing users to easily adapt their extension modules to various C-API changes across (C)Python releases. And in the other direction, some of the optimisations that CPython's own internal code generation tool Argument Clinic employs for fast Python function argument parsing were adopted from Cython.

I remember discussions and collaborations with the CPython developers Yury Selivanov on async features and with Victor Stinner, Petr Viktorin or Nick Coghlan on Python C-API topics. Several PyPy developers, including Ronan Lamy and Matti Picus, have helped in words and code to improve the integration and stability between both tools. The exchange with people from large and impactful projects like NumPy, Pandas and the scikit-* family of tools has always helped to move Cython in a user-centric direction, while giving me the warm feeling that it truly enables its users to get their work done. And the emergence of complementary tools like pybind11 or Numba has helped to diversify the choices throughout the ecosystem in which Cython resides, broadening the field without reducing the impact that the language and compiler have for their users.

Today, after 20 years of development, Cython is a modern programming language, embedded in the Python language rather than the other way round, but still extending it with C/C++ superpowers.

We helped our users help their users through many exciting endeavours along the way: taking pictures of black holes, sending robots to Mars, scaling up Django websites to a billion users, building climate models, and analysing, processing and applying machine learning to human text, real-world images, and other data from countless areas – scientific, financial, economic, ecological, or probably any other kind, from small to large scale.

I'm proud and happy to see how far Cython has come from its early beginnings. And I'm excited to continue seeing where it will go from here.

So, from New Zealand, from Europe and the Americas, from Asia, Australia and Africa, to anywhere on Earth, and maybe Mars…

Happy anniversary, Cython!

The courage to speak

In Celle, there are demonstrations against conspiracy ramblers. That in itself is nothing unusual. By now, most German towns see demonstrations by such people, and counter-demonstrations against those demonstrations, meant to answer the lies, half-truths and social self-segregation with facts, discourse and social solidarity.

At one of the first demonstrations there was an open microphone. The first speaker was a woman who was introduced right away as "someone from the other side". It was visibly hard for her to take that walk. It is hard enough as it is to stand up in front of a crowd and say something. But it is harder still with the burden of representing "the other side". Her words carried little substance, her contribution was not very enlightening. She faltered with anxiety. The point of her speech never really became clear either. She was quickly booed. There were demands to switch off the microphone. In the end she retreated to Bible quotes, which did not help her acceptance either. People walked away so as not to "have to listen to this".

I have to say that, above all, I felt pity for this woman. It seemed courageous to me to dare to step up to that microphone, knowing that it could not possibly earn her any applause. And she did not stand a chance. From the very beginning. I do not know whether, with a better start and less rejection, she would still have managed to say something interesting or relevant. Maybe not. But we did not give her a chance to do so either. From the very beginning.

We, who want to stand on the side of solidarity. On the side of public discourse. We, who want to stand up against the "us versus them" of the "others".

I do not have to listen to everything. If somebody seriously wants to tell me about absurd ideas and outlandish notions, it is my right to laugh about it and close my ears. And there are limits to what is bearable and acceptable. When somebody spreads lies that endanger other people, that are meant to keep them from safe behaviour and from protecting the disadvantaged, there is only one answer: contradiction. When somebody points a finger at people and says "that is life unworthy of living", there is only one answer: contradiction. When somebody fights social solidarity, pits an "us versus them" against living together, excludes and endangers people – then it is time to stand up. To say no. Racism is not an opinion. Dehumanisation is not an opinion.

But we should be aware that in this fight we are facing human beings. These people may be misguided, may hold strange and absurd beliefs. But they are still human beings. We may contradict their ideology, counter their ramblings with facts, their lack of solidarity with role models, and the danger they pose with mutual help and confidence. We may laugh about their confused theories. But we should not retreat into an "us versus them" of our own. We should not declare them dehumanised enemies. Not exclude them because they are "the others".

A divided society is not what we, as a society, should want. An open society has to withstand discourse and dissent. It has to accept people who hold a different opinion. Even when that opinion consists of confused ideas and gets carried into parliaments.

We should not be the ones who sacrifice public discourse just because some individuals abandon it or even torpedo it. The world is not black and white. If we start painting it white, we only help those who want to paint it black. We have to accept that a society is colourful. Sometimes very colourful. And that grey people are people, too.

Local election results for Celle 2021

The results of the local elections in the district of Celle have been published, but apparently only as a ready-made web report. For everyone who would like to have the complete data, I have compiled the results of the local council (Ortsrat) elections here, together with a Jupyter notebook for analysing them.

The data comes from the website of the Landkreis Celle. Since this is generally available, generic data about democratic election results, I assume that it is not subject to copyright or any other legal restrictions.

Should you ship the Cython generated C code or not?

When you use Cython for your Python extensions (not if, when ;-)), there are different opinions on whether you should generate the C code locally and ship it in your sdist source packages on PyPI, or whether you should make Cython a build-time dependency of your package and let users run it on their side.

Both approaches have their pros and cons, but I personally recommend generating the C code on the maintainer side and then shipping it in sdists. Here is a bit of an explanation to help you with your own judgement.

The C code that Cython generates is deterministic and very intentionally adaptive to the environment where you C-compile it. We work hard to do all environment-specific adaptations (Python version, C compiler, …) in the C code and not in the code generator that creates it. It's our sacred cow of "generate once, compile everywhere". And that's one of the main selling points of Cython: we write C so you don't have to. But obviously, once the C code is generated, it cannot take as-of-now unknown future environmental changes into account any more, such as changes to the CPython C-API, which we only cover in newer Cython releases.

Because the C code is deterministic, making Cython a build-time dependency and then pinning an exact Cython version with it is entirely pointless: you can just generate the exact same C code on your side once and ship it. One dependency fewer, and a lot of user-side complexity avoided. So, the only case we're talking about here is allowing different (usually newer) Cython versions to build your code.

If you ship the C file, then you know what you get and you don't depend on whatever Cython version users have installed on their side. You avoid the maintenance burden of having to respond to bug reports for seemingly unrelated C code lines or bugs in certain Cython versions (which users will rarely mention in their bug reports).
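To make the "ship the C file" option concrete, here is a minimal setup.py sketch (the package and module names are made up for illustration): it compiles the pre-generated C file that ships in the sdist, and only falls back to Cython when building from a source checkout where the C file does not exist yet.

# setup.py - a minimal sketch; "mypackage.fast_module" is a placeholder name.
import os.path
from setuptools import setup, Extension

# Prefer the pre-generated C file shipped in the sdist; regenerate it from
# the .pyx source only when it is missing (e.g. in a source checkout).
use_cython = not os.path.exists("mypackage/fast_module.c")
ext = ".pyx" if use_cython else ".c"

extensions = [Extension("mypackage.fast_module", ["mypackage/fast_module" + ext])]

if use_cython:
    from Cython.Build import cythonize
    extensions = cythonize(extensions)

setup(name="mypackage", ext_modules=extensions)

With a setup like this, an sdist that contains the .c file builds without Cython being installed at all, while maintainers working from the repository regenerate it transparently.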

If, instead, you use a recent Cython version at package build time, then you avoid bit rot in the generated C code, but you risk build failures on the user side due to users having a buggy Cython version installed (which may not have existed when you shipped the package, so you couldn't exclude it from the dependency range). Or your code may fail to compile with a freshly released Cython due to incompatible language changes in the new version. However, if those (somewhat exceptional) cases don't happen, then you may end up with a setting in which your code also adapts to newer environments, by automatically using a recent Cython version whose generated C code already knows about the new environment. That is definitely an advantage.

Basically, for maintained packages, I consider shipping the generated C code the right way. Less hassle, easier debugging, better user experience. For unmaintained packages, regenerating the C code at build time can extend the lifetime of the package to newer environments for as long as it does not run into failures due to incompatible Cython compiler changes (so you trade one compatibility level for another one).

The question is whether the point at which a package becomes unmaintained can ever be clear enough to make the switch. Regardless of which way you choose, as with all code out there, at some point in the future someone will have to do something, either to your package/code or to your build setup, in order to prevent fatal bit rot. But using Cython in the first place should at least ease the pain of getting over that point when it occurs.

My responses to a Cython dev interview

I recently received a request for an online interview by Jonathan Ruiz, a CS student in Berlin. He's implementing graph algorithms as part of his final Bachelor thesis, and was evaluating and using Cython to get performance improvements. During his work, he thought it'd be nice to get some comments from a Cython core dev and sent me a couple of questions. Here's what I answered.

  1. First of all, thank you Stefan for your time in this difficult situation.

    Thanks for asking me.

  2. How did your interest in programming and then compilers begin?

    I have a pretty straightforward background and education in computer science and software development. But I'm not a compiler expert. In fact, I'm not even working on a compiler in the true sense. I'm working on Cython, which is a source code translator and code generator. The actual native code generation is then left to a C compiler. However, we avoid that distinction ourselves in the project because in the end, people use Cython to compile Python down to native code. So the distinction is more of an implementation detail.

    I came to Cython through a bit of a diversion. I needed a Python XML library for the proof-of-concept implementation of my doctoral thesis somewhere around 2005. Not long before that, Martijn Faassen had started writing an ElementTree-like wrapper for the XML library libxml2, called lxml, which had several features that I needed and was easy enough for me to hack on to get the missing features implemented.

    lxml was written in Pyrex, a Python-like language with its own code generator, and I ended up implementing a couple of features in that code generator that helped me in my work on lxml. Not all of these changes were accepted upstream, at least not in a timely fashion, and at some point I found that others had that problem, too, and had ended up with their own long-term forks. Robert Bradshaw and William Stein from the University of Washington in Seattle, USA, and I decided to fork Pyrex for good and start a new official project, which we named Cython. That was in 2007, and I've worked on the Cython project ever since.

  3. What advice would you give to students who want to break into this field?

    Read code. Seriously. There is a lot that you can learn at a university about algorithms, about smart ideas that people came up with, about ways to tell and decide what's smart and what isn't, about the way things work (and should work) in general. A CS degree is an excellent way to set a basis for your future software design endeavours.

    But there's nothing that comes close to reading other people's code when you're trying to understand how things work in real life and why the tools at hand don't do what you want them to do. And then fixing them to do it.

  4. Which branches of mathematics do you think are important to become a good programmer or what particularly benefited you in Cython optimisation, for example?

    I would love to say that my math education at university helped me here and there, but in retrospect, I have to admit that I could have reached the point where I stand now with just my math lessons at school (although those were pretty decent, I guess). I would claim that statistics is surprisingly important in real life and software development, and is not always taught deeply enough at school (nor in CS studies), IMHO. Even just the understanding that the result of a benchmark run is just a single value in a cloud of scattered results really helps put those numbers into context in your head.

    There are definitely fields in software development in which math is more helpful than in the fields I've mostly touched. Graphics comes to mind, for example. But I think what's much more important than a math education is the ability to read and learn, and to be curious about the work of others. Because these days, 95+% of our software development work is based on what others have already done before us (and for us). Use existing tools, learn how they work and what their limits are, and then extend those limits when you need to.

  5. If I'm not mistaken, since April 2019 you have also been a core developer in CPython: what responsibilities does this position entail?

    The main (and most obvious) difference is the ability to click the green merge button on GitHub. :) Seriously, you can do a lot of great work in a project without ever clicking that button. You can create tickets, investigate bugs, write documentation, recommend cool projects to others, help people use them, participate in design discussions, write feature pull requests. You can move a project truly forward without being a "core developer". But once you have the merge right, you are taking over the responsibility for the code that you merge by clicking that button, wherever that code came from. If that code unexpectedly breaks someone's computer at the other end of the world, you are the one who has to fix it, somehow. Even if just by reverting the merge – but you have to do something. That changes your perspective on that piece of code a lot. Even if the author is you.

    Being a core developer in a project is really more of an obligation than an honour. But it can also give you a better standing in a project, because others can see that you are taking responsibility for it. So it comes with a bit of a social status, too.

  6. Cython has been and is a key tool in scientific projects, such as the Event Horizon Telescope. Which scientific libraries are you missing in Cython right now? Are there any special ones that you are working on?

    I'm not working on scientific libraries myself, although I know a lot of people from other projects in the field. I'm not missing anything here, personally. :)

    OTOH, I like hearing about things that others do with Cython. And I like to help others to make Cython do great things for them.

    The really cool thing about OpenSource software development is that I'm creating "eternal" values every day. Whatever I write today may end up helping some person on the other side of the planet, or next door, to invent something cool, to answer the last questions about life, the universe and everything, to save the world or someone else's life. That's their projects, their ideas and their work, but it's the software that I am writing together with lots of other people that helps them get their work done. And that is a great feeling.

  7. Are there any important features that you would like to implement in Cython in the future?

    The issue tracker has more than 740 open tickets right now. :) But that answer misses the point. I think the most important goal is to keep helping users get unstuck when they run into something that they can't really (or easily) solve themselves. Or to help them fix their own problems in a way that more people can benefit from. Cython is a tool for others to use for their own needs. It should continue to achieve that.

  8. How do you see Cython ten years from now?

    I never liked that question when interviewing dev candidates, and I'm not going to answer it now. ;-) Ten years is about 10-20% of a human's total lifetime. And it's half of an eternity in Tech. It's a very long time. I like how Niels Bohr (supposedly) put it: "predictions are hard to make, especially about the future".

  9. Thank you very much again for your time, Stefan. Take good care of yourself.

    Have fun and stay safe.

Faster XML stream processing in Python

It's been a while since I last wrote something about processing XML, specifically about finding something in XML. Recently, I read a blog post by Eli Bendersky about faster XML processing in Go, and he was comparing it to iterparse() in Python's ElementTree and lxml. Basically, all he said about lxml is that it performs more or less like ElementTree, so he concentrated on the latter (and on C and Go). That's not wrong to say, but it also doesn't help much. lxml has much more fine-grained tools for processing XML, so here's a reply.

I didn't have the exact same XML input file that Eli used, but I used the same (deterministic, IIUC) tool for generating one, running xmlgen -f2 -o bench.xml. That resulted in a 223MiB XML file of the same structure that Eli used, thus probably almost the same as his.

Let's start with the original implementation:

import sys
import xml.etree.ElementTree as ET

count = 0
for event, elem in ET.iterparse(sys.argv[1], events=("end",)):
    if event == "end":
        if elem.tag == 'location' and elem.text and 'Africa' in elem.text:
            count += 1
        elem.clear()

print('count =', count)

The code parses the XML file, searches for location tags, and counts those that contain the word Africa.

Running this under time with ElementTree in CPython 3.6.8 (Ubuntu 18.04) shows:

count = 92
4.79user 0.08system 0:04.88elapsed 99%CPU (0avgtext+0avgdata 14828maxresident)k

We can switch to lxml (4.3.4) by changing the import to import lxml.etree as ET:

count = 92
4.58user 0.08system 0:04.67elapsed 99%CPU (0avgtext+0avgdata 23060maxresident)k

You can see that it uses somewhat more memory overall (~23MiB), but runs just a little faster, not even 5%. Both are roughly comparable.

For comparison, the baseline memory usage of doing nothing but importing ElementTree versus lxml is:

$ time python3.6 -c 'import xml.etree.ElementTree'
0.08user 0.01system 0:00.09elapsed 96%CPU (0avgtext+0avgdata 9892maxresident)k
0inputs+0outputs (0major+1202minor)pagefaults 0swaps

$ time python3.6 -c 'import lxml.etree'
0.07user 0.01system 0:00.09elapsed 96%CPU (0avgtext+0avgdata 15264maxresident)k
0inputs+0outputs (0major+1742minor)pagefaults 0swaps

Back to our task at hand. As you may know, global variables in Python are more costly than local variables, and as you certainly know, module-level global code is hard to test. So, let's start with something obvious that we would always do in Python: write a function.

import sys
import lxml.etree as ET

def count_locations(file_path, match):
    count = 0
    for event, elem in ET.iterparse(file_path, events=("end",)):
        if event == "end":
            if elem.tag == 'location' and elem.text and match in elem.text:
                count += 1
            elem.clear()
    return count

count = count_locations(sys.argv[1], 'Africa')
print('count =', count)
count = 92
4.39user 0.06system 0:04.46elapsed 99%CPU (0avgtext+0avgdata 23264maxresident)k

Another thing we can see is that we're explicitly asking for only end events, and then checking whether the event we got is an end event. That's redundant. Removing this check (the resulting function is shown below the timings) yields:

count = 92
4.24user 0.06system 0:04.31elapsed 99%CPU (0avgtext+0avgdata 23264maxresident)k
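For reference, this is the function after dropping the redundant check – no new names here, just the version above minus one line:

def count_locations(file_path, match):
    count = 0
    for event, elem in ET.iterparse(file_path, events=("end",)):
        if elem.tag == 'location' and elem.text and match in elem.text:
            count += 1
        elem.clear()
    return count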

Ok, another tiny improvement. We won a couple of percent, although not really worth mentioning. Now let's see what lxml's API can do for us.

First, let's look at the structure of the XML file. Nicely, the xmlgen tool has a mode for generating an indented version of the same file, which makes it easier to investigate. Here's the start of the indented version of the file (note that we are always parsing the smaller version of the file, which contains newlines but no indentation):

<?xml version="1.0" standalone="yes"?>
<site>
  <regions>
    <africa>
      <item id="item0">
        <location>United States</location>
        <quantity>1</quantity>
        <name>duteous nine eighteen </name>
        <payment>Creditcard</payment>
        <description>
          <parlist>
            <listitem>
              <text>

The root tag is site, which contains regions (apparently one per continent), which in turn contain a series of item elements, each with a location. In a real data file, it would probably be enough to only look at the africa region when looking for Africa as a location, but a) this is (pseudo-)randomly generated data, b) even "real" data isn't always clean, and c) a location "Africa" actually seems weird when the region is already africa.

Anyway. Let's assume we have to look through all regions to get a correct count. But given the structure of the item tag, we can simply select the location elements and do the following in lxml:

def count_locations(file_path, match):
    count = 0
    for event, elem in ET.iterparse(file_path, events=("end",), tag='location'):
        if elem.text and match in elem.text:
            count += 1
        elem.clear()
    return count
count = 92
3.06user 0.62system 0:03.68elapsed 99%CPU (0avgtext+0avgdata 1529292maxresident)k

That's a lot faster. But what happened to the memory? 1.5 GB? We used to be able to process the whole file with only 23 MiB peak!

The reason is that the loop now only runs for location elements, and everything else is only handled internally by the parser – and the parser builds an in-memory XML tree for us. The elem.clear() call, which we previously used for deleting the already processed parts of that tree, is now only executed for the location elements – pure text tags – and thus cleans up almost nothing. We need to take care to clean up more again, so let's intercept on the item and look for the location from there.

def count_locations(file_path, match):
    count = 0
    for _, elem in ET.iterparse(file_path, events=("end",), tag='item'):
        text = elem.findtext('location')
        if text and match in text:
            count += 1
        elem.clear()
    return count
count = 92
3.11user 0.37system 0:03.50elapsed 99%CPU (0avgtext+0avgdata 994280maxresident)k

Ok, almost as fast, but still – 1 GB of memory? Why doesn't the cleanup work? Let's look at the file structure some more.

$ egrep -n '^(  )?<' bench_pp.xml
1:<?xml version="1.0" standalone="yes"?>
2:<site>
3:  <regions>
2753228:  </regions>
2753229:  <categories>
2822179:  </categories>
2822180:  <catgraph>
2824181:  </catgraph>
2824182:  <people>
3614042:  </people>
3614043:  <open_auctions>
5520437:  </open_auctions>
5520438:  <closed_auctions>
6401794:  </closed_auctions>
6401795:</site>

Ah, so there is actually much more data in there that is completely irrelevant for our task! All we really need to look at is the first ~2.7 million lines that contain the regions data. The entire second half of the file is useless; it simply generates heaps of data that our cleanup code does not handle. Let's put that knowledge to use in our code. We can intercept on both the item and the regions tags, and stop as soon as the regions data section ends.

def count_locations(file_path, match):
    count = 0
    for _, elem in ET.iterparse(file_path, events=("end",), tag=('item', 'regions')):
        if elem.tag == 'regions':
            break
        text = elem.findtext('location')
        if text and match in text:
            count += 1
        elem.clear()
    return count
count = 92
1.22user 0.04system 0:01.27elapsed 99%CPU (0avgtext+0avgdata 22048maxresident)k

That's great! We're actually using less memory than in the beginning now, and managed to cut down the runtime from 4.6 seconds to 1.2 seconds. That's almost a factor of 4!

Let's try one more thing. We are already intercepting on two tag names, and then searching for a third one. Why not intercept on all three directly?

def count_locations(file_path, match):
    count = 0
    for _, elem in ET.iterparse(file_path, events=("end",),
                                tag=('item', 'location', 'regions')):
        if elem.tag == 'location':
            text = elem.text
            if text and match in text:
                count += 1
        elif elem.tag == 'regions':
            break
        else:
            elem.clear()
    return count
count = 92
1.10user 0.03system 0:01.13elapsed 99%CPU (0avgtext+0avgdata 21912maxresident)k

Nice. Another bit faster, and another bit less memory used.

Anything else we can do? Yes. We can tune the parser a little more. Since we're only interested in the non-empty text content inside of tags, we can ignore all newlines that appear in our input file between the tags. lxml's parser has an option for removing such blank text, which avoids creating an in-memory representation for it.

def count_locations(file_path, match):
    count = 0
    for _, elem in ET.iterparse(file_path, events=("end",),
                                tag=('item', 'location', 'regions'),
                                remove_blank_text=True):
        if elem.tag == 'location':
            text = elem.text
            if text and match in text:
                count += 1
        elif elem.tag == 'regions':
            break
        else:
            elem.clear()
    return count
count = 92
0.97user 0.02system 0:01.00elapsed 99%CPU (0avgtext+0avgdata 21928maxresident)k

While the overall memory usage didn't change, the processing time saved by not creating the useless text nodes (and not having to clean them up from memory again) is quite visible.

Overall, algorithmically improving our code and making better use of lxml's features gave us a speedup from initially 4.6 seconds down to one second. And we paid for that improvement with 4 additional lines of code inside our function. That's only half of the code which Eli's SAX-based Go implementation needs (which, mind you, does not build an in-memory tree for you at all). And the Go code is only slightly faster than the initial Python implementations that we started from. Way to go! ;-)

Speaking of SAX, lxml also has a SAX interface. So let's compare how that performs.

import sys
import lxml.etree as ET

class Done(Exception):
    pass

class SaxCounter:
    def __init__(self, match):
        self.count = 0
        self.match = match
        self.text = []
        # the parser feeds character data to .data(); collect it in a list
        self.data = self.text.append

    def start(self, tag, attribs):
        # reset the collected text at each element start
        del self.text[:]

    def end(self, tag):
        if tag == 'location':
            if self.text and self.match in ''.join(self.text):
                self.count += 1
        elif tag == 'regions':
            raise Done()

    def close(self):
        pass

def count_locations(file_path, match):
    target = SaxCounter(match)
    parser = ET.XMLParser(target=target)
    try:
        ET.parse(file_path, parser=parser)
    except Done:
        pass
    return target.count

count = count_locations(sys.argv[1], 'Africa')
print('count =', count)
count = 92
1.23user 0.02system 0:01.25elapsed 99%CPU (0avgtext+0avgdata 16060maxresident)k

And the exact same code works in ElementTree if you change the import again:

count = 92
1.83user 0.02system 0:01.85elapsed 99%CPU (0avgtext+0avgdata 10280maxresident)k

Also, removing the regions check from the end() SAX method above, thus reading the entire file, yields this for lxml:

count = 92
3.22user 0.04system 0:03.27elapsed 99%CPU (0avgtext+0avgdata 15932maxresident)k

and this for ElementTree:

count = 92
4.72user 0.07system 0:04.79elapsed 99%CPU (0avgtext+0avgdata 10300maxresident)k

Seeing the numbers in comparison to iterparse(), it does not seem worth the complexity, unless the memory usage is really, really pressing.

A final note: here's the improved ElementTree iterparse() implementation that also avoids parsing useless data.

import sys
import xml.etree.ElementTree as ET

def count_locations(file_path, match):
    count = 0
    for event, elem in ET.iterparse(file_path, events=("end",)):
        if elem.tag == 'location':
            if elem.text and match in elem.text:
                count += 1
        elif elem.tag == 'regions':
            break
        elem.clear()
    return count

count = count_locations(sys.argv[1], 'Africa')
print('count =', count)
count = 92
1.71user 0.02system 0:01.74elapsed 99%CPU (0avgtext+0avgdata 11876maxresident)k

And while not as fast as the lxml version, it still runs considerably faster than the original implementation. And uses less memory.

Learnings to take away:

  • Say what you want.

  • Stop when you have it.

Speeding up basic object operations in Cython

Raymond Hettinger published a nice little micro-benchmark script for timing basic operations like attribute or item access in CPython and comparing the performance across Python versions. Unsurprisingly, Cython performs quite well in comparison to the latest CPython 3.8-pre development version, executing most operations 30-50% faster. But the script allowed me to tune some more performance out of certain operations that performed less well. The timings are shown below: first those for CPython 3.8-pre as a baseline, then (for comparison) the Cython timings with all optimisations disabled that can be controlled by C macros (gcc -DCYTHON_...=0), then the normal (optimised) Cython timings, and the now improved version at the end.
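For the curious, the "no opt" column was produced by defining the corresponding C macros as 0 when compiling the benchmark module. A rough sketch of how such a build can be set up is shown here; the module name "var_access_benchmark" and the selection of macros are just illustrative examples (the full list of CYTHON_* switches can be found near the top of the generated C file).

# setup.py - sketch only; module name and macro selection are illustrative.
from setuptools import setup, Extension
from Cython.Build import cythonize

# Defining these macros as 0 switches off some of the C-level optimisations
# in the generated module.
no_opt_macros = [
    ("CYTHON_USE_TYPE_SLOTS", "0"),
    ("CYTHON_USE_DICT_VERSIONS", "0"),
    ("CYTHON_UNPACK_METHODS", "0"),
]

extensions = [
    Extension("var_access_benchmark", ["var_access_benchmark.py"],
              define_macros=no_opt_macros),
]

setup(ext_modules=cythonize(extensions, language_level=3))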

                               CPython 3.8   Cython 3.0   Cython 3.0   Cython 3.0
                                     (pre)     (no opt)        (pre)      (tuned)

Variable and attribute read access:

  read_local                      5.5 ns       0.2 ns       0.2 ns       0.2 ns
  read_nonlocal                   6.0 ns       0.2 ns       0.2 ns       0.2 ns
  read_global                    17.9 ns      13.3 ns       2.2 ns       2.2 ns
  read_builtin                   21.0 ns       0.2 ns       0.2 ns       0.1 ns
  read_classvar_from_class       23.7 ns      16.1 ns      14.1 ns      14.1 ns
  read_classvar_from_instance    20.9 ns      11.9 ns      11.2 ns      11.0 ns
  read_instancevar               31.7 ns      22.3 ns      20.8 ns      22.0 ns
  read_instancevar_slots         25.8 ns      16.5 ns      15.3 ns      17.0 ns
  read_namedtuple                23.6 ns      16.2 ns      13.9 ns      13.5 ns
  read_boundmethod               32.5 ns      23.4 ns      22.2 ns      21.6 ns

Variable and attribute write access:

  write_local                     6.4 ns       0.2 ns       0.1 ns       0.1 ns
  write_nonlocal                  6.8 ns       0.2 ns       0.1 ns       0.1 ns
  write_global                   22.2 ns      13.2 ns      13.7 ns      13.0 ns
  write_classvar                114.2 ns     103.2 ns     113.9 ns      94.7 ns
  write_instancevar              49.1 ns      34.9 ns      28.6 ns      29.8 ns
  write_instancevar_slots        33.4 ns      22.6 ns      16.7 ns      17.8 ns

Data structure read access:

  read_list                      23.1 ns       5.5 ns       4.0 ns       4.1 ns
  read_deque                     24.0 ns       5.7 ns       4.3 ns       4.4 ns
  read_dict                      28.7 ns      21.2 ns      16.5 ns      16.5 ns
  read_strdict                   23.3 ns      10.7 ns      10.5 ns      12.0 ns

Data structure write access:

  write_list                     28.0 ns       8.2 ns       4.3 ns       4.2 ns
  write_deque                    29.5 ns       8.2 ns       6.3 ns       6.4 ns
  write_dict                     32.9 ns      24.0 ns      21.7 ns      22.6 ns
  write_strdict                  29.2 ns      16.4 ns      15.8 ns      16.0 ns

Stack (or queue) operations:

  list_append_pop                63.6 ns      67.9 ns      20.6 ns      20.5 ns
  deque_append_pop               56.0 ns      81.5 ns     159.3 ns      46.0 ns
  deque_append_popleft           58.0 ns      56.2 ns      88.1 ns      36.4 ns

Timing loop overhead:

  loop_overhead                   0.4 ns       0.2 ns       0.1 ns       0.2 ns

Some things that are worth noting:

  • There is always a bit of variance across the runs, so don't get excited about a couple of percent difference.

  • The read/write access to local variables is not reasonably measurable in Cython since it uses local/global C variables, and the C compiler discards any useless access to them. But don't worry, they are really fast.

  • Builtins (and module global variables in Py3.6+) are cached, which explains the "close to nothing" timings for them above.

  • Even with several optimisations disabled, Cython code is still visibly faster than CPython.

  • The write_classvar benchmark revealed a performance problem in CPython that is being worked on.

  • The deque related benchmarks revealed performance problems in Cython that are now fixed, as you can see in the last column.

The lucky 2,000

Every now and then I meet people who look a little surprised at things that "everybody knows". There is a nice xkcd comic about this. I looked it up: in Germany, the birth rate in 2017 was about 785,000 children, with a clearly rising trend since 2011. That means that in this country, every single day, on average around 2,000 people hear for the first time about something that "everybody knows" (at least all adults). Every day, 2,000 people. Let's make it a nice experience for them.

What's new in Cython 0.29?

I'm happy to announce the release of Cython 0.29. In case you haven't heard of Cython before, it's the most widely used statically optimising Python compiler out there. It translates Python (2/3) code to C and makes it as easy as Python itself to tune the code all the way down into fast native code. This time, we added several new features that help with speeding up and parallelising regular Python code to escape from the limitations of the GIL.

So, what exactly makes this another great Cython release?

The contributors

First of all, our contributors. A substantial part of the changes in this release was written by users and non-core developers and contributed via pull requests. A big "Thank You!" to all of our contributors and bug reporters! You really made this a great release.

Above all, Gabriel de Marmiesse has invested a remarkable amount of time into restructuring and rewriting the documentation. It now has a lot less historic smell, and much better, tested (!) code examples. And he obviously found more than one problematic piece of code in the docs that we were able to fix along the way.

Cython 3.0

And this will be the last 0.x release of Cython. The Cython compiler has been in production-critical use for years, all over the world, and there is really no good reason for it to have a 0.x version scheme. In fact, the 0.x release series can easily be counted as 1.x, which is one of the reasons why we have now decided to skip the 1.x series altogether. And, while we're at it, why not the 2.x prefix as well. Shift the decimals of 0.29 a bit to the left, and the next release will be 3.0. The main reason for that is that we want 3.0 to do two things: a) switch the default language compatibility level from Python 2.x to 3.x and b) break with some backwards compatibility issues that get more in the way than they help. We have started collecting a list of things to rethink and change in our bug tracker.

Turning the language level switch is a tiny code change for us, but a larger change for our users and the millions of source lines in their code bases. In order to avoid any resemblance to the years of effort that went into the Py2/3 switch, we took measures that allow users to choose how much effort they want to invest, from "almost none at all" to "as much as they want".

Cython has a long tradition of helping users adapt their code for both Python 2 and Python 3, ever since we ported it to Python 3.0. We used to joke back in 2008 that Cython was the easiest way to migrate an existing Py2 code base to Python 3, and it was never really meant as a joke. Many annoying details are handled internally in the compiler, such as the range versus xrange renaming, or dict iteration. Cython has supported dict and set comprehensions before they were backported to Py2.7, and has long provided three string types (or four, if you want) instead of two. It distinguishes between bytes, str and unicode (and it knows basestring), where str is the type that changes between Py2's bytes str and Py3's Unicode str. This distinction helps users to be explicit, even at the C level, what kind of character or byte sequence they want, and how it should behave across the Py2/3 boundary.

For Cython 3.0, we plan to switch only the default language level, which users can always change via a command line option or the compiler directive language_level. To be clear, Cython will continue to support the existing language semantics. They will just no longer be the default, and users have to select them explicitly by setting language_level=2. That's the "almost none at all" case. In order to prepare the switch to Python 3 language semantics by default, Cython now issues a warning when no language level is explicitly requested, and thus pushes users into being explicit about what semantics their code requires. We obviously hope that many of our users will take the opportunity and migrate their code to the nicer Python 3 semantics, which Cython has long supported as language_level=3.

But we added something even better, so let's see what the current release has to offer.

A new language-level

Cython 0.29 supports a new setting for the language_level directive, language_level=3str, which will become the new default language level in Cython 3.0. We already added it now, so that users can opt in and benefit from it right away, and already prepare their code for the coming change. It's an "in between" kind of setting, which enables all the nice Python 3 goodies that are not syntax compatible with Python 2.x, but without requiring all unprefixed string literals to become Unicode strings when the compiled code runs in Python 2.x. This was one of the biggest problems in the general Py3 migration. And in the context of Cython's integration with C code, it got in the way of our users even a bit more than it would in Python code. Our goals are to make it easy for new users who come from Python 3 to compile their code with Cython and to allow existing (Cython/Python 2) code bases to make use of the benefits before they can make a 100% switch.
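Opting in is a one-liner. You can either put the directive at the top of each source file, or set it globally when calling cythonize in setup.py (the file pattern below is a placeholder):

# At the top of a .pyx or .py source file:
# cython: language_level=3str

# Or project-wide in setup.py:
from Cython.Build import cythonize
extensions = cythonize("src/*.pyx", language_level="3str")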

Module initialisation like Python does

One great change under the hood is that we managed to enable the PEP-489 support (again). It was already mostly available in Cython 0.27, but led to problems that made us back-pedal at the time. Now we believe that we have found a way to bring the saner module initialisation of Python 3.5 to our users, without risking the previous breakage. Most importantly, features like subinterpreter support or module reloading are detected and disabled, so that Cython-compiled extension modules cannot be mistreated in such environments. Actual support for these little-used features will probably come at some point, but will certainly require an opt-in from the users, since it is expected to reduce the overall performance of Python operations quite visibly. The more important features, like a correct __file__ path being available at import time and, in fact, extension modules looking and behaving exactly like Python modules during the import, are much more helpful to most users.

Compiling plain Python code with OpenMP and memory views

Another PEP is worth mentioning next – actually two PEPs: 484 and 526, commonly known as type annotations. Cython has supported type declarations in Python code for years, switched to PEP-484/526 compatible typing in release 0.27 (more than a year ago), and has now gained several new features that make static typing in Python code much more widely usable. Users can now declare their statically typed Python functions as not requiring the GIL, and thus call them from parallel OpenMP loops and parallel Python threads, all without leaving Python code compatibility. Even exceptions can now be raised directly from thread-parallel code, without first having to acquire the GIL explicitly.

And memory views are available in Python typing notation:

import cython
from cython.parallel import prange

@cython.cfunc
@cython.nogil
def compute_one_row(row: cython.double[:]) -> cython.int:
    ...

def process_2d_array(data: cython.double[:,:]):
    i: cython.Py_ssize_t

    for i in prange(data.shape[0], num_threads=16, nogil=True):
        compute_one_row(data[i])

This code will work with NumPy arrays when run in Python, and with any data provider that supports the Python buffer interface when compiled with Cython. As a compiled extension module, it will execute at full C speed, in parallel, with 16 OpenMP threads, as requested by the prange() loop. As a normal Python module, it will support all the great Python tools for code analysis, test coverage reporting, debugging, and what not. Although Cython also has direct support for a couple of those by now. Profiling (with cProfile) and coverage analysis (with coverage.py) have been around for several releases, for example. But debugging a Python module in the interpreter is obviously still much easier than debugging a native extension module, with all the edit-compile-run cycle overhead.
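One practical note on building it: the OpenMP parallelism behind prange() needs explicit compiler and linker flags, which Cython does not add automatically. A minimal setup.py sketch for gcc/clang might look like this (the module and file names are placeholders; MSVC would use /openmp instead):

# setup.py - build sketch; "parallel_rows" is a placeholder module name.
from setuptools import setup, Extension
from Cython.Build import cythonize

extensions = [
    Extension(
        "parallel_rows",
        ["parallel_rows.py"],           # the pure Python source shown above
        extra_compile_args=["-fopenmp"],
        extra_link_args=["-fopenmp"],
    ),
]

setup(ext_modules=cythonize(extensions, language_level=3))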

Cython's support for compiling pure Python code combines the best of both worlds: native C speed, and easy Python code development, with full support for all the great Python 3.7 language features, even if you still need your (compiled) code to run in Python 2.7.

More speed

Several improvements make use of the dict versioning that was introduced in CPython 3.6. It allows module-global names to be looked up much faster, close to the speed of static C globals. Also, the attribute lookup for calls to cpdef methods (C methods with Python wrappers) benefits a lot from it; it can become up to 4x faster.

Constant tuples and slices are now deduplicated and only created once at module init time. Especially with common slices like [1:] or [::-1], this can reduce the amount of one-time initialisation code in the generated extension modules.

The changelog lists several other optimisations and improvements.

Many important bug fixes

We've had a hard time following a change in CPython 3.7 that "broke the world", as Mark Shannon put it. It was meant as a mostly internal change on their side that improved the handling of exceptions inside of generators, but it turned out to break all extension modules out there that were built with Cython, and then some. A minimal fix was already released in Cython 0.28.4, but 0.29 brings complete support for the new generator exception stack in CPython 3.7, which allows exceptions raised or handled by Cython-implemented generators to interact correctly with CPython's own generators. Upgrading is therefore warmly recommended for better CPython 3.7 support. As usual with Cython, translating your existing code with the new release will make it benefit from the new features, improvements and fixes.

Stackless Python has not been a big focus for Cython development so far, but the developers noticed a problem with Cython modules earlier this year. Normally, they try to keep Stackless binary compatible with CPython, but there are corner cases where this is not possible (specifically frames), and one of these broke the compatibility with Cython compiled modules. Cython 0.29 now contains a fix that makes it play nicely with Stackless 3.x.

A funny bug that is worth noting is a mysteriously disappearing string multiplier in earlier Cython versions. A constant expression like "x" * 5 correctly results in the string "xxxxx", but "x" * 5 + "y" came out as "xy" instead of "xxxxxy". Apparently not a common code construct, since no user ever complained about it.

Long-time users of Cython and NumPy will be happy to hear that Cython's memory views are now API-1.7 clean, which means that they can get rid of the annoying "Using deprecated NumPy API" warnings in the C compiler output. Simply append the C macro definition ("NPY_NO_DEPRECATED_API", "NPY_1_7_API_VERSION") to the macro setup of your distutils extensions in setup.py to make them disappear. Note that this does not apply to the old low-level ndarray[...] syntax, which exposes several deprecated internals of the NumPy C-API that are not easy to replace. Memory views are a fast high-level abstraction that does not rely specifically on NumPy and therefore does not suffer from these API constraints.
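For illustration, such a macro setup could look like this in setup.py (the extension and file names are placeholders):

# setup.py - sketch; "fast_math" is a placeholder extension name.
import numpy
from setuptools import setup, Extension
from Cython.Build import cythonize

extensions = [
    Extension(
        "fast_math",
        ["fast_math.pyx"],
        include_dirs=[numpy.get_include()],
        # silences the "Using deprecated NumPy API" C compiler warning
        define_macros=[("NPY_NO_DEPRECATED_API", "NPY_1_7_API_VERSION")],
    ),
]

setup(ext_modules=cythonize(extensions))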

Less compilation :)

And finally, as if to make the point that static compilation is a great tool but not always a good idea, we decided to reduce the number of its own modules that Cython compiles from 13 down to 8, thus keeping 5 more modules normally interpreted by Python. This makes compiler runs about 5-7% slower, but reduces the packaged size and the installed binary size by about half, thus reducing download times in CI builds and virtualenv creations. Python is a very efficient language when it comes to functionality per line of code, and its byte code is similarly high-level and efficient. Compiled native code is a lot larger and more verbose in comparison, and this can easily make the difference of megabytes of shared libraries versus kilobytes of Python modules.

We therefore repeat our recommendation to focus Cython's usage on the major pain points in your application, on the critical code sections that a profiler has pointed you at. The ability to compile those, and to tune them at the C level, is what makes Cython such a great and versatile tool.

The Facebook effect, or: why election results surprise us again these days

Written in July 2016.

Most people have heard of the "small world phenomenon" at some point. It explains various everyday effects, among them the surprise of meeting someone, in the most improbable situation, with whom I share some connection without us ever having met before. Be it a common acquaintance, a similar background, or a shared event in the past. Technically speaking, it describes the property of a network (or graph) in which every node is connected to every other node through an extremely short path. This often applies to human acquaintance relationships and social networks, where the distance between two arbitrarily chosen people is usually less than seven hops via mutual acquaintances.

Especially in what are nowadays called social networks on the Internet, this property is practically celebrated. Since it is so easy to connect with any people (or at least any accounts) anywhere in the world, these networks form very pronounced small-world graphs. Here, every other participant really is just a few clicks away. Total global interconnection. Humanity is finally growing together. Or is it?

What is often forgotten is that there is a second aspect to such graphs. One thing is the minimal distance to every node through the entire graph. The other, however, is the set of direct connections of each individual node. Precisely in the social Internet networks, it is so easy to create new connections and thus gain new "friends" that I can immediately connect with everyone who somehow interests me or whom I encounter in some positive way. Conversely, this means that the second level, that of the "friends" of my "friends", is effectively no longer relevant. Not to mention the third and all further levels. After all, I can also connect directly with all those "friends" of my "friends" who interest me. Which is what I do. And so they become my direct "friends".

But whom do I admit into the circle of my own "friends"? Whom do I "follow" in these social networks? People (or accounts), of course, who think like me, with whom I am in agreement, whom I like. But do I also connect with people who think differently than I do? Who do not share my political views? Who would contradict me if I only talked to them? Or whose way of expressing themselves does not match my social class? Proles? Lefties? Right-wing radicals? People who hit their children? EU opponents? Warm-shower propagandists?

Why should I put myself through that? If the "friends" of my "friends" do such things, then let them. But that does not make them my "friends". Maybe I will occasionally read a comment by these people and get worked up about it, but that is quite enough. I certainly do not need it every day.

So it becomes apparent rather quickly that these social Internet networks amplify an effect that seeks like among like and excludes the other. A modern form of ghettoisation. The world may be ever so small, the shortest path to every person on the planet ever so short – for me, the single, direct connection to those who share my opinion is enough.

In reality, humanity is not growing together. The dividing lines merely shift. Away from place of residence and appearance, towards behaviour, level of education and social differences. And the separation deepens. Why should I even talk to "friends" of my "friends" whom I would never make my own "friends"?

There are many people, young people in particular, who have shifted their media consumption largely or even completely away from the classic media of newspapers and television into social Internet networks. "My friends will keep me up to date" is often the underlying attitude. If something happens, you will hear about it anyway. Sure. But you also run the risk of blanking out the "non-friends" and their opinions. The others. Those with whom I would never connect directly. Because they do not share my opinion. Because they hold an opinion that I reject. That I do not want to hear. That I blank out. That does not match the opinion of my "friends". This is not just how the social networks around an Anders Breivik or those of Pegida work. Self-selecting affiliation and ghetto formation is a fundamental property of social networks on the Internet, whatever the respective selection criteria may be.

One decisive fact that we saw in the vote on Britain's exit from the EU on June 23rd, 2016, was the low turnout among young voters. Only a third of those under 25 went to vote at all. And only half of the 25 to 35 year olds. Even though precisely these age groups benefit the most from EU membership through exchange programmes, freedom of travel and the open labour market, and, compared to the highly engaged over-65s, would have kept benefiting from it for a very, very long time. For most of their entire lives.

There is a good explanation for this: a false sense of security. Many people in the young, well-educated age groups are probably well connected with each other on the Internet, but have few direct contacts with considerably older or socially disadvantaged people. A ghettoisation by age group and social background, in other words. In such ghettos, the impression can quickly arise that there is no need to get involved, because everyone is of the same opinion anyway. The majorities appear to be settled in advance, and since they turn out in my favour, I, as a ghetto inhabitant, feel comfortably and cosily wrapped up in them and lose the pressure to stand up for anything myself. My majority will decide correctly anyway; for me personally, the weather is too bad today, or going to a concert is too important, to head out and vote.

There are further nice examples of this effect. In the primaries for the 2004 US presidential election, the candidate Howard Dean relied almost exclusively on Internet campaigns and organised his candidacy through them. This gave him high visibility among his supporters, among journalists and among other users of these media. Only in the first primaries did it become apparent that this high visibility in Internet campaigns, and the high poll numbers achieved there, did not translate into real election results. A clear case of self-deception within one's own ghetto.

By now, several studies show that people who are active in social Internet networks engage considerably less in their real surroundings. That they easily mistake clicking a "like" button for social engagement. Why take part in a demonstration, have exhausting discussions with people who think differently, or help affected people with donations and deeds, when I can also "show" my opinion by clicking a button or quickly signing a petition? Am I really Charlie, Paris, Brussels, Istanbul, Aleppo or Baghdad just because I clicked a button on the Internet? Or because I "made a statement" with a hashtag? And for whom did I perform that click? For the people affected? Really? Or rather for my "friends", who see my "statement", whom I thereby cosily wrap up, and to whom that statement lets me feel I belong? What good does my click on the "like" button of Ärzte ohne Grenzen (Doctors Without Borders) do for a wounded person in Baghdad?

We need to understand and accept again that pluralism and diversity of opinion are subject to a symmetry and are not just something that concerns those who hold a different opinion. Where there are people of a different opinion, I too am a person of a different opinion. Not even when one of these opinions contradicts fundamental values such as human rights can I be sure that it is only propagated by "idiots" and "outsiders". These categories, too, are merely subjective labels.

Defending the human right to freedom of opinion means, first of all, actually taking notice of other opinions and accepting their existence. Second, showing tolerance towards the people who hold them. And only in third place comes my right to openly contradict the opinions that do not match mine, especially when they contradict my convictions. But there is also the duty to contradict when those opinions are directed against people and minorities. Because every articulated opinion must also be measured against the first article of our Grundgesetz: human dignity is inviolable.

Freedom of opinion does, of course, leave every individual free to ignore other people and their opinions. But it is not an invitation to snuggle up in ghettos and stop talking to each other. It must never lead to entire groups of the population ignoring one another. A democracy lives only through discussion and exchange. Breaking off the dialogue leads straight into ghettoisation, tunnel vision and radicalisation. And to unexpected election results.