… and benchmarks

Kevin Modzelewski has written a blog post where (as I understand it) he is trying to give reasons why Python (specifically the CPython implementation) is perceived as slow. For this, he times variations of the following microbenchmark:

def main():
  for j in range(20):
    for i in range(1000000):
      str(i)

main()

One of these variations is using the built-in map() function:

for j in range(20):
  list(map(str, range(1000000)))

And he also tries Cython on this benchmark. However, for whatever reason, he tries it on the second version, where the Python runtime is doing all the work, not on the first one, which is the one that is computationally heavy in user code. Unsurprisingly, compiling the second version gives almost no speed-up compared to exercising the exact same implementation of the built-in range(), str(), map() and list() functions in the interpreter.

This proves that running the same code twice does not necessarily make it faster. ;-)

So, I tried running the first microbenchmark on my side, and it gave me this:

$ python3.9 -m timeit -s 'from strbench import main' 'main()'
1 loop, best of 5: 4.05 sec per loop

Ok, so that's my baseline. Now, how fast is the same thing in the latest Cython release, 3.0a5?

$ python3.9 cythonize.py -if3 strbench.py
$ python3.9 -m timeit -s 'from strbench import main' 'main()'
1 loop, best of 5: 1.88 sec per loop

So, just by compiling this code in Cython, I got a speed-up of about 53%. It's a somewhat contrived microbenchmark, but fine, why not.
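
As an aside, the same compilation can also be set up with a regular setup.py instead of the cythonize.py command line. This is just a minimal sketch, assuming setuptools and Cython are installed; the file name matches the one above:

# setup.py -- roughly equivalent to "cythonize.py -if3 strbench.py"
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize(
        "strbench.py",                                # compile the plain .py file
        compiler_directives={"language_level": "3"},  # language level 3, the "3" in -if3
        force=True,                                   # recompile even if unchanged, the "f" in -if3
    )
)

Building it with "python3.9 setup.py build_ext --inplace" then corresponds to the in-place "i" in the flags above.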

For comparison, the time that Kevin gave for the first version on his side was 2.33 seconds. A very "don't try this at home" C implementation that lacks any sort of safety and error handling then came in at 1.86 seconds (30% faster), and the last, heavily inlined version that he put together, i.e. the one that did not simply remove all the work from the inner loop, took 0.91 seconds to run, which represents a speed-up of 61%.

So, with a lot of manual tuning work, he ended up with a still somewhat unsafe C implementation of 260 lines (or a bit less, if you remove blank lines), and 61% faster code.

Estimating from the relative speed of the original implementation on Kevin's computer and on mine, the final hand-tuned, fully inlined C version is probably only about 10% faster than the original Python code compiled with Cython.

The next thing I tried was to replace the generic call to str() with an f-string, i.e.

def mainf():
  for j in range(20):
    for i in range(1000000):
      d = f"{i}"
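
At this point, strbench.py presumably contains both variants side by side, so that the timeit runs below can import either function. Roughly like this (the module-level main() call from the first snippet is left out here, since timeit calls the functions itself):

# strbench.py -- both benchmark variants in one module
def main():
  for j in range(20):
    for i in range(1000000):
      str(i)

def mainf():
  for j in range(20):
    for i in range(1000000):
      d = f"{i}"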

I'll make sure the compiled module is out of the way, so that the Python code is used again.

$ rm strbench*.so
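
To double-check which version actually gets imported, the module's __file__ attribute helps; with the extension module removed, it should point back at the .py source (this can take a moment if the module still calls main() at import time):

$ python3.9 -c 'import strbench; print(strbench.__file__)'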

$ python3.9 -m timeit -s 'from strbench import main, mainf' 'mainf()'
1 loop, best of 5: 2.25 sec per loop

This shows that f-strings are much faster in CPython, about 44% in this case. Nice. Let's see what Cython makes of this.

$ python3.9 cythonize.py -if3 strbench.py
$ python3.9 -m timeit -s 'from strbench import main, mainf' 'mainf()'
1 loop, best of 5: 750 msec per loop

That's 67% faster than CPython, and 80% faster than the original version. I also verified via profiling that the C compiler does not pull any dirty tricks here. The run actually executes the intended workload, which is not always the case when you take a microbenchmark from Python to C. :-)
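
For reference, one way to do this kind of sanity check on Linux, sketched here without claiming that it is the exact method used above, is to record the timeit run with perf and look at which functions the time is spent in:

$ perf record -o strbench.perf -- python3.9 -m timeit -s 'from strbench import mainf' 'mainf()'
$ perf report -i strbench.perf

If the intended work is really being done, string construction code from CPython and from the compiled module should dominate the report, rather than just loop overhead.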

So, a straightforward Python implementation of this microbenchmark is already 45% faster in CPython 3.9b1 than the original code, and that is very close to the first manually written, buggy C version. Compiling the Python code with Cython brings the runtime down by another 67%.
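
A plausible reason for the CPython-level difference is visible in the bytecode: str(i) needs a global name lookup plus a generic function call on every iteration, while the f-string compiles down to CPython's dedicated FORMAT_VALUE instruction. A quick way to see this with the dis module (on CPython 3.9; the two helper functions are just for illustration):

import dis

def with_str(i):
  return str(i)

def with_fstring(i):
  return f"{i}"

# with_str() disassembles to LOAD_GLOBAL (str), LOAD_FAST (i) and a generic
# CALL_FUNCTION, while with_fstring() only needs LOAD_FAST (i) and FORMAT_VALUE.
dis.dis(with_str)
dis.dis(with_fstring)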

Estimating from the relative speeds again, CPython 3.9b1 really seems to be a little faster than Kevin's NodeJS version of the modified benchmark, and the Cython-compiled code should run about 3x as fast, in the same ballpark as the timings that he presented for PyPy (which I don't entirely trust, because PyPy might have detected that the result of the string conversion is unused and could have avoided doing it altogether).
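
One way to take that doubt out of such a comparison is to make the benchmark consume its result, so that even an aggressively optimising runtime has to produce the strings. A minimal sketch of such a variant (not timed here):

def main_checked():
  total = 0
  for j in range(20):
    for i in range(1000000):
      # use the result so that the conversion cannot be optimised away
      total += len(str(i))
  return total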

I have to say that I also cannot confirm the argument passing overhead of 31% that Kevin found. I get something like 13% for the original code with Python 3.9b1. That's still a lot, but it's not close to the time that goes into the intended work, let alone matching it. The timings are probably different for CPython 3.7, which Kevin was using. But then, going back to Py3.7 means ignoring almost two years of CPython core development.

Finally, Kevin also wrote: "I ran Cython (a Python->C converter) on the previous benchmark, and it runs in exactly the same amount of time: 2.11s. I wrote a simplified C extension in 36 lines compared to Cython's 3600, and it too runs in 2.11s." Suggestive wording aside, let me add that these 3600 lines of C code deal mostly with portability, inline optimisations and error handling. It makes a difference whether your code can be thrown away tomorrow and doesn't crash often enough for you to notice, or whether it has to run in production and match Python nuances in various forms and versions, while trying to take the best advantage of whatever the runtime environment has to offer. Plus, these code lines are generated, so they don't really cost anything. Putting together the 260 lines of C code for the benchmark certainly was a lot more costly in the short term, and potentially also in the long term, in case such code ends up not just being throw-away code.

As always, you shouldn't believe microbenchmarks. But if you do, then please try to ensure at least a somewhat fair comparison across the different approaches that you use.