Speeding up basic object operations in Cython

Stefan Behnel

2019-02-17 19:24

Raymond Hettinger published a nice little micro-benchmark script that times basic operations such as attribute and item access in CPython and compares their performance across Python versions. Unsurprisingly, Cython performs quite well in comparison to the latest CPython 3.8-pre development version, executing most operations 30-50% faster. But the script also allowed me to squeeze some more performance out of a few operations that were lagging behind. The timings are shown below: first those for CPython 3.8-pre as a baseline, then (for comparison) the Cython timings with all optimisations disabled that can be controlled by C macros (gcc -DCYTHON_...=0), then the normal (optimised) Cython timings, and finally the newly tuned version.

Benchmark	CPython 3.8 (pre)	Cython 3.0 (no opt)	Cython 3.0 (pre)	Cython 3.0 (tuned)
Variable and attribute read access:
read_local	5.5 ns	0.2 ns	0.2 ns	0.2 ns
read_nonlocal	6.0 ns	0.2 ns	0.2 ns	0.2 ns
read_global	17.9 ns	13.3 ns	2.2 ns	2.2 ns
read_builtin	21.0 ns	0.2 ns	0.2 ns	0.1 ns
read_classvar_from_class	23.7 ns	16.1 ns	14.1 ns	14.1 ns
read_classvar_from_instance	20.9 ns	11.9 ns	11.2 ns	11.0 ns
read_instancevar	31.7 ns	22.3 ns	20.8 ns	22.0 ns
read_instancevar_slots	25.8 ns	16.5 ns	15.3 ns	17.0 ns
read_namedtuple	23.6 ns	16.2 ns	13.9 ns	13.5 ns
read_boundmethod	32.5 ns	23.4 ns	22.2 ns	21.6 ns
Variable and attribute write access:
write_local	6.4 ns	0.2 ns	0.1 ns	0.1 ns
write_nonlocal	6.8 ns	0.2 ns	0.1 ns	0.1 ns
write_global	22.2 ns	13.2 ns	13.7 ns	13.0 ns
write_classvar	114.2 ns	103.2 ns	113.9 ns	94.7 ns
write_instancevar	49.1 ns	34.9 ns	28.6 ns	29.8 ns
write_instancevar_slots	33.4 ns	22.6 ns	16.7 ns	17.8 ns
Data structure read access:
read_list	23.1 ns	5.5 ns	4.0 ns	4.1 ns
read_deque	24.0 ns	5.7 ns	4.3 ns	4.4 ns
read_dict	28.7 ns	21.2 ns	16.5 ns	16.5 ns
read_strdict	23.3 ns	10.7 ns	10.5 ns	12.0 ns
Data structure write access:
write_list	28.0 ns	8.2 ns	4.3 ns	4.2 ns
write_deque	29.5 ns	8.2 ns	6.3 ns	6.4 ns
write_dict	32.9 ns	24.0 ns	21.7 ns	22.6 ns
write_strdict	29.2 ns	16.4 ns	15.8 ns	16.0 ns
Stack (or queue) operations:
list_append_pop	63.6 ns	67.9 ns	20.6 ns	20.5 ns
deque_append_pop	56.0 ns	81.5 ns	159.3 ns	46.0 ns
deque_append_popleft	58.0 ns	56.2 ns	88.1 ns	36.4 ns
Timing loop overhead:
loop_overhead	0.4 ns	0.2 ns	0.1 ns	0.2 ns
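
As a quick sanity check of the "30-50% faster" figure, the relative time saved can be computed from a few rows of the table above. The sketch below compares the CPython 3.8 (pre) column with the Cython 3.0 (pre) column. The helper name time_saved_percent is made up for illustration, and reading "faster" as "percent less time per operation" is an assumption about how the figure was meant.

```python
# Timings in ns, copied from the table above:
# (CPython 3.8 pre, Cython 3.0 pre)
timings = {
    "read_classvar_from_class": (23.7, 14.1),
    "read_instancevar": (31.7, 20.8),
    "write_instancevar": (49.1, 28.6),
    "read_dict": (28.7, 16.5),
}


def time_saved_percent(cpython_ns, cython_ns):
    """Hypothetical helper: share of per-operation time saved by Cython, in percent."""
    return (1.0 - cython_ns / cpython_ns) * 100.0


savings = {name: time_saved_percent(*pair) for name, pair in timings.items()}

for name, saved in savings.items():
    print(f"{name:28s} {saved:5.1f}% less time")
```

The rows were picked by hand, so this only illustrates the arithmetic. It is not a summary of the whole table.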

Some things that are worth noting:

There is always a bit of variance across the runs, so don't get excited about a couple of percent difference.
Read/write access to local variables cannot be measured meaningfully in Cython because it uses local/global C variables for them, and the C compiler discards any access whose result is unused. But don't worry, they are really fast.
Builtins (and module global variables in Py3.6+) are cached, which explains the "close to nothing" timings for them above.
Even with several optimisations disabled, Cython code is still visibly faster than CPython.
The write_classvar benchmark revealed a performance problem in CPython that is being worked on.
The deque-related benchmarks revealed performance problems in Cython that are now fixed, as you can see in the last column.
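
For readers who want to collect numbers of this kind themselves, here is a minimal sketch of such a micro-benchmark, written with the standard timeit module. It is not Raymond Hettinger's script. The function names merely mirror rows of the table, and ns_per_iteration as well as the repetition counts are assumptions chosen for illustration. Taking the minimum over several repeats is one common way to dampen the run-to-run variance mentioned above.

```python
from collections import deque
from timeit import repeat


def list_append_pop(trials):
    # One append and one pop per loop iteration, using a list as a stack.
    stack = []
    append, pop = stack.append, stack.pop
    for _ in range(trials):
        append(1)
        pop()


def deque_append_pop(trials):
    # The same stack pattern, using a deque.
    stack = deque()
    append, pop = stack.append, stack.pop
    for _ in range(trials):
        append(1)
        pop()


def loop_overhead(trials):
    # Empty loop, to show how much of each timing is the loop itself.
    for _ in range(trials):
        pass


def ns_per_iteration(func, trials=10_000, number=10, repeats=5):
    """Hypothetical helper: best-of-N time per loop iteration, in nanoseconds."""
    best = min(repeat(lambda: func(trials), number=number, repeat=repeats))
    return best / (number * trials) * 1e9


for func in (list_append_pop, deque_append_pop, loop_overhead):
    print(f"{func.__name__:20s} {ns_per_iteration(func):7.1f} ns")
```

In principle, the same file can be run in CPython and then compiled with Cython and run again to compare the two. The absolute numbers will depend on the machine and the build, so they are not expected to match the table.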