Speeding up basic object operations in Cython
Raymond Hettinger published a nice little micro-benchmark script for timing basic operations like attribute or item access in CPython and comparing their performance across Python versions. Unsurprisingly, Cython performs quite well in comparison to the latest CPython 3.8-pre development version, executing most operations 30-50% faster. But the script also allowed me to squeeze some more performance out of a few operations that did not perform as well. The timings are shown below: first those for CPython 3.8-pre as a baseline, then (for comparison) the Cython timings with all optimisations disabled that can be controlled by C macros (gcc -DCYTHON_...=0), then the normal (optimised) Cython timings, and finally the now improved version.
| Benchmark | CPython 3.8 (pre) | Cython 3.0 (no opt) | Cython 3.0 (pre) | Cython 3.0 (tuned) |
|---|---|---|---|---|
| Variable and attribute read access: | | | | |
| read_local | 5.5 ns | 0.2 ns | 0.2 ns | 0.2 ns |
| read_nonlocal | 6.0 ns | 0.2 ns | 0.2 ns | 0.2 ns |
| read_global | 17.9 ns | 13.3 ns | 2.2 ns | 2.2 ns |
| read_builtin | 21.0 ns | 0.2 ns | 0.2 ns | 0.1 ns |
| read_classvar_from_class | 23.7 ns | 16.1 ns | 14.1 ns | 14.1 ns |
| read_classvar_from_instance | 20.9 ns | 11.9 ns | 11.2 ns | 11.0 ns |
| read_instancevar | 31.7 ns | 22.3 ns | 20.8 ns | 22.0 ns |
| read_instancevar_slots | 25.8 ns | 16.5 ns | 15.3 ns | 17.0 ns |
| read_namedtuple | 23.6 ns | 16.2 ns | 13.9 ns | 13.5 ns |
| read_boundmethod | 32.5 ns | 23.4 ns | 22.2 ns | 21.6 ns |
| Variable and attribute write access: | | | | |
| write_local | 6.4 ns | 0.2 ns | 0.1 ns | 0.1 ns |
| write_nonlocal | 6.8 ns | 0.2 ns | 0.1 ns | 0.1 ns |
| write_global | 22.2 ns | 13.2 ns | 13.7 ns | 13.0 ns |
| write_classvar | 114.2 ns | 103.2 ns | 113.9 ns | 94.7 ns |
| write_instancevar | 49.1 ns | 34.9 ns | 28.6 ns | 29.8 ns |
| write_instancevar_slots | 33.4 ns | 22.6 ns | 16.7 ns | 17.8 ns |
| Data structure read access: | | | | |
| read_list | 23.1 ns | 5.5 ns | 4.0 ns | 4.1 ns |
| read_deque | 24.0 ns | 5.7 ns | 4.3 ns | 4.4 ns |
| read_dict | 28.7 ns | 21.2 ns | 16.5 ns | 16.5 ns |
| read_strdict | 23.3 ns | 10.7 ns | 10.5 ns | 12.0 ns |
| Data structure write access: | | | | |
| write_list | 28.0 ns | 8.2 ns | 4.3 ns | 4.2 ns |
| write_deque | 29.5 ns | 8.2 ns | 6.3 ns | 6.4 ns |
| write_dict | 32.9 ns | 24.0 ns | 21.7 ns | 22.6 ns |
| write_strdict | 29.2 ns | 16.4 ns | 15.8 ns | 16.0 ns |
| Stack (or queue) operations: | | | | |
| list_append_pop | 63.6 ns | 67.9 ns | 20.6 ns | 20.5 ns |
| deque_append_pop | 56.0 ns | 81.5 ns | 159.3 ns | 46.0 ns |
| deque_append_popleft | 58.0 ns | 56.2 ns | 88.1 ns | 36.4 ns |
| Timing loop overhead: | | | | |
| loop_overhead | 0.4 ns | 0.2 ns | 0.1 ns | 0.2 ns |
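To put a few of these rows into relative terms, here is a small sketch that computes how much time the Cython columns save over the CPython baseline. The numbers are copied from the table; the helper name `percent_saved` is made up for this illustration, and reading "faster" as "less time per operation" is an assumption on my part.

```python
# Timings in nanoseconds, copied from the table above:
# (CPython 3.8-pre, Cython no opt, Cython pre, Cython tuned)
TIMINGS = {
    "read_classvar_from_class": (23.7, 16.1, 14.1, 14.1),
    "read_instancevar": (31.7, 22.3, 20.8, 22.0),
    "write_instancevar": (49.1, 34.9, 28.6, 29.8),
    "deque_append_pop": (56.0, 81.5, 159.3, 46.0),
    "deque_append_popleft": (58.0, 56.2, 88.1, 36.4),
}


def percent_saved(baseline_ns, measured_ns):
    """Time saved relative to the baseline, in percent (negative means slower)."""
    return (baseline_ns - measured_ns) / baseline_ns * 100.0


if __name__ == "__main__":
    for name, (cpython, _no_opt, pre, tuned) in TIMINGS.items():
        print(
            f"{name:26s} "
            f"pre: {percent_saved(cpython, pre):7.1f}%   "
            f"tuned: {percent_saved(cpython, tuned):7.1f}%"
        )
```

For the two deque rows, the value for the untuned column comes out negative (slower than CPython) and the value for the tuned column comes out positive.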
Some things that are worth noting:
- There is always a bit of variance across the runs, so don't get excited about a couple of percent difference.
- Read/write access to local variables is not reasonably measurable in Cython since it uses local/global C variables for them, and the C compiler discards any useless access to them. But don't worry, local variable access is really fast.
- Builtins (and module global variables in Py3.6+) are cached, which explains the "close to nothing" timings for them above.
- Even with several optimisations disabled, Cython code is still visibly faster than CPython.
- The write_classvar benchmark revealed a performance problem in CPython that is being worked on.
- The deque related benchmarks revealed performance problems in Cython that are now fixed, as you can see in the last column.
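If you want to try this kind of measurement yourself, here is a minimal sketch using the standard `timeit` module. It is not Raymond Hettinger's script: the function names merely mirror three rows of the table, the helper `ns_per_op` is hypothetical, and the loop counts are arbitrary. Absolute numbers will differ by machine and Python version.

```python
from collections import deque
from timeit import repeat

GLOBAL_VALUE = 1  # module-level global read by read_global()


def read_local(n=1000):
    value = 1
    for _ in range(n):
        # Five bare reads of a local variable per iteration.
        value; value; value; value; value


def read_global(n=1000):
    for _ in range(n):
        # Five bare reads of a module global per iteration.
        GLOBAL_VALUE; GLOBAL_VALUE; GLOBAL_VALUE; GLOBAL_VALUE; GLOBAL_VALUE


def deque_append_pop(n=1000):
    d = deque()
    for _ in range(n):
        # One append/pop pair per iteration (stack-like use).
        d.append(1)
        d.pop()


def ns_per_op(func, ops_per_call, number=200, rounds=5):
    """Best-of-`rounds` time per operation in nanoseconds (hypothetical helper)."""
    best = min(repeat(func, number=number, repeat=rounds))
    return best / (number * ops_per_call) * 1e9


if __name__ == "__main__":
    # ops_per_call = loop iterations * operations per iteration
    print(f"read_local        {ns_per_op(read_local, 5000):6.1f} ns")
    print(f"read_global       {ns_per_op(read_global, 5000):6.1f} ns")
    print(f"deque_append_pop  {ns_per_op(deque_append_pop, 1000):6.1f} ns")
```

Note that these figures include the loop overhead, which the table lists separately. Running the same functions as compiled Cython code needs an additional build step that is not shown here.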