Python is an excellent language for rapid prototyping and code development, but one thing I often hear people say about using it is that it’s slow to execute. This is a particular pain point for data scientists and ML engineers, as they often perform computationally intensive operations, such as matrix multiplication, gradient descent calculations or image processing.
Over time, Python has evolved internally to address some of these issues by introducing new features to the language, such as multi-threading, or by rewriting existing functionality for improved performance. However, Python’s Global Interpreter Lock (GIL) has often hamstrung efforts like this.
Many external libraries have also been written to bridge this perceived performance gap between Python and compiled languages such as Java. Perhaps the most used and well-known of these is the NumPy library. Implemented largely in C, NumPy was designed from the ground up for fast numerical and array processing, and it can also take advantage of multiple CPU cores through the optimised linear algebra libraries (such as BLAS/LAPACK) it links against.
There are alternatives to NumPy, and in a recent TDS article, I introduced the numexpr library, which, in many use cases, can even outperform NumPy. If you’re interested in learning more, I’ll include a link to that story at the end of this article.
Another external library that is very effective is Numba. Numba utilises a Just-in-Time (JIT) compiler for Python, which translates a subset of Python and NumPy code into fast machine code at runtime. It is designed to accelerate numerical and scientific computing tasks by leveraging LLVM (Low-Level Virtual Machine) compiler infrastructure.
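To make that concrete, here is a minimal sketch of Numba in action (my own example, not taken from this article’s benchmarks). The try/except fallback means the code still runs, just unaccelerated, if Numba isn’t installed:

```python
# A tight numerical loop: exactly the kind of code Numba's JIT compiles
# to machine code on first call. Falls back to plain Python if Numba
# is not available.
try:
    from numba import njit
except ImportError:
    def njit(func):  # no-op fallback decorator
        return func

@njit
def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

print(sum_of_squares(10))  # 285
```

The first call pays a one-off compilation cost; subsequent calls run at compiled speed.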
In this article, I would like to discuss another runtime-enhancing external library, Cython. It’s one of the most performant Python libraries but also one of the least understood and used. I think this is at least partially because you have to get your hands a little bit dirty and make some changes to your original code. But if you follow the simple four-step plan I’ll outline below, the performance benefits you can achieve will make it more than worthwhile.
What is Cython?
If you haven’t heard of Cython, it’s a superset of Python designed to provide C-like performance with code written mainly in Python. It allows for converting Python code into C code, which can then be compiled into shared libraries that can be imported into Python just like regular Python modules. This process results in the performance benefits of C while maintaining the readability of Python.
I’ll showcase the exact benefits you can achieve by converting your code to use Cython, examining three use cases and providing the four steps required to convert your existing Python code, along with comparative timings for each run.
Setting up a development environment
Before continuing, we should set up a separate development environment for coding to keep our project dependencies separate. I’ll be using WSL2 Ubuntu for Windows and a Jupyter Notebook for code development. I use the UV package manager to set up my development environment, but feel free to use whatever tools and methods suit you.
$ uv init cython-test
$ cd cython-test
$ uv venv
$ source .venv/bin/activate
(cython-test) $ uv pip install cython jupyter numpy pillow matplotlib
Now, type ‘jupyter notebook’ into your command prompt. A notebook should open in your browser. If that doesn’t happen automatically, you’ll likely see a screenful of information after running the command; near the bottom of it is a URL that you should copy and paste into your browser to launch the Jupyter Notebook.
Your URL will be different to mine, but it should look something like this:
http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69d
Example 1 – Speeding up for loops
Before we start using Cython, let’s begin with a regular Python function and time how long it takes to run. This will be our base benchmark.
We’ll code a simple double-for-loop function that takes a few seconds to run, then use Cython to speed it up and measure the differences in runtime between the two methods.
Here is our baseline standard Python code.
# sum_of_squares.py
import timeit

# Define the standard Python function
def slow_sum_of_squares(n):
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * i + j * j
    return total

# Benchmark the Python function
print("Python function execution time:")
print("timeit:", timeit.timeit(
    lambda: slow_sum_of_squares(20000),
    number=1))
On my system, the above code produces the following output.
Python function execution time:
13.135973724005453
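As an aside (this check is mine, not part of the article’s benchmark), this particular double loop has a closed-form solution, which is a handy way to verify that any optimised rewrite still returns the correct answer:

```python
# The double loop restated for self-containment, plus its closed form:
# sum over i,j < n of (i^2 + j^2) = 2n * sum_{k<n} k^2 = n^2 (n-1)(2n-1) / 3
def slow_sum_of_squares(n):
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * i + j * j
    return total

def closed_form(n):
    return n * n * (n - 1) * (2 * n - 1) // 3

print(slow_sum_of_squares(200) == closed_form(200))  # True
```

Comparing an optimised function against a known-correct reference like this is cheap insurance when you start adding C types.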
Let’s see how much of an improvement Cython makes to it.
The four-step plan for effective Cython use
Using Cython to boost your code’s runtime in a Jupyter Notebook is a simple four-step process.
Don’t worry if you’re not a Notebook user, as I’ll show how to convert regular Python .py files to use Cython later on.
1/ In the first cell of your notebook, load the Cython extension by typing this command.
%load_ext Cython
2/ For any subsequent cell containing Python code that you wish to run using Cython, add the %%cython magic command before the code. For example,
%%cython
def myfunction():
    # function body as normal
    ...
3/ Any parameters in your function definitions must be explicitly typed.
4/ Lastly, all variables must be typed appropriately using the cdef directive. Also, where it makes sense, use functions from the C standard library (available in Cython via the from libc.stdlib cimport directive).
Taking our original Python code as an example, here is what it looks like when it’s ready to run in a notebook using Cython, after applying all four steps above.
%%cython
def fast_sum_of_squares(int n):
    # Use a 64-bit integer: for n = 20000 the total overflows a 32-bit C int
    cdef long long total = 0
    cdef int i, j
    for i in range(n):
        for j in range(n):
            total += i * i + j * j
    return total

import timeit
print("Cython function execution time:")
print("timeit:", timeit.timeit(
    lambda: fast_sum_of_squares(20000),
    number=1))
As I hope you can see, converting your code is easier in practice than the four procedural steps might suggest. The runtime improvement is impressive, too. On my system, the new Cython code produces the following output.
Cython function execution time:
0.15829777799808653
That is an over 80x speed-up.
Example 2 — Calculate pi using Monte Carlo
For our second example, we’ll examine a more complex use case, the foundation of which has numerous real-world applications.
An area where Cython can show significant performance improvement is in numerical simulations, particularly those involving heavy computation, such as Monte Carlo (MC) simulations. Monte Carlo simulations involve running many iterations of a random process to estimate the properties of a system. MC applies to a wide variety of study fields, including climate and atmospheric science, computer graphics, AI search and quantitative finance. It is almost always a very computationally intensive process.
To illustrate, we’ll use Monte Carlo in a simplified manner to calculate the value of Pi. This is a well-known example where we take a square with a side length of one unit and inscribe a quarter circle inside it with a radius of one unit, as shown here.
The ratio of the area of the quarter circle to the area of the square is, obviously, (Pi/4).
So, if we consider many random (x,y) points that all lie within or on the bounds of the square, as the total number of these points tends to infinity, the ratio of points that lie on or inside the quarter circle to the total number of points tends towards Pi /4. We then multiply this value by 4 to obtain the value of Pi itself.
Here is some typical Python code you might use to model this.
import random
import time

def monte_carlo_pi(num_samples):
    inside_circle = 0
    for _ in range(num_samples):
        x = random.uniform(0, 1)
        y = random.uniform(0, 1)
        if x**2 + y**2 <= 1:
            inside_circle += 1
    return (inside_circle / num_samples) * 4

start_time = time.time()
pi_estimate = monte_carlo_pi(100_000_000)
print(f"Estimated Pi (Python): {pi_estimate}")
print(f"Execution Time (Python): {time.time() - start_time} seconds")
Running this produced the following timing result.
Estimated Pi (Python): 3.14197216
Execution Time (Python): 20.67279839515686 seconds
Now, here is the Cython implementation we get by following our four-step process.
%%cython
import cython
from libc.stdlib cimport rand, RAND_MAX

@cython.boundscheck(False)
@cython.wraparound(False)
def monte_carlo_pi(int num_samples):
    cdef int inside_circle = 0
    cdef int i
    cdef double x, y
    for i in range(num_samples):
        # Cast to double so the division is floating-point, not integer
        x = rand() / <double>RAND_MAX
        y = rand() / <double>RAND_MAX
        if x * x + y * y <= 1.0:
            inside_circle += 1
    return (inside_circle / <double>num_samples) * 4.0

import time
start_time = time.time()
pi_estimate = monte_carlo_pi(100_000_000)
print(f"Estimated Pi (Cython): {pi_estimate}")
print(f"Execution Time (Cython): {time.time() - start_time} seconds")
And here is the new output.
Estimated Pi (Cython): 3.1415012
Execution Time (Cython): 1.9987852573394775 seconds
Once again, that’s a pretty impressive 10x speed-up for the Cython version.
One thing we did in this code example that we didn’t do in the first is import functions from the C standard library. That was this line,
from libc.stdlib cimport rand, RAND_MAX
The cimport statement is a Cython keyword used to import C functions, variables, constants, and types at compile time. Here, we used it to bring in C’s rand() function and RAND_MAX constant as a much faster, C-level stand-in for Python’s random.uniform(0, 1). (Note that rand() is a lower-quality random number generator than Python’s, which is an acceptable trade-off for an estimate like this.)
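For comparison (my addition, not from the original benchmarks), NumPy alone can vectorise the same Monte Carlo estimate, which is worth timing against the Cython version on your own machine:

```python
import numpy as np

def monte_carlo_pi_numpy(num_samples, seed=42):
    # Draw all sample points in one shot instead of looping
    rng = np.random.default_rng(seed)
    x = rng.random(num_samples)
    y = rng.random(num_samples)
    inside = np.count_nonzero(x * x + y * y <= 1.0)
    return 4.0 * inside / num_samples

print(monte_carlo_pi_numpy(1_000_000))  # close to 3.14
```

The trade-off is memory: this allocates arrays for every sample at once, whereas the Cython loop uses constant memory.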
Example 3 — Image manipulation
For our final example, we’ll do some image manipulation. Specifically, some image convolution, which is a common operation in image processing. There are many use cases for image convolution. We’re going to use it to try to sharpen the slightly blurry image shown below.
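Before the code, it may help to see what the sharpening kernel actually does to a single pixel. This toy calculation (my own, using made-up greyscale values) applies the same 3x3 kernel used below, by hand:

```python
# The 3x3 sharpening kernel used in the examples below
kernel = [[0, -1, 0],
          [-1, 5, -1],
          [0, -1, 0]]

# A 3x3 neighbourhood of greyscale values, centre pixel = 100
patch = [[90, 95, 90],
         [95, 100, 95],
         [90, 95, 90]]

# Convolution at the centre pixel: element-wise multiply and sum
value = sum(kernel[i][j] * patch[i][j]
            for i in range(3) for j in range(3))
print(value)  # 120 -- the centre pixel is pushed away from its neighbours
```

Because the kernel weights sum to 1, flat regions are unchanged, while pixels that differ from their neighbours are exaggerated, which is what sharpening means.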
First, here is the regular Python code.
from PIL import Image
import numpy as np
from scipy.signal import convolve2d
import time
import matplotlib.pyplot as plt

def sharpen_image_color(image):
    # Start timing
    start_time = time.time()
    # Convert image to RGB in case it's not already
    image = image.convert('RGB')
    # Define a sharpening kernel
    kernel = np.array([[0, -1, 0],
                       [-1, 5, -1],
                       [0, -1, 0]])
    # Convert image to numpy array
    image_array = np.array(image)
    # Debugging: Check input values
    print("Input array values: Min =", image_array.min(), "Max =", image_array.max())
    # Prepare an empty array for the sharpened image
    sharpened_array = np.zeros_like(image_array)
    # Apply the convolution kernel to each channel (assuming RGB image)
    for i in range(3):
        channel = image_array[:, :, i]
        # Perform convolution
        convolved_channel = convolve2d(channel, kernel, mode='same', boundary='wrap')
        # Clip values to be in the range [0, 255]
        convolved_channel = np.clip(convolved_channel, 0, 255)
        # Store back in the sharpened array
        sharpened_array[:, :, i] = convolved_channel.astype(np.uint8)
    # Debugging: Check output values
    print("Sharpened array values: Min =", sharpened_array.min(), "Max =", sharpened_array.max())
    # Convert array back to image
    sharpened_image = Image.fromarray(sharpened_array)
    # End timing
    duration = time.time() - start_time
    print(f"Processing time: {duration:.4f} seconds")
    return sharpened_image

# Correct path for WSL2 accessing the Windows filesystem
image_path = '/mnt/d/images/taj_mahal.png'
image = Image.open(image_path)

# Sharpen the image
sharpened_image = sharpen_image_color(image)

if sharpened_image:
    # Display the original and sharpened images using Matplotlib
    fig, axs = plt.subplots(1, 2, figsize=(15, 7))
    # Original image
    axs[0].imshow(image)
    axs[0].set_title("Original Image")
    axs[0].axis('off')
    # Sharpened image
    axs[1].imshow(sharpened_image)
    axs[1].set_title("Sharpened Image")
    axs[1].axis('off')
    # Show both images side by side
    plt.show()
else:
    print("Failed to generate sharpened image.")
The output is this.
Input array values: Min = 0 Max = 255
Sharpened array values: Min = 0 Max = 255
Processing time: 0.1034 seconds
Let’s see if Cython can beat that run time of 0.1034 seconds.
%%cython
# cython: language_level=3
# distutils: define_macros=NPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION
import numpy as np
cimport numpy as np
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def sharpen_image_cython(np.ndarray[np.uint8_t, ndim=3] image_array):
    # Define sharpening kernel
    cdef int kernel[3][3]
    kernel[0][0] = 0
    kernel[0][1] = -1
    kernel[0][2] = 0
    kernel[1][0] = -1
    kernel[1][1] = 5
    kernel[1][2] = -1
    kernel[2][0] = 0
    kernel[2][1] = -1
    kernel[2][2] = 0
    # Declare variables outside of loops
    cdef int height = image_array.shape[0]
    cdef int width = image_array.shape[1]
    cdef int channel, i, j, ki, kj
    cdef int value
    # Prepare an empty array for the sharpened image
    cdef np.ndarray[np.uint8_t, ndim=3] sharpened_array = np.zeros_like(image_array)
    # Convolve each channel separately
    for channel in range(3):  # Iterate over RGB channels
        for i in range(1, height - 1):
            for j in range(1, width - 1):
                value = 0  # Reset value at each pixel
                # Apply the kernel
                for ki in range(-1, 2):
                    for kj in range(-1, 2):
                        value += kernel[ki + 1][kj + 1] * image_array[i + ki, j + kj, channel]
                # Clip values to be between 0 and 255
                sharpened_array[i, j, channel] = min(max(value, 0), 255)
    return sharpened_array

# Python part of the code
from PIL import Image
import numpy as np
import time as py_time  # Rename the Python time module to avoid conflict
import matplotlib.pyplot as plt

# Load the input image
image_path = '/mnt/d/images/taj_mahal.png'
image = Image.open(image_path).convert('RGB')

# Convert the image to a NumPy array
image_array = np.array(image)

# Time the sharpening with Cython
start_time = py_time.time()
sharpened_array = sharpen_image_cython(image_array)
cython_time = py_time.time() - start_time

# Convert back to an image for displaying
sharpened_image = Image.fromarray(sharpened_array)

# Display the original and sharpened image
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.imshow(image)
plt.title("Original Image")
plt.subplot(1, 2, 2)
plt.imshow(sharpened_image)
plt.title("Sharpened Image")
plt.show()

# Print the time taken for Cython processing
print(f"Processing time with Cython: {cython_time:.4f} seconds")
Both programs performed well and produced the same sharpened image, but the Cython version was nearly 25 times faster.
What about running Cython outside a Notebook environment?
So far, everything I’ve shown you assumes you’re running your code inside a Jupyter Notebook. The reason I did this is that it’s the easiest way to introduce Cython and get some code up and running quickly. While the Notebook environment is extremely popular among Python developers, a huge amount of Python code is still contained in regular .py files and run from a terminal using the Python command.
If that’s your primary mode of coding and running Python scripts, the %load_ext and %%cython IPython magic commands won’t work since those are only understood by Jupyter/IPython.
So, here’s how to adapt my four-step Cython conversion process if you’re running your code as a regular Python script.
Let’s take my first sum_of_squares example to showcase this.
1/ Create a .pyx file instead of using %%cython
Move your Cython-enhanced code into a file named, for example, sum_of_squares.pyx.
# sum_of_squares.pyx
def fast_sum_of_squares(int n):
    # Use a 64-bit integer: for n = 20000 the total overflows a 32-bit C int
    cdef long long total = 0
    cdef int i, j
    for i in range(n):
        for j in range(n):
            total += i * i + j * j
    return total
All we did was remove the %%cython directive and the timing code (which now lives in the calling script).
2/ Create a setup.py file to compile your .pyx file
# setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="cython-test",
    ext_modules=cythonize("sum_of_squares.pyx", language_level=3),
    py_modules=["sum_of_squares"],  # Explicitly state the module
    zip_safe=False,
)
3/ Run the setup.py file using this command,
$ python setup.py build_ext --inplace
running build_ext
copying build/lib.linux-x86_64-cpython-311/sum_of_squares.cpython-311-x86_64-linux-g
4/ Create a regular Python module to call our Cython code, as shown below, and then run it.
# main.py
import timeit
from sum_of_squares import fast_sum_of_squares

print("timeit:", timeit.timeit(
    lambda: fast_sum_of_squares(20000),
    number=1))
$ python main.py
timeit: 0.14675087109208107
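As an aside, for quick experiments you can skip the setup.py step entirely by using Cython’s pyximport module, which compiles .pyx files automatically on first import. A minimal sketch (it assumes sum_of_squares.pyx sits alongside the script, and that Cython and a C compiler are installed):

```python
# quick_run.py -- on-the-fly compilation via pyximport (sketch; requires
# Cython, a C compiler, and sum_of_squares.pyx in the same directory)
import pyximport
pyximport.install(language_level=3)

# The .pyx file is compiled transparently the first time it's imported
from sum_of_squares import fast_sum_of_squares
print(fast_sum_of_squares(100))
```

pyximport is convenient for development, but for distributing code the explicit setup.py build shown above remains the more robust approach.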
Summary
Hopefully, I’ve convinced you of the efficacy of using the Cython library in your code. Although it might seem a bit complicated at first sight, with a little effort, you can get incredible performance enhancements to your run times over using regular Python, even when using fast numerical libraries such as NumPy.
I provided a four-step process to convert your regular Python code to use Cython for running within Jupyter Notebook environments. Additionally, I explained the steps required to run Cython code from the command line outside a Notebook environment.
Finally, I reinforced the above by showcasing examples of converting regular Python code to use Cython.
In the three examples shown, we achieved speed-ups of 80x, 10x and 25x, which is not too shabby at all.
As promised, here is a link to my previous TDS article on utilising the numexpr library to accelerate Python code.