This is the first in our ‘From the Trenches’ series—real engineering stories from the field, not conference keynotes. What worked, what didn’t, and how we solved it.
The Problem
We were building a medical imaging analysis prototype processing 3D volumetric data. A test dataset of 512×512×256 voxels took 47 minutes to process.
The team assumed Python was the problem and began planning a full rewrite in C++. Before committing to months of work, we decided to profile the code and identify the real bottleneck.
Step 1: Profiling with cProfile
python -m cProfile -o output.prof my_script.py
The profile showed that most of the runtime went to NumPy calls and data-movement overhead.
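A saved profile can also be read from Python with the standard-library pstats module. The snippet below is a minimal sketch rather than the script we actually profiled: slow_sum and profile_to_text are hypothetical names, and the workload is a stand-in so the example runs on its own.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical stand-in workload: a deliberately slow Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_to_text(func, *args, top=10):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.strip_dirs().sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_to_text(slow_sum, 100_000)
print(report)

# For a profile saved with `-o output.prof`, the equivalent is:
# pstats.Stats("output.prof").strip_dirs().sort_stats("cumulative").print_stats(10)
```

Sorting by cumulative time puts the functions that own the most runtime, including everything they call, at the top of the report.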
Step 2: Line-by-Line Profiling
kernprof -l -v my_script.py
Line-level profiling showed a triple nested loop computing neighborhood means for every voxel in the dataset—tens of millions of iterations.
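Note that kernprof -l only reports on functions marked with the @profile decorator. The original loop is not reproduced in this post, so the following is a hypothetical reconstruction of what such a triple nested neighborhood mean could look like. The function name, the radius parameter, and the tiny random test volume are all assumptions.

```python
import numpy as np

try:
    profile  # injected into builtins when run under `kernprof -l`
except NameError:
    def profile(func):
        """No-op fallback so the script also runs without kernprof."""
        return func


@profile
def local_mean_bruteforce(volume, radius=2):
    """Mean of the (2*radius+1)^3 neighborhood around every interior voxel."""
    nz, ny, nx = volume.shape
    out = np.zeros(volume.shape, dtype=np.float64)
    for z in range(radius, nz - radius):
        for y in range(radius, ny - radius):
            for x in range(radius, nx - radius):
                block = volume[z - radius:z + radius + 1,
                               y - radius:y + radius + 1,
                               x - radius:x + radius + 1]
                out[z, y, x] = block.mean()
    return out


rng = np.random.default_rng(0)
small = rng.random((12, 12, 12))  # tiny volume; the real dataset is far larger
result = local_mean_bruteforce(small)
print(result[6, 6, 6])
```

Run under kernprof, a function like this would show nearly all of its time on the slicing and mean lines inside the innermost loop.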
Step 3: Understanding the Real Problem
The algorithm repeatedly recomputed overlapping neighborhoods. Adjacent voxels share most of their neighbors, so the same values were read and summed again and again, adding up to billions of unnecessary operations.
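A back-of-the-envelope count makes the scale concrete. This sketch assumes a 5×5×5 neighborhood, matching the kernel size used in Step 5; the window size of the original loop is otherwise an assumption.

```python
# Dataset dimensions from the test volume described above.
nx, ny, nz = 512, 512, 256
voxels = nx * ny * nz

# Assumed 5x5x5 neighborhood, matching the kernel used in Step 5.
window = 5
reads_per_voxel = window ** 3

# Brute force: every voxel re-reads its full neighborhood.
bruteforce_reads = voxels * reads_per_voxel

# Two windows one step apart along an axis share all but one 5x5 slice.
shared_fraction = (window - 1) / window

print(f"voxels:            {voxels:,}")            # 67,108,864
print(f"brute-force reads: {bruteforce_reads:,}")  # 8,388,608,000
print(f"overlap between neighboring windows: {shared_fraction:.0%}")  # 80%
```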
Step 4: GPU Acceleration with CuPy
import cupy as cp
data_gpu = cp.asarray(data)
mean = cp.mean(data_gpu, axis=0)
std = cp.std(data_gpu, axis=0)
GPU acceleration reduced runtime to around 12 minutes—but the Python loops remained a bottleneck.
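One rough way to see why loops can stay expensive after the arrays move to the GPU: every library call made from inside a Python loop pays interpreter overhead and, on a GPU, a kernel launch. The sketch below compares one call per row with a single vectorized call. It is an illustration under assumptions, not our pipeline: the array shape and function names are made up, and it falls back to NumPy when CuPy or a GPU is unavailable.

```python
import time

import numpy as np

try:
    import cupy as xp
    xp.cuda.runtime.getDeviceCount()  # raises if no usable GPU is present
    ON_GPU = True
except Exception:
    xp = np  # fallback so the sketch still runs without CuPy or a GPU
    ON_GPU = False


def sync():
    """Wait for queued GPU work so wall-clock timings are meaningful."""
    if ON_GPU:
        xp.cuda.Stream.null.synchronize()


def timed(func, *args):
    """Return (result, elapsed seconds) for one call of func."""
    sync()
    start = time.perf_counter()
    result = func(*args)
    sync()
    return result, time.perf_counter() - start


def looped_means(arr):
    """One small library call per row: per-call overhead dominates."""
    return xp.stack([arr[i].mean() for i in range(arr.shape[0])])


def vectorized_means(arr):
    """A single call that processes every row at once."""
    return arr.mean(axis=1)


data = xp.asarray(np.random.default_rng(0).random((20_000, 32)))
looped, t_loop = timed(looped_means, data)
vectorized, t_vec = timed(vectorized_means, data)
backend = "cupy" if ON_GPU else "numpy"
print(f"backend={backend} loop={t_loop:.4f}s vectorized={t_vec:.4f}s")
```

Both functions return the same values; only the number of calls differs. The synchronize step matters when timing CuPy, because GPU work is queued asynchronously and a timer can otherwise stop before the work finishes.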
Step 5: Algorithm Optimization
from cupyx.scipy import ndimage
kernel = cp.ones((5,5,5)) / 125
local_mean = ndimage.convolve(data_gpu, kernel)
Replacing the brute-force neighborhood calculation with a convolution eliminated redundant computation.
Runtime dropped from 47 minutes to about 8 seconds.
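A quick sanity check that the convolution computes the same thing as the brute-force loop. This sketch uses scipy.ndimage on the CPU as a stand-in for cupyx.scipy.ndimage, which mirrors its interface. The 16×16×16 random volume is an assumption for illustration, and only interior voxels are compared because edge values depend on the boundary mode.

```python
import numpy as np
from scipy import ndimage  # CPU stand-in for cupyx.scipy.ndimage

rng = np.random.default_rng(0)
volume = rng.random((16, 16, 16))

# Uniform 5x5x5 kernel: each output voxel is the mean of its 125 neighbors.
kernel = np.ones((5, 5, 5)) / 125
fast = ndimage.convolve(volume, kernel)  # default boundary mode is "reflect"

# Brute-force reference, computed on interior voxels only.
r = 2
slow = np.zeros_like(volume)
for z in range(r, 16 - r):
    for y in range(r, 16 - r):
        for x in range(r, 16 - r):
            slow[z, y, x] = volume[z - r:z + r + 1,
                                   y - r:y + r + 1,
                                   x - r:x + r + 1].mean()

interior = (slice(r, -r),) * 3
max_diff = np.abs(fast[interior] - slow[interior]).max()
print(f"max interior difference: {max_diff:.2e}")
```

The two results agree to floating-point precision on the interior, so the speedup comes from how the work is organized, not from computing something different.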
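Performance Results
- Original implementation (nested Python loops): 47 minutes
- CuPy with the Python loops still in place: about 12 minutes
- Convolution with cupyx.scipy.ndimage on the GPU: about 8 seconds
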
Lessons Learned
- Profile before rewriting code.
- Algorithm improvements often outperform language rewrites.
- GPU acceleration helps only when workloads are parallelizable.
- Leverage optimized scientific libraries.
- Automate performance benchmarks.
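As one way to act on the last point, a benchmark can be a small script that fails when a run exceeds a time budget. This is a minimal sketch using only the standard library; best_of, check_budget, the workload, and the 2-second budget are all hypothetical choices.

```python
import time


def best_of(func, *args, repeats=5):
    """Return the fastest wall-clock time over several runs, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


def check_budget(func, *args, budget_s, repeats=5):
    """Raise AssertionError if the best run exceeds the time budget."""
    elapsed = best_of(func, *args, repeats=repeats)
    assert elapsed <= budget_s, (
        f"{func.__name__}: {elapsed:.3f}s > {budget_s:.3f}s budget"
    )
    return elapsed


def workload(n):
    """Hypothetical stand-in for the real processing pipeline."""
    return sum(i * i for i in range(n))


elapsed = check_budget(workload, 50_000, budget_s=2.0)
print(f"workload best time: {elapsed:.4f}s")
```

Taking the best of several runs reduces noise from other processes, which keeps a check like this from failing spuriously in CI.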
From the Trenches Takeaway
The team initially planned a full rewrite in C++. Instead, two days of profiling and algorithm optimization produced a roughly 340× speedup, from 47 minutes to about 8 seconds, while keeping the flexibility of Python.
Before rewriting your system, measure the real bottleneck. The problem is rarely where you expect it.
