From the Trenches: The Lost Art of Software Profiling (Accelerating Proof-of-Concept to MVP)


This is the first in our ‘From the Trenches’ series—real engineering stories from the field, not conference keynotes. What worked, what didn’t, and how we solved it.

The Problem

We were building a medical imaging analysis prototype processing 3D volumetric data. A test dataset of 512×512×256 voxels took 47 minutes to process.

The team assumed Python was the problem and began planning a full rewrite in C++. Before committing to months of work, we decided to profile the code and identify the real bottleneck.

Step 1: Profiling with cProfile

python -m cProfile -o output.prof my_script.py

Profiling revealed that the majority of time was spent inside NumPy calls and data movement overhead.
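Once the `.prof` file exists, the standard-library `pstats` module can rank the hot spots. A minimal sketch of the same workflow done in-process (the `process` function is a hypothetical stand-in for the imaging pipeline):

```python
import cProfile
import io
import pstats

# Hypothetical stand-in for the slow pipeline; profile any entry point the same way.
def process():
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
process()
profiler.disable()

# Load the stats and print the five most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time surfaces the call chains where the wall clock actually goes, which is what pointed at NumPy calls and data movement here.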

Step 2: Line-by-Line Profiling

kernprof -l -v my_script.py

Line-level profiling showed a triple nested loop computing neighborhood means for every voxel in the dataset—tens of millions of iterations.
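The shape of the offending code looked roughly like this (a reconstructed sketch, not the original source; the radius-2 window matches the 5×5×5 kernel used later):

```python
import numpy as np

def neighborhood_means_naive(data, r=2):
    # Brute force: re-read the full (2r+1)^3 window at every interior voxel.
    out = np.zeros(data.shape, dtype=float)
    nx, ny, nz = data.shape
    for x in range(r, nx - r):
        for y in range(r, ny - r):
            for z in range(r, nz - r):
                out[x, y, z] = data[x - r:x + r + 1,
                                    y - r:y + r + 1,
                                    z - r:z + r + 1].mean()
    return out

rng = np.random.default_rng(0)
small = rng.random((12, 12, 12))   # tiny volume; the real dataset was 512x512x256
means = neighborhood_means_naive(small)
```

At pure-Python speed, three nested loops over a 512×512×256 volume means tens of millions of interpreter iterations, which is exactly what the line profiler flagged.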

Step 3: Understanding the Real Problem

The algorithm repeatedly recomputed overlapping neighborhoods. This redundant computation created billions of unnecessary operations.
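A back-of-the-envelope count shows the scale, using the dataset size from above and the 5×5×5 window:

```python
# Every voxel independently re-reads its full 5x5x5 neighborhood.
voxels = 512 * 512 * 256   # 67,108,864 voxels in the test volume
window = 5 ** 3            # 125 array reads per voxel
total_reads = voxels * window
print(f"{total_reads:,}")  # 8,388,608,000 reads: billions of redundant operations
```

Adjacent voxels share almost their entire window, so nearly all of those reads repeat work a neighboring voxel already did.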

Step 4: GPU Acceleration with CuPy

import cupy as cp

# Move the volume from host RAM to GPU memory once, then keep it there.
data_gpu = cp.asarray(data)
# Whole-array statistics now run on the GPU through the familiar NumPy-style API.
mean = cp.mean(data_gpu, axis=0)
std = cp.std(data_gpu, axis=0)

GPU acceleration reduced runtime to around 12 minutes—but the Python loops remained a bottleneck.

Step 5: Algorithm Optimization

from cupyx.scipy import ndimage

# A 5x5x5 box kernel with uniform weights (summing to 1) computes neighborhood means.
kernel = cp.ones((5, 5, 5)) / 125
# A single convolution pass replaces the triple nested loop over voxels.
local_mean = ndimage.convolve(data_gpu, kernel)

Replacing the brute-force neighborhood calculation with a convolution eliminated redundant computation.
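The same trick works on CPU with SciPy, which makes it easy to verify the idea before touching the GPU. A minimal sketch (the array shape and random data are illustrative):

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
data = rng.random((32, 32, 16))

# Box-kernel convolution: one vectorized pass computes every neighborhood mean.
kernel = np.ones((5, 5, 5)) / 125
local_mean = ndimage.convolve(data, kernel)

# uniform_filter computes the same result with a separable, even faster pass.
local_mean_fast = ndimage.uniform_filter(data, size=5)
print(np.allclose(local_mean, local_mean_fast))
```

Because `cupyx.scipy.ndimage` mirrors the SciPy API, the CPU version ports to the GPU by swapping the imports.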

Runtime dropped from 47 minutes to about 8 seconds.

Performance Results

Figure: Performance improvement with GPU acceleration in Python.

Lessons Learned

  • Profile before rewriting code.
  • Algorithm improvements often outperform language rewrites.
  • GPU acceleration helps only when workloads are parallelizable.
  • Leverage optimized scientific libraries.
  • Automate performance benchmarks.
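The last point can start as small as a `timeit` guard that fails loudly when a hot path regresses. A hedged sketch; `hot_path` and the time budget are placeholders for your own kernel and baseline:

```python
import timeit

def hot_path():
    # Placeholder for the real kernel under test.
    return sum(i * i for i in range(10_000))

# Take the best of several repeats to reduce scheduler and warm-up noise.
best = min(timeit.repeat(hot_path, number=20, repeat=3))
print(f"best of 3 runs: {best:.4f}s")
assert best < 5.0, "performance regression: hot_path exceeded its time budget"
```

Wired into CI, a check like this would have caught the 47-minute runtime long before anyone proposed a rewrite.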

From the Trenches Takeaway

The team initially planned a full rewrite in C++. Instead, two days of profiling and algorithm optimization produced a 340× speedup while keeping the flexibility of Python.

Before rewriting your system, measure the real bottleneck. The problem is rarely where you expect it.