People interested in mixing image processing, Python, and C code for high performance might also enjoy tinkering with a combination of Numpy and PyOpenCL. It gives you some powerful mechanisms to manipulate n-dimensional arrays and then offload some brute-force work to your GPU or multi-core CPU.
OpenCL is comparable to CUDA. It's essentially a C dialect with a lot of overlap with OpenGL's GLSL (the shading language), with intrinsics for certain SIMD operations.
You write your outer data-wrangling code in Python and put your little OpenCL kernels into the program as multi-line strings, which get compiled at runtime into appropriate parallel processing routines and dispatched by whichever OpenCL drivers you have installed. I've used my NVIDIA GPUs and Intel multi-core/SIMD CPUs to good effect.
This kind of parallel processing turns your mind inside-out a little and pushes you toward signal-processing techniques. You want to think in terms of cooperative algorithms you can build out of a large number of independent, localized operations rather than a single point of focus that sequentially wanders around a buffer.
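To give a flavor of what those embedded kernels look like, here is a minimal made-up example (not from any particular project): a per-pixel threshold where each work-item touches only its own pixel, rather than a loop wandering over the buffer. The host side would enqueue one work-item per pixel.

    /* Hypothetical kernel you'd embed as a multi-line string in the Python
       host code: the runtime launches a width x height grid of work-items,
       and each one reads and writes a single pixel independently. */
    __kernel void threshold(__global const float *src,
                            __global float *dst,
                            const float level,
                            const int width)
    {
        const int x = get_global_id(0);
        const int y = get_global_id(1);
        const int i = y * width + x;

        dst[i] = (src[i] > level) ? 1.0f : 0.0f;
    }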
Why was the stack implemented as a linked list? Could this be turned into a block-allocated array to improve cache locality (and get rid of an entire int)? 64 % 8 = 0, so you won't have any alignment issues. You'd also avoid calling free() on every loop iteration.
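Roughly what I have in mind, as a sketch with made-up names (the real element type and sizes would come from your code): one contiguous allocation up front, push/pop reduced to index arithmetic, and a single free() when the whole thing is torn down.

    #include <stdlib.h>

    /* Block-allocated stack: one contiguous buffer and no per-node
       allocation, so pushes touch adjacent memory and nothing calls
       free() inside the main loop. */
    typedef struct {
        int   *data;      /* element type is a stand-in */
        size_t top;       /* number of elements currently stored */
        size_t capacity;  /* fixed at creation time */
    } array_stack;

    static int stack_init(array_stack *s, size_t capacity)
    {
        s->data = malloc(capacity * sizeof *s->data);
        s->top = 0;
        s->capacity = capacity;
        return s->data ? 0 : -1;
    }

    static int stack_push(array_stack *s, int v)
    {
        if (s->top == s->capacity) return -1;  /* full */
        s->data[s->top++] = v;
        return 0;
    }

    static int stack_pop(array_stack *s)
    {
        return s->data[--s->top];  /* caller checks s->top > 0 first */
    }

    static void stack_free(array_stack *s)
    {
        free(s->data);  /* one free at teardown, not one per pop */
    }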
On a more general note, if you're using the stack to queue up subsequent computation, why not just opt for tail recursion, which will be optimized out?
Also, why are you using f2py rather than just writing a C module? [0]
By block-allocated array, do you mean something like a hybrid between a fixed-size array and a linked-list stack? I'm not very familiar with tail recursion, so I'll take that as a suggestion to read more on it. The answer to your last question is that f2py was easy enough to use and something I'm already familiar with :)
If you want to grow your stack dynamically you can use realloc to have the backing array grow/shrink, but I would recommend against that. Also note that this implementation is not ideal: you obtain a reference to the internal data in stack_pop, so if you push something new to your stack the data behind that pointer can change out from under you. stack_pop should be changed to stack_pop(s, &container_vector) and populate that instead, for consistency.
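Something like this sketch, with made-up type and field names (the real element type would be whatever your stack holds): push grows the buffer with realloc, which can move the whole block, and stack_pop copies the element out through a pointer argument instead of handing back a pointer into the array.

    #include <stdlib.h>

    typedef struct { float x, y, z; } vec3;  /* stand-in element type */

    typedef struct {
        vec3  *data;
        size_t top;
        size_t capacity;
    } stack;  /* start empty with: stack s = {0}; */

    /* Doubling growth: realloc may relocate the buffer, so any pointer
       previously obtained into data is invalid after a push. */
    static int stack_push(stack *s, vec3 v)
    {
        if (s->top == s->capacity) {
            size_t new_cap = s->capacity ? 2 * s->capacity : 64;
            vec3 *p = realloc(s->data, new_cap * sizeof *p);
            if (p == NULL) return -1;
            s->data = p;
            s->capacity = new_cap;
        }
        s->data[s->top++] = v;
        return 0;
    }

    /* Copy the popped element into *out rather than returning a pointer
       into the internal buffer that a later push could invalidate. */
    static int stack_pop(stack *s, vec3 *out)
    {
        if (s->top == 0) return -1;
        *out = s->data[--s->top];
        return 0;
    }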
The reason I didn't implement the stack with an array is that the array would have to be sized for the worst case, which is the number of voxels in the image volume. That is potentially much, much larger than the size the stack actually grows to.
In any case, I implemented the array stack along the lines of your post (with some modifications), and it yields some minor improvement (about 0.014 seconds less on average).