People interested in mixing image processing, Python, and C code for high performance might also enjoy tinkering with a combination of Numpy and PyOpenCL. It gives you some powerful mechanisms to manipulate n-dimensional arrays and then offload some brute-force work to your GPU or multi-core CPU.
OpenCL is comparable to CUDA. It's essentially a C dialect with a lot of overlap with OpenGL's GLSL (the shading language), with intrinsics for certain SIMD operations.
You write your outer data-wrangling code in Python and put your little OpenCL kernels into the program as multi-line strings, which get compiled at runtime into appropriate parallel processing routines and dispatched by whichever OpenCL drivers you have installed. I've used my NVIDIA GPUs and Intel multi-core/SIMD CPUs to good effect.
This kind of parallel processing turns your mind inside-out a little and pushes you toward signal-processing techniques. You want to think in terms of cooperative algorithms you can build out of a large number of independent, localized operations rather than a single point of focus that sequentially wanders around a buffer.
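To give a flavor of what those embedded kernels look like, here is a minimal made-up example (not from any particular project): a per-pixel threshold where each work-item touches only its own pixel, rather than a loop wandering over the buffer. The host side would enqueue one work-item per pixel.

    /* Hypothetical kernel you'd embed as a multi-line string in the Python
       host code: the runtime launches a width x height grid of work-items,
       and each one reads and writes a single pixel independently. */
    __kernel void threshold(__global const float *src,
                            __global float *dst,
                            const float level,
                            const int width)
    {
        const int x = get_global_id(0);
        const int y = get_global_id(1);
        const int i = y * width + x;

        dst[i] = (src[i] > level) ? 1.0f : 0.0f;
    }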
Why was the stack implemented as a linked list? Could this be turned into a block-allocated array to improve cache locality (and get rid of an entire int)? 64 % 8 = 0, so you won't have any alignment issues. You'd also avoid calling free() on every loop iteration.
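Roughly what I have in mind, as a sketch with made-up names (the real element type and sizes would come from your code): one contiguous allocation up front, push/pop reduced to index arithmetic, and a single free() when the whole thing is torn down.

    #include <stdlib.h>

    /* Block-allocated stack: one contiguous buffer and no per-node
       allocation, so pushes touch adjacent memory and nothing calls
       free() inside the main loop. */
    typedef struct {
        int   *data;      /* element type is a stand-in */
        size_t top;       /* number of elements currently stored */
        size_t capacity;  /* fixed at creation time */
    } array_stack;

    static int stack_init(array_stack *s, size_t capacity)
    {
        s->data = malloc(capacity * sizeof *s->data);
        s->top = 0;
        s->capacity = capacity;
        return s->data ? 0 : -1;
    }

    static int stack_push(array_stack *s, int v)
    {
        if (s->top == s->capacity) return -1;  /* full */
        s->data[s->top++] = v;
        return 0;
    }

    static int stack_pop(array_stack *s)
    {
        return s->data[--s->top];  /* caller checks s->top > 0 first */
    }

    static void stack_free(array_stack *s)
    {
        free(s->data);  /* one free at teardown, not one per pop */
    }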
On a more general note, if you're using the stack to queue up subsequent computation, why not just opt for tail recursion, which will be optimized out?
Also, why are you using f2py rather than just writing a C module? [0]
By block-allocated array, do you mean something like a hybrid between a fixed-size array and a linked-list stack? I'm not very familiar with tail recursion, so I'll take that as a suggestion to read more on it. The answer to your last question is that f2py was easy enough to use and something I'm already familiar with :)
If you want to grow your stack dynamically you can use realloc to have the backing array grow/shrink, but I would recommend against that. Also note that this implementation is not ideal: you obtain a reference to the internal data in stack_pop, so if you push something new to your stack the data behind that pointer can change out from under you. stack_pop should be changed to stack_pop(s, &container_vector) and populate that instead, for consistency.
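Something like this sketch, with made-up type and field names (the real element type would be whatever your stack holds): push grows the buffer with realloc, which can move the whole block, and stack_pop copies the element out through a pointer argument instead of handing back a pointer into the array.

    #include <stdlib.h>

    typedef struct { float x, y, z; } vec3;  /* stand-in element type */

    typedef struct {
        vec3  *data;
        size_t top;
        size_t capacity;
    } stack;  /* start empty with: stack s = {0}; */

    /* Doubling growth: realloc may relocate the buffer, so any pointer
       previously obtained into data is invalid after a push. */
    static int stack_push(stack *s, vec3 v)
    {
        if (s->top == s->capacity) {
            size_t new_cap = s->capacity ? 2 * s->capacity : 64;
            vec3 *p = realloc(s->data, new_cap * sizeof *p);
            if (p == NULL) return -1;
            s->data = p;
            s->capacity = new_cap;
        }
        s->data[s->top++] = v;
        return 0;
    }

    /* Copy the popped element into *out rather than returning a pointer
       into the internal buffer that a later push could invalidate. */
    static int stack_pop(stack *s, vec3 *out)
    {
        if (s->top == 0) return -1;
        *out = s->data[--s->top];
        return 0;
    }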
The reason I didn't implement the stack with an array is that the array would have to be sized for the worst case, which is the number of voxels in the image volume. That is potentially much, much larger than the size the stack actually grows to.
In any case, I implemented the array stack along the lines of your post (with some modifications), and it yields some minor improvement (about 0.014 seconds less on average).