The most simple Difference of Gaussian image pyramid with OpenCL

06 Aug 2023 - tsp
Last update 06 Aug 2023
Reading time 22 mins

The following blog article summarizes a quick implementation of a Difference of Gaussian (DoG) pyramid as approximation of a Laplacian of a Gaussian pyramid. It does not utilize optimizations so performance (around 10ms for a simple 6 octave pyramid) is rather low. Of course all timings taken here are dependent on the CPU and GPU that has been used. To generate the graphs and timing shown in this blog article an AMD Radeon R9 360 has been used with a Xeon E3 host CPU on FreeBSD 13.

Theory

Image Pyramids are often used during machine vision. They usually are used to create multi-scale image representations by repeated application of downsampling. To reduce resolution of a picture one usually performs subsampling after applying (Gaussian) blur. The blurring step is necessary to prevent total loss of high frequency information (for example think about an binary image where every second pixel in every second row is set to white and every other to black where you downsample by a factor of two - then you only would get a single image. When applying blur it would be gray afterwards). This still forms kind of a low pass since one looses high frequency information inside the images - and it forms the different scale spaces. Depending on application one can also device to not perform blurring but applying the mean to build the downsampled version - how one does this depends entirely on the application (calculating a mean for 9 pixels for example would only require only 8 additions and one multiplication, applying an 3x3 Gaussian blur would require 9 multiplications and 8 additions so way more operations).

Inside the scale spaces one often performs bandpass filtering or searching for maxima and minima. This is usually done by calculating the derivative of the image. This can be approximated by applying a Gaussian blur multiple times and calculating differences between the resulting images:

[ f(\sigma_1; \sigma_2) = I * \frac{1}{\sigma_1 \sqrt{2 \pi}} e^{-\frac{x^2 - y^2}{2 \sigma_1^2}} - I * \frac{1}{\sigma_2 \sqrt{2 \pi}} e^{-\frac{x^2 - y^2}{2 \sigma_2^2}} ]

Even though one performs a convolution with the image this yields

[ f(\sigma_1; \sigma_2) = I * \left( \frac{1}{\sigma_1 \sqrt{2 \pi}} e^{-\frac{x^2 - y^2}{2 \sigma_1^2}} - \frac{1}{\sigma_2 \sqrt{2 \pi}} e^{-\frac{x^2 - y^2}{2 \sigma_2^2}} \right) ]

By building those differences from can approximate the Laplacian of a Gaussian well enough - this is an approach that’s often called Lowe’s Pyramid scheme:

Since one then only wants to perform scale space maxima one can show that it’s sufficient to calculate $3$ difference levels inside each octave - this can be verified by checking how many stable keypoints one can find when extending the number of scale spaces which has been done extensively in literature.

Implementing image and difference of Gaussian (DoG) pyramids in OpenCL

It has been shown previously how to implement convolution more efficient on the GPU than on the CPU. The convolution kernel had been pretty simple. Now the next step will be to implement the whole subsampling and blurring for the scale spaces as well as the difference of Gaussian pyramids on the GPU.

What’s necessary to build a DoG Pyramid?

OpenCL initialization

The first step is to initialize the OpenCL framework. Since I’m going to test the kernels in Python this code uses pyopencl. The very simple testbench simply fetches a list of all platforms and .their devices and then selects a predefined one (usedPlatform and usedDevice). Since this code is only thought as a testbench for the OpenCL code there is no external configuration option. A detailed description can be found in a previous blog article

The initialization creates or initializes:

import pyopencl as cl
import numpy as np
import matplotlib.pyplot as plt

from PIL import Image

from time import time

platforms = cl.get_platforms()
for idxPlatform, platform in enumerate(platforms):
    devices = platform.get_devices()
    print(f"Platform {platform.name}:")
    for idxDevice, dev in enumerate(devices):
        print(f"   {idxDevice}: {dev.name}")

usedPlatform = 0
usedDevice = 0

clPlatform = platforms[usedPlatform]
clDevice = clPlatform.get_devices()[usedDevice]
clContext = cl.Context([ clDevice ])
clQueue = cl.CommandQueue(clContext, clDevice)

Defining the grayscale kernel

The first step for most image processing algorithms is grayscaling. This of course reduces the amount of data and especially the complexity - but also reduces the dimensionality and amount of information in your data. The grayscale kernel will accept an RGBA image with 4 bytes per pixel, one byte per channel - the alpha channel will be ignored. It then grayscales with the factors specified in ITU-R BT.709 standard that’s also used in HDTV to compute the luma component. It then clamps the output value to the 0 to 255 range and writes it into a simple output array. For a quick test it’s assumed all images are RGBA images - in a real world scenario one will have to support a variety of input formats (at least RGB, RGBA and YUV frames as well as accepting already grayscaled images).

The ITU-R BT.709 formula assigns the luma component as:

$Y = 0.2126 * R + 0.7152 * G + 0.0722 * B$

This tries to represent the sensitivity of the human eye to different colors in some way. Since all pixels are independent this is a very simple kernel:

clkernel_Grayscale_Tpl = '''//CL//

#ifndef GRAYSCALE_FACTOR_R
    #define GRAYSCALE_FACTOR_R 0.2126
#endif
#ifndef GRAYSCALE_FACTOR_G
    #define GRAYSCALE_FACTOR_G 0.7152
#endif
#ifndef GRAYSCALE_FACTOR_B
    #define GRAYSCALE_FACTOR_B 0.0722
#endif

__kernel void grayscale_RGBA2Pyrabuffer(
    __global unsigned char* lpRGBAIn,
    __global unsigned char* lpPyraBufferOut
) {
    int y = get_global_id(0);
    int x = get_global_id(1);

    float gray_value = ((float)(lpRGBAIn[(y + x * IMAGEHEIGHT) * 4 + 0])) * GRAYSCALE_FACTOR_R + ((float)(lpRGBAIn[(y + x * IMAGEHEIGHT) * 4 + 1])) * GRAYSCALE_FACTOR_G + ((float)(lpRGBAIn[(y + x * IMAGEHEIGHT) * 4 + 2])) * GRAYSCALE_FACTOR_B;

    if (gray_value > 255) {
        gray_value = 255.0;
    }
    if (gray_value < 1) {
        gray_value = 0;
    }

    lpPyraBufferOut[(y + x * 2 * IMAGEHEIGHT)] = (unsigned char)gray_value;
}
'''

#
# clkernel_Grayscale = '''
# #define IMAGEHEIGHT {0}
# #define IMAGEWIDTH {1}
# '''.format(image_input.shape[1], image_input.shape[0])  + clkernel_Grayscale_Tpl
#
# clProg_Grayscale = cl.Program(clContext, clkernel_Grayscale).build()

Defining the (special) convolution kernel

This kernel is used to perform convolution with a Gaussian in horizontal and vertical direction only and performs simple repeating of border pixels - different modes will require a different implementation, for example using preprocessor definitions. It reads from the same buffer as it writes into and only accepts a source and target index in the horizontal and vertical direction (which is used to apply the convolution only to a window inside a buffer to allow all images to be stored side by side - in the same buffer object) as well as the selector for the horizontal or vertical convolution. The simple implementation of a convolution is described in a previous blog article.

In addition a short Python function is supplied that initializes the buffer containing the Gaussian kernel constants. Those are calculated on the host instead of being calculated on the GPU on the fly.

def gaussiankernel(width, sigma):
    if width % 2 != 1:
        raise ValueError("Require odd kernel size")

    r = np.empty((width,))

    for k in range(int(width/2) + 1):
        r[int(width/2) + k] = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(- (k*k) / (2 * sigma*sigma))
        r[int(width/2) - k] = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(- (k*k) / (2 * sigma*sigma))

    # Apply norm ...
    r = r / np.sum(r)

    return r

#imagekernel_GaussianBlurGrayscalePyramid = gaussiankernel(5, np.sqrt(2))
imagekernel_GaussianBlurGrayscalePyramid = gaussiankernel(7, np.sqrt(2))

clkernel_GaussianBlurGrayscalePyramid = '''//CL//
__kernel void convolveGrayscalePyramid(
    __global unsigned char* lpBuffer,
    __global unsigned char* lpTempBuffer,
    __constant float* lpKernel,

    int sourceX, int sourceY,
    int destX, destY
) {
    int x = get_global_id(1);
    int y = get_global_id(0);

    float v = 0.0;

    /* Read the pixel values and apply Gaussian weights */

    for(int i = -KERNELHALFWIDTH; i <= KERNELHALFWIDTH; i=i+1) {
        #if defined(CONVOLVE_HORIZONTAL)
            int realx = x + i;
            if(realx < 0) {
                realx = 0;
            } else if(realx >= IMAGEWIDTH) {
                realx = IMAGEWIDTH-1;
            }

            float kv =  lpKernel[i + KERNELHALFWIDTH];

            realx = realx + sourceX * IMAGEWIDTH;
            int realy = y + sourceY * IMAGEHEIGHT;

            v = v + lpBuffer[realy + realx * 2 * IMAGEHEIGHT] * kv;
        #else
            int realy = y + i;
            if(realy < 0) {
                realy = 0;
            } else if(realy >= IMAGEHEIGHT) {
                realy = IMAGEHEIGHT-1;
            }

            float kv =  lpKernel[i + KERNELHALFWIDTH];

            v = v + lpTempBuffer[realy + x * IMAGEHEIGHT] * kv;
        #endif
    }

    if(v > 255) {
        v = 255;
    } else if(v < 1) {
        v = 0;
    }

    #if defined(CONVOLVE_HORIZONTAL)
        lpTempBuffer[y + (x * IMAGEHEIGHT)] = (unsigned char)v;
    #else
        lpBuffer[(y + (destY * IMAGEHEIGHT)) + ((x + (destX * IMAGEWIDTH)) * 2 * IMAGEHEIGHT)] = (unsigned char)v;
    #endif
}
'''

# clkernel = '''
# #define KERNELHALFWIDTH {0}
# #define IMAGEHEIGHT {1}
# #define IMAGEWIDTH {2}
# #define CONVOLVE_HORIZONTAL 1
# '''.format(int(len(imagekernel)/2), image_input.shape[1], image_input.shape[0])  + clkernel

# clProgH = cl.Program(clContext, clkernel).build()

# clkernel = '''
# #define KERNELHALFWIDTH {0}
# #define IMAGEHEIGHT {1}
# #define IMAGEWIDTH {2}
# '''.format(int(len(imagekernel)/2), image_input.shape[1], image_input.shape[0])  + clkernel

# clProgV = cl.Program(clContext, clkernel).build()

Definition of the subsampling kernel

To decrease the resolution of an image it will be first blurred using a Gaussian (since this will also be done for the scale spaces anyways) - then one can perform simple subsampling. The subsampling kernel reduces the resolution of an image by a factor of two in each direction. It always reads from the third image in the source buffer and writes into the first image in the target buffer. It will simply get launched for half the number of input pixels - again every pixel is independent of each other so the kernel is very simple:

clkernel_SubsampleGrayscalePyramid = '''//CL//
__kernel void subsampleGrayscalePyramid(
    __global unsigned char* lpPyramidLevelIn,
    __global unsigned char* lpPyramidLevelOut
) {
    int x = get_global_id(1);
    int y = get_global_id(0);

    // We source from the third image ...
    float v = lpPyramidLevelIn[(y*2) + IMAGEHEIGHT + ((x*2) * 2 * IMAGEHEIGHT)];

    // And write into the first in the output ... not using IMAGEHEIGHT >> 2 since requiring * 2 again
    lpPyramidLevelOut[y + (x * (IMAGEHEIGHT))] = v;
}
'''

# clkernel = '''
# #define IMAGEHEIGHT {0}
# #define IMAGEWIDTH {1}
# '''.format(image_input.shape[1], image_input.shape[0])  + clkernel

# clProgSubsample = cl.Program(clContext, clkernel).build()

Definition of the difference kernel

The last step for a Difference of Gaussian (DoG) pyramid is the difference step. It will subtract all scale images in a single octave at once and write three different images at the same time. This might not be the wisest implementation due to caching issues - it’s just one of the most simple ones and enough for a first try. The kernel accepts the octaves buffer as well as the single difference map output buffer. In addition the source width and height is specified together with the target offset to locate the position of the fourth image of the previous octaves difference images.

# Definitions:
#  IMAGEHEIGHT_DIFFPYRAMID
#  IMAGEWIDTH_DIFFPYRAMID

#define IMAGEHEIGHT_DIFFPYRAMID {0}
#define IMAGEWIDTH_DIFFPYRAMID {1}

clkernel_DifferencePyramidDifference = '''//CL//
__kernel void grayscalePyramidDifference(
    __global unsigned char* lpPyramidLevelIn,
    __global unsigned char* lpDifferenceMapOut,

    int targetOffset_X, int targetOffset_Y,
    int sourceWidth, int sourceHeight
) {
    int x = get_global_id(1);
    int y = get_global_id(0);

    float v1 = lpPyramidLevelIn[y + (x * sourceHeight)];
    float v2 = lpPyramidLevelIn[y + (sourceHeight / 2) + (x * sourceHeight)];
    float v3 = lpPyramidLevelIn[y + ((x + sourceWidth / 2) * sourceHeight)];
    float v4 = lpPyramidLevelIn[y + sourceHeight / 2 + ((x + sourceWidth / 2) * sourceHeight)];

    lpDifferenceMapOut[(y + targetOffset_Y) +  (x + targetOffset_X) * IMAGEHEIGHT_DIFFPYRAMID] = v3 - v1;
    lpDifferenceMapOut[(y + targetOffset_Y + (sourceHeight >> 1)) +  (x + targetOffset_X) * IMAGEHEIGHT_DIFFPYRAMID] = v2 - v3;
    lpDifferenceMapOut[(y + targetOffset_Y) +  (x + targetOffset_X + (sourceWidth >> 1)) * IMAGEHEIGHT_DIFFPYRAMID] = v4 - v2;
}
'''

Quick tests

The following is just a quick run of all kernels after each other to check they really work as expected before assembling a whole pyramid calculation function

Quick test of the grayscale kernel

# Load image

image_input = np.asarray(Image.open("imagetest01.jpg").convert('RGBA'), dtype = np.uint8)

h, w, c = image_input.shape[0], image_input.shape[1], image_input.shape[2]
print(f"Source image has dimensions {h} x {w} x {c}")

# Compile our kernel
clkernel_Grayscale = '''
#define IMAGEHEIGHT {0}
#define IMAGEWIDTH {1}
'''.format(image_input.shape[1], image_input.shape[0])  + clkernel_Grayscale_Tpl

clProg_Grayscale = cl.Program(clContext, clkernel_Grayscale).build()

# Image input buffer

clBufferIn = cl.Buffer(clContext, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = image_input)

# Would require / 4 for the ARGB -> Grayscale and * 4 for the 4 images so the size stays the same ...
npBuffer_Scales = [
    np.empty(clBufferIn.get_info(cl.mem_info.SIZE), dtype=np.uint8),
    np.empty(int(clBufferIn.get_info(cl.mem_info.SIZE) / 4), dtype=np.uint8),
]
clBuffer_Scales = [
    cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npBuffer_Scales[0]),
    cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npBuffer_Scales[1])
]

krnlGray = clProg_Grayscale.grayscale_RGBA2Pyrabuffer

krnlGray(clQueue, (w, h,), None , clBufferIn, clBuffer_Scales[0])
cl.enqueue_copy(clQueue, npBuffer_Scales[0], clBuffer_Scales[0]) #, origin=(0,0), region=(w,h))

plt.imshow(npBuffer_Scales[0].reshape((h*2, w*2)))

# Release buffers

#for buf in clBuffer_Scales:
#    buf.release()

Simple test of the grayscale kernel

Quick test of Gaussian blur kernel

clkernel = '''
#define KERNELHALFWIDTH {0}
#define IMAGEHEIGHT {1}
#define IMAGEWIDTH {2}
#define CONVOLVE_HORIZONTAL 1
'''.format(int(len(imagekernel_GaussianBlurGrayscalePyramid)/2), image_input.shape[1], image_input.shape[0])  + clkernel_GaussianBlurGrayscalePyramid

clProgH = cl.Program(clContext, clkernel).build()

clkernel = '''
#define KERNELHALFWIDTH {0}
#define IMAGEHEIGHT {1}
#define IMAGEWIDTH {2}
'''.format(int(len(imagekernel_GaussianBlurGrayscalePyramid)/2), image_input.shape[1], image_input.shape[0])  + clkernel_GaussianBlurGrayscalePyramid

clProgV = cl.Program(clContext, clkernel).build()

krnlH = clProgH.convolveGrayscalePyramid
krnlV = clProgV.convolveGrayscalePyramid

npKernel = np.asarray(imagekernel_GaussianBlurGrayscalePyramid, dtype = np.float32)
clBufferKernel = cl.Buffer(clContext, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = npKernel)

npTempBuffer = np.empty((h, w), dtype = np.uint8)
clTempBuffer = cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npTempBuffer)

krnlH(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(0), np.int32(0), np.int32(0), np.int32(1))
krnlV(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(0), np.int32(0), np.int32(0), np.int32(1))

krnlH(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(0), np.int32(1), np.int32(1), np.int32(0))
krnlV(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(0), np.int32(1), np.int32(1), np.int32(0))

krnlH(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(1), np.int32(0), np.int32(1), np.int32(1))
krnlV(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(1), np.int32(0), np.int32(1), np.int32(1))

cl.enqueue_copy(clQueue, npBuffer_Scales[0], clBuffer_Scales[0]) #, origin=(0,0), region=(w,h))

plt.imshow(npBuffer_Scales[0].reshape((h*2, w*2)))

Simple test of the Gaussian blur kernel

Quick test of subsampling

clkernel = '''
#define IMAGEHEIGHT {0}
#define IMAGEWIDTH {1}
'''.format(image_input.shape[1], image_input.shape[0])  + clkernel_SubsampleGrayscalePyramid

clProgSubsample = cl.Program(clContext, clkernel).build()

krnlSubsample = clProgSubsample.subsampleGrayscalePyramid

krnlSubsample(clQueue, (int(w / 2), int(h / 2),), None, clBuffer_Scales[0], clBuffer_Scales[1])

cl.enqueue_copy(clQueue, npBuffer_Scales[1], clBuffer_Scales[1])

plt.imshow(npBuffer_Scales[1].reshape((h, w)))

Simple test of the subsampling kernel

#clBufferKernel.release()
for buf in clBuffer_Scales:
    buf.release()

Building a whole Pyramid framework

Since the kernels seem to work - now let’s build the whole framework for a pyramid and do some performance tests of this very simple implementation. It will just load the input image using Pillow, convert it into a native array using numpy and transfer it onto the GPU. Note that the data transfer to and from the GPU will not be counted in the performance evaluation. This is done since it’s assumed that the images are already available on the GPU when executing such an task and that there will be more processing being done on the GPU before transferring the images back.

The test program works very simple:

Then comes the execution phase. For each repeated iteration - which are done to get a more clean performance measurement:

The execution phase is times multiple times to get a rough feeling about the performance of the OpenCL implementation

nScales = 6
showDebugPlots = True

# Load image

#image_input = np.asarray(Image.open("imagetest01.jpg").convert('RGBA'), dtype = np.uint8)
image_input = np.asarray(Image.open("butterfly_external_testimage.jpg").convert('RGBA'), dtype = np.uint8)
h, w, c = image_input.shape[0], image_input.shape[1], image_input.shape[2]
print(f"Source image has dimensions {h} x {w} x {c}")

# Compile our kernel
clkernel_Grayscale = '''
#define IMAGEHEIGHT {0}
#define IMAGEWIDTH {1}
'''.format(image_input.shape[1], image_input.shape[0])  + clkernel_Grayscale_Tpl

clProg_Grayscale = cl.Program(clContext, clkernel_Grayscale).build()
krnlGray = clProg_Grayscale.grayscale_RGBA2Pyrabuffer

# Image input buffer

clBufferIn = cl.Buffer(clContext, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = image_input)

# Pyramid output buffer

npPyramidBuffer = np.empty((image_input.shape[1] * 2, image_input.shape[0] * 2), dtype = np.uint8)
clPyramidBuffer = cl.Buffer(clContext, cl.mem_flags.WRITE_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = npPyramidBuffer)

# Difference Kernel

clkernel_Difference = '''
#define IMAGEHEIGHT_DIFFPYRAMID {0}
#define IMAGEWIDTH_DIFFPYRAMID {1}
'''.format(int(image_input.shape[1] * 2), int(image_input.shape[0] * 2)) + clkernel_DifferencePyramidDifference
clProg_Difference = cl.Program(clContext, clkernel_Difference).build()
krnlDifference = clProg_Difference.grayscalePyramidDifference

# Would require / 4 for the ARGB -> Grayscale and * 4 for the 4 images so the size stays the same ...
curSize = clBufferIn.get_info(cl.mem_info.SIZE)
shapeX = image_input.shape[1]
shapeY = image_input.shape[0]

clBuffer_Scales = []
npBuffer_Scales = []
clProgH = []
clProgV = []
clKrnlH = []
clKrnlV = []
clProgSubsample = []
clKrnlSubsample = []
npTempBuffer = []
clTempBuffer = []

for iScale in range(nScales):
    print(f"Scale {iScale}: {shapeX} x {shapeY} -> {curSize}")

    npBuffer_Scales.append(np.empty(curSize, dtype=np.uint8))
    clBuffer_Scales.append(cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npBuffer_Scales[-1]))

    # Blurring kernels
    clkernel = '''
        #define KERNELHALFWIDTH {0}
        #define IMAGEHEIGHT {1}
        #define IMAGEWIDTH {2}
        #define CONVOLVE_HORIZONTAL 1
    '''.format(int(len(imagekernel_GaussianBlurGrayscalePyramid)/2), shapeX, shapeY)  + clkernel_GaussianBlurGrayscalePyramid
    clProgH.append(cl.Program(clContext, clkernel).build())

    clkernel = '''
        #define KERNELHALFWIDTH {0}
        #define IMAGEHEIGHT {1}
        #define IMAGEWIDTH {2}
    '''.format(int(len(imagekernel_GaussianBlurGrayscalePyramid)/2), shapeX, shapeY)  + clkernel_GaussianBlurGrayscalePyramid
    clProgV.append(cl.Program(clContext, clkernel).build())

    clKrnlH.append(clProgH[-1].convolveGrayscalePyramid)
    clKrnlV.append(clProgV[-1].convolveGrayscalePyramid)

    # Temporary buffers
    npTempBuffer.append(np.empty((shapeY, shapeX), dtype = np.uint8))
    clTempBuffer.append(cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npTempBuffer[-1]))

    # Subsampling kernels

    clkernel = '''
        #define IMAGEHEIGHT {0}
        #define IMAGEWIDTH {1}
    '''.format(shapeX, shapeY)  + clkernel_SubsampleGrayscalePyramid
    clProgSubsample.append(cl.Program(clContext, clkernel).build())
    clKrnlSubsample.append(clProgSubsample[-1].subsampleGrayscalePyramid)

    if shapeX % 2 == 0:
        shapeX = shapeX + 1
    if shapeY % 2 == 0:
        shapeY = shapeY + 1
    shapeX = int(shapeX / 2)
    shapeY = int(shapeY / 2)
    # curSize = int(curSize / 4)
    curSize = (shapeX * shapeY) * 4


# Now execute the kernels on a single input image ...

durations = []
showDebugPlots = False

targetOffset_X = 0
targetOffset_Y = 0

for iIteration in range(500):
    print(f"{iIteration} ... ", end = "")
    stime = time()

    # Grayscale kernel is only required once. Also grayscaling only runs once ...
    krnlGray(clQueue, (w, h,), None , clBufferIn, clBuffer_Scales[0])
    if showDebugPlots:
        cl.enqueue_copy(clQueue, npBuffer_Scales[0], clBuffer_Scales[0]) #, origin=(0,0), region=(w,h))
        plt.imshow(npBuffer_Scales[0].reshape((h*2, w*2)))
        plt.show()

    for iScale in range(nScales):
        clKrnlH[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(0), np.int32(0), np.int32(0), np.int32(1))
        clKrnlV[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(0), np.int32(0), np.int32(0), np.int32(1))

        clKrnlH[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(0), np.int32(1), np.int32(1), np.int32(0))
        clKrnlV[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(0), np.int32(1), np.int32(1), np.int32(0))

        clKrnlH[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(1), np.int32(0), np.int32(1), np.int32(1))
        clKrnlV[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(1), np.int32(0), np.int32(1), np.int32(1))

        cl.enqueue_copy(clQueue, npBuffer_Scales[iScale], clBuffer_Scales[iScale]) #, origin=(0,0), region=(w,h))

        if showDebugPlots:
            plt.imshow(npBuffer_Scales[iScale].reshape((h*2, w*2)))
            plt.show()

        if iScale != (nScales - 1):
            clKrnlSubsample[iScale](clQueue, (int(w / 2), int(h / 2),), None, clBuffer_Scales[iScale], clBuffer_Scales[iScale + 1])
        #    if showDebugPlots:
        #        cl.enqueue_copy(clQueue, npBuffer_Scales[iScale + 1], clBuffer_Scales[iScale + 1])
        #        plt.imshow(npBuffer_Scales[iScale + 1].reshape((h, w)))

        # Perform difference calculation
        krnlDifference(clQueue, (w, h,), None, clBuffer_Scales[iScale], clPyramidBuffer, np.int32(targetOffset_X), np.int32(targetOffset_Y), np.int32(2*h), np.int32(2*w))

        targetOffset_X = targetOffset_X + h
        targetOffset_Y = targetOffset_Y + w

        w = int(w / 2)
        h = int(h / 2)

    etime = time()
    durations.append((etime - stime) * 1000)

# Fetch LAST Pyramid into host buffer ...
cl.enqueue_copy(clQueue, npPyramidBuffer, clPyramidBuffer)
fig, ax = plt.subplots(figsize = (6.4 * 4, 4.8 * 4))
npPyramidBuffer = (((npPyramidBuffer - np.min(npPyramidBuffer)) / np.max(npPyramidBuffer)) * 255.0).astype(np.uint8)
ax.imshow(npPyramidBuffer.reshape((image_input.shape[0] * 2, image_input.shape[1] * 2)))
#plt.savefig("imagepyramids2_initialpyramid.png")
plt.show()

fig, ax = plt.subplots()
ax.set_xlabel("Iteration")
ax.set_ylabel("Duration [ms]")
ax.plot(durations)
ax.grid()
ax.set_title(f"DoG pyramid, {nScales} scales (average {np.mean(durations)} ms)")
#plt.savefig("imagepyramids2_initialpyramid_timing.png")
#plt.savefig("imagepyramids2_initialpyramid_timing.svg")
plt.show()

Result of the grayscale difference of Gaussian pyramid test

Timing of the kernel runs

In the end one has to clean up allocated resources again:

clBufferKernel.release()
clPyramidBuffer.release()
clBufferIn.release()

for i in range(nScales):
    clBuffer_Scales[i].release()
    clBuffer_Scales[i] = None
    clProgH[i] = None
    clProgV[i] = None
    clKrnlH[i] = None
    clKrnlV[i] = None
    clProgSubsample[i] = None
    clKrnlSubsample[i] = None
    clTempBuffer[i].release()
    clTempBuffer[i] = None

Conclusion

This blog article has demonstrated that it’s pretty simple to calculate Lowes image pyramid schema on the GPU using OpenCL. It’s by far not the most performant (I would argue maybe one of the worst) implementations but it can provide a base for optimizing from a working implementation by modifying one kernel after each other. This blog article shows a testbed in which one can play around with image pyramids as well as the optimization of kernels in a very simple way. All kernels have been written in a way to be as readable as possible.

References

This article is tagged: Programming, Math, Basics, Python, Data Mining, Computer Vision, Machine learning, Gaussian


Data protection policy

Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)

This webpage is also available via TOR at http://rh6v563nt2dnxd5h2vhhqkudmyvjaevgiv77c62xflas52d5omtkxuid.onion/

Valid HTML 4.01 Strict Powered by FreeBSD IPv6 support