06 Aug 2023 - tsp
Last update 06 Aug 2023
22 mins
The following blog article summarizes a quick implementation of a Difference of Gaussian (DoG) pyramid as approximation of a Laplacian of a Gaussian pyramid. It does not utilize optimizations so performance (around 10ms for a simple 6 octave pyramid) is rather low. Of course all timings taken here are dependent on the CPU and GPU that has been used. To generate the graphs and timing shown in this blog article an AMD Radeon R9 360 has been used with a Xeon E3 host CPU on FreeBSD 13.
Image Pyramids are often used during machine vision. They usually are used to create multi-scale image representations by repeated application of downsampling. To reduce resolution of a picture one usually performs subsampling after applying (Gaussian) blur. The blurring step is necessary to prevent total loss of high frequency information (for example think about an binary image where every second pixel in every second row is set to white and every other to black where you downsample by a factor of two - then you only would get a single image. When applying blur it would be gray afterwards). This still forms kind of a low pass since one looses high frequency information inside the images - and it forms the different scale spaces. Depending on application one can also device to not perform blurring but applying the mean to build the downsampled version - how one does this depends entirely on the application (calculating a mean for 9 pixels for example would only require only 8 additions and one multiplication, applying an 3x3 Gaussian blur would require 9 multiplications and 8 additions so way more operations).
Inside the scale spaces one often performs bandpass filtering or searching for maxima and minima. This is usually done by calculating the derivative of the image. This can be approximated by applying a Gaussian blur multiple times and calculating differences between the resulting images:
[ f(\sigma_1; \sigma_2) = I * \frac{1}{\sigma_1 \sqrt{2 \pi}} e^{-\frac{x^2 - y^2}{2 \sigma_1^2}} - I * \frac{1}{\sigma_2 \sqrt{2 \pi}} e^{-\frac{x^2 - y^2}{2 \sigma_2^2}} ]Even though one performs a convolution with the image this yields
[ f(\sigma_1; \sigma_2) = I * \left( \frac{1}{\sigma_1 \sqrt{2 \pi}} e^{-\frac{x^2 - y^2}{2 \sigma_1^2}} - \frac{1}{\sigma_2 \sqrt{2 \pi}} e^{-\frac{x^2 - y^2}{2 \sigma_2^2}} \right) ]By building those differences from can approximate the Laplacian of a Gaussian well enough - this is an approach that’s often called Lowe’s Pyramid scheme:
Since one then only wants to perform scale space maxima one can show that it’s sufficient to calculate $3$ difference levels inside each octave - this can be verified by checking how many stable keypoints one can find when extending the number of scale spaces which has been done extensively in literature.
It has been shown previously how to implement convolution more efficient on the GPU than on the CPU. The convolution kernel had been pretty simple. Now the next step will be to implement the whole subsampling and blurring for the scale spaces as well as the difference of Gaussian pyramids on the GPU.
What’s necessary to build a DoG Pyramid?
The first step is to initialize the OpenCL framework. Since I’m going to test the
kernels in Python this code uses pyopencl
. The very simple testbench simply
fetches a list of all platforms and .their devices and then selects a predefined
one (usedPlatform
and usedDevice
). Since this code is only thought
as a testbench for the OpenCL code there is no external configuration option.
A detailed description can be found in a previous blog article
The initialization creates or initializes:
import pyopencl as cl
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from time import time
platforms = cl.get_platforms()
for idxPlatform, platform in enumerate(platforms):
devices = platform.get_devices()
print(f"Platform {platform.name}:")
for idxDevice, dev in enumerate(devices):
print(f" {idxDevice}: {dev.name}")
usedPlatform = 0
usedDevice = 0
clPlatform = platforms[usedPlatform]
clDevice = clPlatform.get_devices()[usedDevice]
clContext = cl.Context([ clDevice ])
clQueue = cl.CommandQueue(clContext, clDevice)
The first step for most image processing algorithms is grayscaling. This of course
reduces the amount of data and especially the complexity - but also reduces the
dimensionality and amount of information in your data. The grayscale kernel will
accept an RGBA image with 4 bytes per pixel, one byte per channel - the alpha
channel will be ignored. It then grayscales with the factors specified
in ITU-R BT.709
standard that’s also used in HDTV to compute the luma
component. It then clamps the output value to the 0 to 255 range and writes it
into a simple output array. For a quick test it’s assumed all images are RGBA
images - in a real world scenario one will have to support a variety of input
formats (at least RGB, RGBA and YUV frames as well as accepting already
grayscaled images).
The ITU-R BT.709
formula assigns the luma component as:
$Y = 0.2126 * R + 0.7152 * G + 0.0722 * B$
This tries to represent the sensitivity of the human eye to different colors in some way. Since all pixels are independent this is a very simple kernel:
clkernel_Grayscale_Tpl = '''//CL//
#ifndef GRAYSCALE_FACTOR_R
#define GRAYSCALE_FACTOR_R 0.2126
#endif
#ifndef GRAYSCALE_FACTOR_G
#define GRAYSCALE_FACTOR_G 0.7152
#endif
#ifndef GRAYSCALE_FACTOR_B
#define GRAYSCALE_FACTOR_B 0.0722
#endif
__kernel void grayscale_RGBA2Pyrabuffer(
__global unsigned char* lpRGBAIn,
__global unsigned char* lpPyraBufferOut
) {
int y = get_global_id(0);
int x = get_global_id(1);
float gray_value = ((float)(lpRGBAIn[(y + x * IMAGEHEIGHT) * 4 + 0])) * GRAYSCALE_FACTOR_R + ((float)(lpRGBAIn[(y + x * IMAGEHEIGHT) * 4 + 1])) * GRAYSCALE_FACTOR_G + ((float)(lpRGBAIn[(y + x * IMAGEHEIGHT) * 4 + 2])) * GRAYSCALE_FACTOR_B;
if (gray_value > 255) {
gray_value = 255.0;
}
if (gray_value < 1) {
gray_value = 0;
}
lpPyraBufferOut[(y + x * 2 * IMAGEHEIGHT)] = (unsigned char)gray_value;
}
'''
#
# clkernel_Grayscale = '''
# #define IMAGEHEIGHT {0}
# #define IMAGEWIDTH {1}
# '''.format(image_input.shape[1], image_input.shape[0]) + clkernel_Grayscale_Tpl
#
# clProg_Grayscale = cl.Program(clContext, clkernel_Grayscale).build()
This kernel is used to perform convolution with a Gaussian in horizontal and vertical direction only and performs simple repeating of border pixels - different modes will require a different implementation, for example using preprocessor definitions. It reads from the same buffer as it writes into and only accepts a source and target index in the horizontal and vertical direction (which is used to apply the convolution only to a window inside a buffer to allow all images to be stored side by side - in the same buffer object) as well as the selector for the horizontal or vertical convolution. The simple implementation of a convolution is described in a previous blog article.
In addition a short Python function is supplied that initializes the buffer containing the Gaussian kernel constants. Those are calculated on the host instead of being calculated on the GPU on the fly.
def gaussiankernel(width, sigma):
if width % 2 != 1:
raise ValueError("Require odd kernel size")
r = np.empty((width,))
for k in range(int(width/2) + 1):
r[int(width/2) + k] = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(- (k*k) / (2 * sigma*sigma))
r[int(width/2) - k] = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(- (k*k) / (2 * sigma*sigma))
# Apply norm ...
r = r / np.sum(r)
return r
#imagekernel_GaussianBlurGrayscalePyramid = gaussiankernel(5, np.sqrt(2))
imagekernel_GaussianBlurGrayscalePyramid = gaussiankernel(7, np.sqrt(2))
clkernel_GaussianBlurGrayscalePyramid = '''//CL//
__kernel void convolveGrayscalePyramid(
__global unsigned char* lpBuffer,
__global unsigned char* lpTempBuffer,
__constant float* lpKernel,
int sourceX, int sourceY,
int destX, destY
) {
int x = get_global_id(1);
int y = get_global_id(0);
float v = 0.0;
/* Read the pixel values and apply Gaussian weights */
for(int i = -KERNELHALFWIDTH; i <= KERNELHALFWIDTH; i=i+1) {
#if defined(CONVOLVE_HORIZONTAL)
int realx = x + i;
if(realx < 0) {
realx = 0;
} else if(realx >= IMAGEWIDTH) {
realx = IMAGEWIDTH-1;
}
float kv = lpKernel[i + KERNELHALFWIDTH];
realx = realx + sourceX * IMAGEWIDTH;
int realy = y + sourceY * IMAGEHEIGHT;
v = v + lpBuffer[realy + realx * 2 * IMAGEHEIGHT] * kv;
#else
int realy = y + i;
if(realy < 0) {
realy = 0;
} else if(realy >= IMAGEHEIGHT) {
realy = IMAGEHEIGHT-1;
}
float kv = lpKernel[i + KERNELHALFWIDTH];
v = v + lpTempBuffer[realy + x * IMAGEHEIGHT] * kv;
#endif
}
if(v > 255) {
v = 255;
} else if(v < 1) {
v = 0;
}
#if defined(CONVOLVE_HORIZONTAL)
lpTempBuffer[y + (x * IMAGEHEIGHT)] = (unsigned char)v;
#else
lpBuffer[(y + (destY * IMAGEHEIGHT)) + ((x + (destX * IMAGEWIDTH)) * 2 * IMAGEHEIGHT)] = (unsigned char)v;
#endif
}
'''
# clkernel = '''
# #define KERNELHALFWIDTH {0}
# #define IMAGEHEIGHT {1}
# #define IMAGEWIDTH {2}
# #define CONVOLVE_HORIZONTAL 1
# '''.format(int(len(imagekernel)/2), image_input.shape[1], image_input.shape[0]) + clkernel
# clProgH = cl.Program(clContext, clkernel).build()
# clkernel = '''
# #define KERNELHALFWIDTH {0}
# #define IMAGEHEIGHT {1}
# #define IMAGEWIDTH {2}
# '''.format(int(len(imagekernel)/2), image_input.shape[1], image_input.shape[0]) + clkernel
# clProgV = cl.Program(clContext, clkernel).build()
To decrease the resolution of an image it will be first blurred using a Gaussian (since this will also be done for the scale spaces anyways) - then one can perform simple subsampling. The subsampling kernel reduces the resolution of an image by a factor of two in each direction. It always reads from the third image in the source buffer and writes into the first image in the target buffer. It will simply get launched for half the number of input pixels - again every pixel is independent of each other so the kernel is very simple:
clkernel_SubsampleGrayscalePyramid = '''//CL//
__kernel void subsampleGrayscalePyramid(
__global unsigned char* lpPyramidLevelIn,
__global unsigned char* lpPyramidLevelOut
) {
int x = get_global_id(1);
int y = get_global_id(0);
// We source from the third image ...
float v = lpPyramidLevelIn[(y*2) + IMAGEHEIGHT + ((x*2) * 2 * IMAGEHEIGHT)];
// And write into the first in the output ... not using IMAGEHEIGHT >> 2 since requiring * 2 again
lpPyramidLevelOut[y + (x * (IMAGEHEIGHT))] = v;
}
'''
# clkernel = '''
# #define IMAGEHEIGHT {0}
# #define IMAGEWIDTH {1}
# '''.format(image_input.shape[1], image_input.shape[0]) + clkernel
# clProgSubsample = cl.Program(clContext, clkernel).build()
The last step for a Difference of Gaussian (DoG) pyramid is the difference step. It will subtract all scale images in a single octave at once and write three different images at the same time. This might not be the wisest implementation due to caching issues - it’s just one of the most simple ones and enough for a first try. The kernel accepts the octaves buffer as well as the single difference map output buffer. In addition the source width and height is specified together with the target offset to locate the position of the fourth image of the previous octaves difference images.
# Definitions:
# IMAGEHEIGHT_DIFFPYRAMID
# IMAGEWIDTH_DIFFPYRAMID
#define IMAGEHEIGHT_DIFFPYRAMID {0}
#define IMAGEWIDTH_DIFFPYRAMID {1}
clkernel_DifferencePyramidDifference = '''//CL//
__kernel void grayscalePyramidDifference(
__global unsigned char* lpPyramidLevelIn,
__global unsigned char* lpDifferenceMapOut,
int targetOffset_X, int targetOffset_Y,
int sourceWidth, int sourceHeight
) {
int x = get_global_id(1);
int y = get_global_id(0);
float v1 = lpPyramidLevelIn[y + (x * sourceHeight)];
float v2 = lpPyramidLevelIn[y + (sourceHeight / 2) + (x * sourceHeight)];
float v3 = lpPyramidLevelIn[y + ((x + sourceWidth / 2) * sourceHeight)];
float v4 = lpPyramidLevelIn[y + sourceHeight / 2 + ((x + sourceWidth / 2) * sourceHeight)];
lpDifferenceMapOut[(y + targetOffset_Y) + (x + targetOffset_X) * IMAGEHEIGHT_DIFFPYRAMID] = v3 - v1;
lpDifferenceMapOut[(y + targetOffset_Y + (sourceHeight >> 1)) + (x + targetOffset_X) * IMAGEHEIGHT_DIFFPYRAMID] = v2 - v3;
lpDifferenceMapOut[(y + targetOffset_Y) + (x + targetOffset_X + (sourceWidth >> 1)) * IMAGEHEIGHT_DIFFPYRAMID] = v4 - v2;
}
'''
The following is just a quick run of all kernels after each other to check they really work as expected before assembling a whole pyramid calculation function
# Load image
image_input = np.asarray(Image.open("imagetest01.jpg").convert('RGBA'), dtype = np.uint8)
h, w, c = image_input.shape[0], image_input.shape[1], image_input.shape[2]
print(f"Source image has dimensions {h} x {w} x {c}")
# Compile our kernel
clkernel_Grayscale = '''
#define IMAGEHEIGHT {0}
#define IMAGEWIDTH {1}
'''.format(image_input.shape[1], image_input.shape[0]) + clkernel_Grayscale_Tpl
clProg_Grayscale = cl.Program(clContext, clkernel_Grayscale).build()
# Image input buffer
clBufferIn = cl.Buffer(clContext, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = image_input)
# Would require / 4 for the ARGB -> Grayscale and * 4 for the 4 images so the size stays the same ...
npBuffer_Scales = [
np.empty(clBufferIn.get_info(cl.mem_info.SIZE), dtype=np.uint8),
np.empty(int(clBufferIn.get_info(cl.mem_info.SIZE) / 4), dtype=np.uint8),
]
clBuffer_Scales = [
cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npBuffer_Scales[0]),
cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npBuffer_Scales[1])
]
krnlGray = clProg_Grayscale.grayscale_RGBA2Pyrabuffer
krnlGray(clQueue, (w, h,), None , clBufferIn, clBuffer_Scales[0])
cl.enqueue_copy(clQueue, npBuffer_Scales[0], clBuffer_Scales[0]) #, origin=(0,0), region=(w,h))
plt.imshow(npBuffer_Scales[0].reshape((h*2, w*2)))
# Release buffers
#for buf in clBuffer_Scales:
# buf.release()
clkernel = '''
#define KERNELHALFWIDTH {0}
#define IMAGEHEIGHT {1}
#define IMAGEWIDTH {2}
#define CONVOLVE_HORIZONTAL 1
'''.format(int(len(imagekernel_GaussianBlurGrayscalePyramid)/2), image_input.shape[1], image_input.shape[0]) + clkernel_GaussianBlurGrayscalePyramid
clProgH = cl.Program(clContext, clkernel).build()
clkernel = '''
#define KERNELHALFWIDTH {0}
#define IMAGEHEIGHT {1}
#define IMAGEWIDTH {2}
'''.format(int(len(imagekernel_GaussianBlurGrayscalePyramid)/2), image_input.shape[1], image_input.shape[0]) + clkernel_GaussianBlurGrayscalePyramid
clProgV = cl.Program(clContext, clkernel).build()
krnlH = clProgH.convolveGrayscalePyramid
krnlV = clProgV.convolveGrayscalePyramid
npKernel = np.asarray(imagekernel_GaussianBlurGrayscalePyramid, dtype = np.float32)
clBufferKernel = cl.Buffer(clContext, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = npKernel)
npTempBuffer = np.empty((h, w), dtype = np.uint8)
clTempBuffer = cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npTempBuffer)
krnlH(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(0), np.int32(0), np.int32(0), np.int32(1))
krnlV(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(0), np.int32(0), np.int32(0), np.int32(1))
krnlH(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(0), np.int32(1), np.int32(1), np.int32(0))
krnlV(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(0), np.int32(1), np.int32(1), np.int32(0))
krnlH(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(1), np.int32(0), np.int32(1), np.int32(1))
krnlV(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(1), np.int32(0), np.int32(1), np.int32(1))
cl.enqueue_copy(clQueue, npBuffer_Scales[0], clBuffer_Scales[0]) #, origin=(0,0), region=(w,h))
plt.imshow(npBuffer_Scales[0].reshape((h*2, w*2)))
clkernel = '''
#define IMAGEHEIGHT {0}
#define IMAGEWIDTH {1}
'''.format(image_input.shape[1], image_input.shape[0]) + clkernel_SubsampleGrayscalePyramid
clProgSubsample = cl.Program(clContext, clkernel).build()
krnlSubsample = clProgSubsample.subsampleGrayscalePyramid
krnlSubsample(clQueue, (int(w / 2), int(h / 2),), None, clBuffer_Scales[0], clBuffer_Scales[1])
cl.enqueue_copy(clQueue, npBuffer_Scales[1], clBuffer_Scales[1])
plt.imshow(npBuffer_Scales[1].reshape((h, w)))
#clBufferKernel.release()
for buf in clBuffer_Scales:
buf.release()
Since the kernels seem to work - now let’s build the whole framework for a pyramid and do some performance tests of this very simple implementation. It will just load the input image using Pillow, convert it into a native array using numpy and transfer it onto the GPU. Note that the data transfer to and from the GPU will not be counted in the performance evaluation. This is done since it’s assumed that the images are already available on the GPU when executing such an task and that there will be more processing being done on the GPU before transferring the images back.
The test program works very simple:
grayscale_RGBA2Pyrabuffer
reference will be fetchedThen comes the execution phase. For each repeated iteration - which are done to get a more clean performance measurement:
The execution phase is times multiple times to get a rough feeling about the performance of the OpenCL implementation
nScales = 6
showDebugPlots = True
# Load image
#image_input = np.asarray(Image.open("imagetest01.jpg").convert('RGBA'), dtype = np.uint8)
image_input = np.asarray(Image.open("butterfly_external_testimage.jpg").convert('RGBA'), dtype = np.uint8)
h, w, c = image_input.shape[0], image_input.shape[1], image_input.shape[2]
print(f"Source image has dimensions {h} x {w} x {c}")
# Compile our kernel
clkernel_Grayscale = '''
#define IMAGEHEIGHT {0}
#define IMAGEWIDTH {1}
'''.format(image_input.shape[1], image_input.shape[0]) + clkernel_Grayscale_Tpl
clProg_Grayscale = cl.Program(clContext, clkernel_Grayscale).build()
krnlGray = clProg_Grayscale.grayscale_RGBA2Pyrabuffer
# Image input buffer
clBufferIn = cl.Buffer(clContext, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = image_input)
# Pyramid output buffer
npPyramidBuffer = np.empty((image_input.shape[1] * 2, image_input.shape[0] * 2), dtype = np.uint8)
clPyramidBuffer = cl.Buffer(clContext, cl.mem_flags.WRITE_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = npPyramidBuffer)
# Difference Kernel
clkernel_Difference = '''
#define IMAGEHEIGHT_DIFFPYRAMID {0}
#define IMAGEWIDTH_DIFFPYRAMID {1}
'''.format(int(image_input.shape[1] * 2), int(image_input.shape[0] * 2)) + clkernel_DifferencePyramidDifference
clProg_Difference = cl.Program(clContext, clkernel_Difference).build()
krnlDifference = clProg_Difference.grayscalePyramidDifference
# Would require / 4 for the ARGB -> Grayscale and * 4 for the 4 images so the size stays the same ...
curSize = clBufferIn.get_info(cl.mem_info.SIZE)
shapeX = image_input.shape[1]
shapeY = image_input.shape[0]
clBuffer_Scales = []
npBuffer_Scales = []
clProgH = []
clProgV = []
clKrnlH = []
clKrnlV = []
clProgSubsample = []
clKrnlSubsample = []
npTempBuffer = []
clTempBuffer = []
for iScale in range(nScales):
print(f"Scale {iScale}: {shapeX} x {shapeY} -> {curSize}")
npBuffer_Scales.append(np.empty(curSize, dtype=np.uint8))
clBuffer_Scales.append(cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npBuffer_Scales[-1]))
# Blurring kernels
clkernel = '''
#define KERNELHALFWIDTH {0}
#define IMAGEHEIGHT {1}
#define IMAGEWIDTH {2}
#define CONVOLVE_HORIZONTAL 1
'''.format(int(len(imagekernel_GaussianBlurGrayscalePyramid)/2), shapeX, shapeY) + clkernel_GaussianBlurGrayscalePyramid
clProgH.append(cl.Program(clContext, clkernel).build())
clkernel = '''
#define KERNELHALFWIDTH {0}
#define IMAGEHEIGHT {1}
#define IMAGEWIDTH {2}
'''.format(int(len(imagekernel_GaussianBlurGrayscalePyramid)/2), shapeX, shapeY) + clkernel_GaussianBlurGrayscalePyramid
clProgV.append(cl.Program(clContext, clkernel).build())
clKrnlH.append(clProgH[-1].convolveGrayscalePyramid)
clKrnlV.append(clProgV[-1].convolveGrayscalePyramid)
# Temporary buffers
npTempBuffer.append(np.empty((shapeY, shapeX), dtype = np.uint8))
clTempBuffer.append(cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npTempBuffer[-1]))
# Subsampling kernels
clkernel = '''
#define IMAGEHEIGHT {0}
#define IMAGEWIDTH {1}
'''.format(shapeX, shapeY) + clkernel_SubsampleGrayscalePyramid
clProgSubsample.append(cl.Program(clContext, clkernel).build())
clKrnlSubsample.append(clProgSubsample[-1].subsampleGrayscalePyramid)
if shapeX % 2 == 0:
shapeX = shapeX + 1
if shapeY % 2 == 0:
shapeY = shapeY + 1
shapeX = int(shapeX / 2)
shapeY = int(shapeY / 2)
# curSize = int(curSize / 4)
curSize = (shapeX * shapeY) * 4
# Now execute the kernels on a single input image ...
durations = []
showDebugPlots = False
targetOffset_X = 0
targetOffset_Y = 0
for iIteration in range(500):
print(f"{iIteration} ... ", end = "")
stime = time()
# Grayscale kernel is only required once. Also grayscaling only runs once ...
krnlGray(clQueue, (w, h,), None , clBufferIn, clBuffer_Scales[0])
if showDebugPlots:
cl.enqueue_copy(clQueue, npBuffer_Scales[0], clBuffer_Scales[0]) #, origin=(0,0), region=(w,h))
plt.imshow(npBuffer_Scales[0].reshape((h*2, w*2)))
plt.show()
for iScale in range(nScales):
clKrnlH[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(0), np.int32(0), np.int32(0), np.int32(1))
clKrnlV[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(0), np.int32(0), np.int32(0), np.int32(1))
clKrnlH[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(0), np.int32(1), np.int32(1), np.int32(0))
clKrnlV[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(0), np.int32(1), np.int32(1), np.int32(0))
clKrnlH[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(1), np.int32(0), np.int32(1), np.int32(1))
clKrnlV[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(1), np.int32(0), np.int32(1), np.int32(1))
cl.enqueue_copy(clQueue, npBuffer_Scales[iScale], clBuffer_Scales[iScale]) #, origin=(0,0), region=(w,h))
if showDebugPlots:
plt.imshow(npBuffer_Scales[iScale].reshape((h*2, w*2)))
plt.show()
if iScale != (nScales - 1):
clKrnlSubsample[iScale](clQueue, (int(w / 2), int(h / 2),), None, clBuffer_Scales[iScale], clBuffer_Scales[iScale + 1])
# if showDebugPlots:
# cl.enqueue_copy(clQueue, npBuffer_Scales[iScale + 1], clBuffer_Scales[iScale + 1])
# plt.imshow(npBuffer_Scales[iScale + 1].reshape((h, w)))
# Perform difference calculation
krnlDifference(clQueue, (w, h,), None, clBuffer_Scales[iScale], clPyramidBuffer, np.int32(targetOffset_X), np.int32(targetOffset_Y), np.int32(2*h), np.int32(2*w))
targetOffset_X = targetOffset_X + h
targetOffset_Y = targetOffset_Y + w
w = int(w / 2)
h = int(h / 2)
etime = time()
durations.append((etime - stime) * 1000)
# Fetch LAST Pyramid into host buffer ...
cl.enqueue_copy(clQueue, npPyramidBuffer, clPyramidBuffer)
fig, ax = plt.subplots(figsize = (6.4 * 4, 4.8 * 4))
npPyramidBuffer = (((npPyramidBuffer - np.min(npPyramidBuffer)) / np.max(npPyramidBuffer)) * 255.0).astype(np.uint8)
ax.imshow(npPyramidBuffer.reshape((image_input.shape[0] * 2, image_input.shape[1] * 2)))
#plt.savefig("imagepyramids2_initialpyramid.png")
plt.show()
fig, ax = plt.subplots()
ax.set_xlabel("Iteration")
ax.set_ylabel("Duration [ms]")
ax.plot(durations)
ax.grid()
ax.set_title(f"DoG pyramid, {nScales} scales (average {np.mean(durations)} ms)")
#plt.savefig("imagepyramids2_initialpyramid_timing.png")
#plt.savefig("imagepyramids2_initialpyramid_timing.svg")
plt.show()
In the end one has to clean up allocated resources again:
clBufferKernel.release()
clPyramidBuffer.release()
clBufferIn.release()
for i in range(nScales):
clBuffer_Scales[i].release()
clBuffer_Scales[i] = None
clProgH[i] = None
clProgV[i] = None
clKrnlH[i] = None
clKrnlV[i] = None
clProgSubsample[i] = None
clKrnlSubsample[i] = None
clTempBuffer[i].release()
clTempBuffer[i] = None
This blog article has demonstrated that it’s pretty simple to calculate Lowes image pyramid schema on the GPU using OpenCL. It’s by far not the most performant (I would argue maybe one of the worst) implementations but it can provide a base for optimizing from a working implementation by modifying one kernel after each other. This blog article shows a testbed in which one can play around with image pyramids as well as the optimization of kernels in a very simple way. All kernels have been written in a way to be as readable as possible.
This article is tagged: Programming, Math, Basics, Python, Data Mining, Computer Vision, Machine learning, Gaussian
Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)
This webpage is also available via TOR at http://rh6v563nt2dnxd5h2vhhqkudmyvjaevgiv77c62xflas52d5omtkxuid.onion/