I'm working on a C++ implementation of a "log transform" operator on an image, and we currently have it formulated as follows in Python (note we use `log10`, not the natural log):
```python
def log_transform(mat: np.ndarray, dB: int = 40):
    mat = 255 * (20 / dB * np.log10(mat / mat.max()) + 1)
    mat[mat < 1] = 0
    return mat
```
Now in OpenCV and C++, I have the following implementation and I'm not happy with its performance. What's confusing to me is that I have another function that does a Hilbert transform on the same image (FFT, IFFT, copying into buffers, etc.) and it has the same average run time as this `log_transform` function, which should be O(n) and transforms in place.
```cpp
void log_transform(cv::Mat& mat, double db) {
    assert(mat.isContinuous());
    double maxval;
    cv::minMaxIdx(mat, nullptr, &maxval);
    // OpenCV only implements the natural log, and we need log10
    double* const ptr = (double*)mat.data;
    const double fct = 20. / db;
    // i and v are declared inside the loop, so no private() clause is needed
    #pragma omp parallel for
    for (int i = 0; i < mat.rows * mat.cols; i++) {
        double& v = *(ptr + i);
        v = 255. * (fct * std::log10(v / maxval) + 1.);
        v = v < 1. ? 0. : v;
    }
}
```
Interestingly, using `#pragma omp parallel for` doesn't improve the average runtime (if anything, a ~10% slowdown). I set `omp_set_num_threads` to the number of physical cores I have, not the number of virtual cores with hyperthreading (setting it to the number of virtual cores actually degrades performance even more, probably due to the sync overhead). I've also tried to tune `schedule(static, CHUNK_SIZE)` with no success.
From profilers (I used both the Visual Studio performance profiler and Intel VTune), most of the CPU time is spent in `std::log10`. I'm not sure what else to look for in the profiler output to help me improve this function.
On an image of input size (1000, 6400), the runtimes are:
- Python: 146 ms ± 4.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
- C++: 108 ms ± 8.72 ms (mean ± std. dev. of 15 runs)
Compiler: Visual Studio 2019 (Version 16.11.23, which I believe implements OpenMP 2.x). MSVC and Clang give similar results.
CPU: Intel i5 Haswell
My questions:
- How can I improve the performance of this `log_transform` function? I thought about SIMD, but I'm not sure how to vectorize `log10`.
- Why doesn't OpenMP improve the performance of the loop?
Edit: the results below are after enabling the `/openmp` compiler flag (as pointed out in the comments, without it the `omp.h` header just stubs everything out and the pragmas are silently ignored). As suggested by Doc Brown, I also hoisted the loop bounds `nrows` and `ncols` out of the loop definition, in case the compiler doesn't know they're constant.
Raw OpenMP version:
```cpp
void log_transform(cv::Mat& mat, double db) {
    assert(mat.isContinuous());
    double maxval;
    cv::minMaxIdx(mat, nullptr, &maxval);
    const double fct = 20. / db;
    const int nrows = mat.rows;
    const int ncols = mat.cols;
    #pragma omp parallel for
    for (int i = 0; i < nrows; ++i) {
        auto row = (double*)mat.ptr(i);
        for (int j = 0; j < ncols; ++j) {
            double& v = row[j];
            v = 255. * (fct * std::log10(v / maxval) + 1.);
            v = v < 1. ? 0. : v;
        }
    }
}
```
OpenCV `parallel_for_` version:
```cpp
void log_transform(cv::Mat& mat, double db) {
    assert(mat.isContinuous());
    double maxval;
    cv::minMaxIdx(mat, nullptr, &maxval);
    const double fct = 20. / db;
    const int nrows = mat.rows;
    const int ncols = mat.cols;
    // OpenCV only implements the natural log, and we need log10
    auto par_log_transform = [&mat, maxval, fct, ncols](const cv::Range& range) {
        for (int i = range.start; i < range.end; ++i) {
            auto row = (double*)mat.ptr(i);
            for (int j = 0; j < ncols; ++j) {
                double& v = row[j];
                v = 255. * (fct * std::log10(v / maxval) + 1.);
                v = v < 1. ? 0. : v;
            }
        }
    };
    cv::parallel_for_(cv::Range(0, nrows), par_log_transform);
}
```
Use the natural log, as suggested by user Useless:
```cpp
void log_transform(cv::Mat& mat, double db) {
    assert(mat.isContinuous());
    double maxval;
    cv::minMaxIdx(mat, nullptr, &maxval);
    cv::log(mat / maxval, mat);  // natural log; rescaled to log10 below
    // std::log, not cv::log: we need the scalar ln(10) for the change of base
    mat = 255. * (20. / db / std::log(10.) * mat + 1.);
}
```
Runtime:
- raw OpenMP: 35.8 ± 4.7 ms (15 runs)
- OpenCV `parallel_for_`: 29.1 ± 5.4 ms (15 runs)
- natural log version: 89.0 ± 12.8 ms (15 runs)
I'm not sure which backend OpenCV's `parallel_for_` uses (it apparently supports a lot of backends, such as OpenMP, Intel TBB, ...), but it seems to optimize better than my custom OpenMP loop. The current winner is OpenCV's `parallel_for_`.
Comments:
- ... `/openmp` in the compiler flag, and apparently the `omp.h` header just stubs it out and the compiler simply ignores all the `omp` directives...
- ... `double` matrix, but you mentioned this is image data. In what range are the values in this matrix? And are these just grey-scale values? I ask because with some range constraints it might be possible to replace `std::log10` by a lookup table.
- `log10(x) = log2(x) / log2(10)`, right? You can just pre-compute `log2(10)` and use fast `log2` operations.
- ... `double v = row[j];`, then the calculation, and finally `row[j] = v;`. Not sure what will happen, but there might be a chance the compiler will produce some code which keeps the intermediate values longer in a CPU register, and not store them back too early to memory.