I'm working on a C++ implementation of a "log transform" operator on an image, and we currently have it formulated as follows in Python (note we use `log10`, not the natural log):
```python
def log_transform(mat: np.ndarray, dB: int = 40):
    mat = 255 * (20 / dB * np.log10(mat / mat.max()) + 1)
    mat[mat < 1] = 0
    return mat
```
Now in OpenCV and C++, I have the following implementation and I'm not happy with its performance. What's confusing to me is that I have another function that does a Hilbert transform on the same image (FFT, IFFT, copying into buffers, etc.) and it has the same average run time as this `log_transform` function, which should be O(n) and transforms in place.
```cpp
void log_transform(cv::Mat& mat, double db) {
    assert(mat.isContinuous());
    double maxval;
    cv::minMaxIdx(mat, nullptr, &maxval);
    // OpenCV only implements the natural log, and we need log10
    double* const ptr = (double*)mat.data;
    const double fct = 20. / db;
    // i and v are declared inside the loop, so no private() clause is needed
    #pragma omp parallel for
    for (int i = 0; i < mat.rows * mat.cols; i++) {
        double& v = *(ptr + i);
        v = 255. * (fct * std::log10(v / maxval) + 1.);
        v = v < 1. ? 0. : v;
    }
}
```
Interestingly, using `#pragma omp parallel for` doesn't improve the average runtime (if anything, a ~10% slowdown). I set `omp_set_num_threads` to the number of physical cores I have, not the number of virtual cores with hyperthreading (setting it to the number of virtual cores actually degrades performance even more, probably due to the sync overhead). I've also tried to tune `schedule(static, CHUNK_SIZE)` with no success.
From profilers (I used both the Visual Studio performance profiler and Intel VTune), most of the CPU time is spent in `std::log10`. I'm not sure what else to look for in the profiler output to help me improve this function.
On an image of input size (1000, 6400), the runtimes are:
- Python: 146 ms ± 4.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
- C++: 108 ms ± 8.72 ms (mean ± std. dev. of 15 runs)
Compiler: Visual Studio 2019 (Version 16.11.23, which I believe implements OpenMP 2.x). MSVC and Clang give similar results.
CPU: Intel i5 Haswell
My questions:
- How can I improve the performance of this `log_transform` function? I thought about SIMD, but I'm not sure how to vectorize `log10`.
- Why doesn't OpenMP improve the performance of the loop?
Edit: the results below are after enabling the `/openmp` compiler flag (as pointed out in the comments, without it the `omp.h` header just stubs everything out and the pragmas are silently ignored). As suggested by Doc Brown, I also hoisted the loop bounds `nrows` and `ncols` out of the loop definition, in case the compiler doesn't know they're constant.
Raw OpenMP version:
```cpp
void log_transform(cv::Mat& mat, double db) {
    assert(mat.isContinuous());
    double maxval;
    cv::minMaxIdx(mat, nullptr, &maxval);
    const double fct = 20. / db;
    const int nrows = mat.rows;
    const int ncols = mat.cols;
    #pragma omp parallel for
    for (int i = 0; i < nrows; ++i) {
        auto row = (double*)mat.ptr(i);
        for (int j = 0; j < ncols; ++j) {
            double& v = row[j];
            v = 255. * (fct * std::log10(v / maxval) + 1.);
            v = v < 1. ? 0. : v;
        }
    }
}
```
OpenCV `parallel_for_` version:
```cpp
void log_transform(cv::Mat& mat, double db) {
    assert(mat.isContinuous());
    double maxval;
    cv::minMaxIdx(mat, nullptr, &maxval);
    const double fct = 20. / db;
    const int nrows = mat.rows;
    const int ncols = mat.cols;
    // OpenCV only implements the natural log, and we need log10
    auto par_log_transform = [&mat, maxval, fct, ncols](const cv::Range& range) {
        for (int i = range.start; i < range.end; ++i) {
            auto row = (double*)mat.ptr(i);
            for (int j = 0; j < ncols; ++j) {
                double& v = row[j];
                v = 255. * (fct * std::log10(v / maxval) + 1.);
                v = v < 1. ? 0. : v;
            }
        }
    };
    cv::parallel_for_(cv::Range(0, nrows), par_log_transform);
}
```
Use the natural log, as suggested by user Useless:
```cpp
void log_transform(cv::Mat& mat, double db) {
    assert(mat.isContinuous());
    double maxval;
    cv::minMaxIdx(mat, nullptr, &maxval);
    cv::log(mat / maxval, mat);  // natural log; rescaled to log10 below
    // std::log, not cv::log: we need the scalar ln(10) for the change of base
    mat = 255. * (20. / db / std::log(10.) * mat + 1.);
}
```
Runtime:
- raw OpenMP: 35.8 ± 4.7 ms (15 runs)
- OpenCV `parallel_for_`: 29.1 ± 5.4 ms (15 runs)
- natural log version: 89.0 ± 12.8 ms (15 runs)
I'm not sure which backend OpenCV's `parallel_for_` uses (it apparently supports a lot of backends, such as OpenMP, Intel TBB, ...), but it seems to optimize better than my custom OpenMP loop. The current winner is OpenCV's `parallel_for_`.
Comments:
- ... `/openmp` in the compiler flag, and apparently the `omp.h` header just stubs it out and the compiler simply ignores all the `omp` directives...
- ... `double` matrix, but you mentioned this is image data. In what range are the values in this matrix? And are these just grey-scale values? I ask because with some range constraints it might be possible to replace `std::log10` by a lookup table.
- `log10(x) = log2(x) / log2(10)`, right? You can just pre-compute `log2(10)` and use fast `log2` operations.
- ... `double v = row[j];`, then the calculation, and finally `row[j] = v;`. Not sure what will happen, but there might be a chance the compiler will produce some code which keeps the intermediate values longer in a CPU register, and not store them back too early to memory.