2
\$\begingroup\$

I tried to parallelize the loop and got a good result, but it is still not fast enough. This post is a follow-up to a recent one in which I optimized other parts of the code using a lookup table and spatial and temporal relationships. Those parts are omitted from the code below for simplicity.

The loop in question is in the hist function. Do you have any suggestions for optimizing the loop to make it run faster?

I think it is important to mention the hardware I'll be using: Ambarella’s CV25. I know there are hardware-level optimizations such as SIMD. I'm not very familiar with that kind of low-level programming, but I'm open to any solution.

Here are more details about the hardware:

enter image description here

#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>
// Structure to hold cached parameters
struct Cache {
 std::vector<int> data_b;
 std::vector<int> data_g;
 std::vector<int> data_r;
 std::vector<uchar> lut_b;
 std::vector<uchar> lut_g;
 std::vector<uchar> lut_r;
};
// Function to compute simple example data and lookup tables
void compute_data(const cv::Mat& image, Cache& cache)
{
 // Simple example to initialize data
 cache.data_b.assign(256, 1);
 cache.data_g.assign(256, 2);
 cache.data_r.assign(256, 3);
 // Compute lookup tables
 cache.lut_b.resize(256);
 cache.lut_g.resize(256);
 cache.lut_r.resize(256);
 for (int i = 0; i < 256; i++) {
 cache.lut_b[i] = static_cast<uchar>(i);
 cache.lut_g[i] = static_cast<uchar>(i);
 cache.lut_r[i] = static_cast<uchar>(i);
 }
}
void hist(cv::Mat& image, Cache& cache, bool use_cache)
{
 if (!use_cache) {
 compute_data(image, cache);
 }
 // Apply transformation using lookup tables in parallel
 cv::parallel_for_(cv::Range(0, image.rows), [&](const cv::Range& range) {
 for (int i = range.start; i < range.end; ++i)
 {
 cv::Vec3b* row = image.ptr<cv::Vec3b>(i);
 for (int j = 0; j < image.cols; ++j)
 {
 cv::Vec3b& pxi = row[j];
 pxi[0] = cache.lut_b[pxi[0]];
 pxi[1] = cache.lut_g[pxi[1]];
 pxi[2] = cache.lut_r[pxi[2]];
 }
 }
 });
}
int main(int argc, char** argv)
{
 // Open the video file
 cv::VideoCapture cap("../video.mp4");
 if (!cap.isOpened()) {
 std::cerr << "Error opening video file" << std::endl;
 return -1;
 }
 // Get the frame rate of the video
 double fps = cap.get(cv::CAP_PROP_FPS);
 int delay = static_cast<int>(1000 / fps);
 // Create a window to display the video
 cv::namedWindow("Processed Video", cv::WINDOW_NORMAL);
 cv::Mat frame;
 Cache cache;
 int frame_count = 0;
 int recompute_interval = 5; // Recompute every 5 frames
 while (true) {
 cap >> frame;
 if (frame.empty()) {
 break;
 }
 // Determine whether to use the cache or recompute the data
 bool use_cache = (frame_count % recompute_interval != 0);
 // Process the frame using cached or recomputed parameters
 hist(frame, cache, use_cache);
 // Display the processed frame
 cv::imshow("Processed Video", frame);
 // Break the loop if 'q' is pressed
 if (cv::waitKey(delay) == 'q') {
 break;
 }
 frame_count++;
 }
 cap.release();
 cv::destroyAllWindows();
 return 0;
}
asked Jul 22, 2024 at 9:43
\$\endgroup\$
15
  • 2
    \$\begingroup\$ I think your title is missing something \$\endgroup\$ Commented Jul 22, 2024 at 10:05
  • \$\begingroup\$ What is "faster"? Do you want it to run 10% faster, or 10x as fast? \$\endgroup\$ Commented Jul 22, 2024 at 19:41
  • \$\begingroup\$ @CrisLuengo I just want to go as far as I can with the optimization, but x10 is good enough \$\endgroup\$ Commented Jul 22, 2024 at 20:08
  • \$\begingroup\$ What parallelization method is your OpenCV built with? It does many different ones, maybe some are better than others? Some might start threads every time you start the parallel loop, instead of starting threads only once at the beginning of the program. Have you tried a different parallel model? \$\endgroup\$ Commented Jul 22, 2024 at 20:49
  • 1
    \$\begingroup\$ You can use any multi-threading library, or OpenMP, to simplify your work. Thread management is not trivial, if you can leave it to a library instead of starting with the low-level stdlib functionality you'll be better off. The only thing that parallel_for_ does is create threads, split the range into the number of threads, and call your worker function once within each thread. Your task is to move that thread creation to the start of the program. \$\endgroup\$ Commented Jul 23, 2024 at 20:52

3 Answers

7
\$\begingroup\$

Use std::array instead of std::vector

Since your lookup tables (LUTs) will have exactly 256 entries, just use std::array<..., 256> instead of std::vector for them. This avoids some pointer indirection and memory allocations.
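As a minimal sketch of this suggestion, the tables can be built and returned by value with no heap allocation at all. The names ChannelTable, BgrTables, make_identity_table and make_inverted_table are hypothetical and not part of the original code; the inverted table is only there to show a mapping that is not the identity.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical alias: one 256-entry table, stored inline (no heap allocation).
using ChannelTable = std::array<std::uint8_t, 256>;

// Hypothetical aggregate replacing the three std::vector<uchar> members.
struct BgrTables {
    ChannelTable b;
    ChannelTable g;
    ChannelTable r;
};

// Identity mapping: table[i] == i, like the loop in compute_data().
inline ChannelTable make_identity_table() {
    ChannelTable table{};
    for (std::size_t i = 0; i < table.size(); ++i) {
        table[i] = static_cast<std::uint8_t>(i);
    }
    return table;
}

// Example of a non-trivial mapping: table[i] == 255 - i.
inline ChannelTable make_inverted_table() {
    ChannelTable table{};
    for (std::size_t i = 0; i < table.size(); ++i) {
        table[i] = static_cast<std::uint8_t>(255 - i);
    }
    return table;
}
```

Because the size is part of the type, a BgrTables object can live on the stack or inside another object, and indexing it does not go through a separately allocated buffer.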

Remove unused variables

Are the vectors data_* ever used? The code you have posted initializes them but doesn't actually use them for anything else. Just remove them.

Naming things

Why is the struct holding the lookup table named a Cache? I would just name it LUT or LookupTable.

The parameter use_cache is named deceptively. It doesn't tell whether to use the cache or not, since the function hist() will always use cache to transform the image. Instead, it determines whether to (re)calculate the lookup table. So I would rather rename it to recalculate_lut, but even better would be to remove that entirely, and if the caller wants the lookup table to be (re)calculated, it can call compute_data() itself.

compute_data() is also a very generic name. The function as it is now will create a linear lookup table, so perhaps rename it to compute_linear_lut().

Move functionality into the lookup table itself

Your struct Cache just holds data, nothing else. Consider turning it into a class LUT that also has functions to initialize the lookup tables and apply them to a pixel:

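// Needs <array>, <cstdint> and <numeric> (for std::iota)
class LUT {
    std::array<std::uint8_t, 256> r;
    std::array<std::uint8_t, 256> g;
    std::array<std::uint8_t, 256> b;
public:
    // Start with identity tables: every value maps to itself
    LUT() {
        std::iota(r.begin(), r.end(), 0);
        std::iota(g.begin(), g.end(), 0);
        std::iota(b.begin(), b.end(), 0);
    }
    // const, so the table can be applied through a const LUT&
    cv::Vec3b operator()(cv::Vec3b input) const {
        return {b[input[0]], g[input[1]], r[input[2]]};
    }
};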

By overloading operator(), you can apply the LUT like this:

void hist(..., const LUT& lut)
{
 ...
 for (int j = 0; j < image.cols; ++j)
 {
 cv::Vec3b& pxi = row[j];
 pxi = lut(pxi);
 }
 ...
}

It has some other advantages as well, as I'll show below.

You don't care about rows and columns

You are iterating over rows and columns, but applying a LUT is just a per-pixel operation that doesn't care about which row or column it is in. You can just iterate over all the elements of a cv::Mat. While you can still parallelize that using cv::parallel_for_, it's also possible to use C++'s own parallelization features. For example, you could write:

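void hist(cv::Mat& image, const LUT& lut)
{
    // Needs <algorithm> and <execution>; the image must hold 8-bit, 3-channel pixels.
    // The output iterator equals the input iterator, so the image is transformed in place.
    std::transform(std::execution::par,
                   image.begin<cv::Vec3b>(), image.end<cv::Vec3b>(),
                   image.begin<cv::Vec3b>(), lut);
}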

This makes use of the parallel form of std::transform(), which like OpenCV will automatically create threads to split the work amongst. While lut is a variable of type LUT, since it has an operator(), it can work like a function, so you can pass it to std::transform() here without having to wrap it into a lambda.
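To try the idea without OpenCV, here is a minimal standard-library sketch. Pixel is an assumed stand-in for cv::Vec3b, and Lut and apply_lut are hypothetical names. It uses the sequential overload of std::transform. The parallel overload takes std::execution::par as an extra first argument and needs <execution> (C++17); as far as I know, with GCC it usually also requires linking against TBB.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

// Stand-in for cv::Vec3b (assumption): three bytes in B, G, R order.
using Pixel = std::array<std::uint8_t, 3>;
using Table = std::array<std::uint8_t, 256>;

// Function object holding one table per channel.
class Lut {
    Table b_, g_, r_;
public:
    Lut(const Table& b, const Table& g, const Table& r) : b_(b), g_(g), r_(r) {}

    // const, so it can be called through a const Lut&.
    Pixel operator()(Pixel in) const {
        return {b_[in[0]], g_[in[1]], r_[in[2]]};
    }
};

// Transforms every pixel in place; rows and columns never appear.
inline void apply_lut(std::vector<Pixel>& pixels, const Lut& lut) {
    std::transform(pixels.begin(), pixels.end(), pixels.begin(), lut);
}
```

The function object is copied into std::transform, which is cheap here because it holds only three 256-byte tables.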

Framerate issues

It could very well be that hist() is not fast enough, depending on whether you compiled your code with optimizations enabled or not, and how large the frames of your movie are. However, regardless of how fast it is, your main() function will never display the processed frames at the right framerate. The problem is that in the while-loop, you read a frame, process it, display it, which will all take some amount of time, and only then will you wait for delay time. So each iteration of the loop will take more than delay time.

You will need to check the actual time (using std::chrono::steady_clock::now()), and then choose a delay value that compensates for the time already spent doing the other processing.
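A minimal sketch of that compensation could look like the following. The helper name remaining_delay_ms is hypothetical. The result is clamped to at least 1 ms because cv::waitKey() treats a delay of 0 or less as "wait indefinitely".

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical helper: how many milliseconds are left in the current frame's
// time slot, given when work on the frame started and the current time.
inline int remaining_delay_ms(std::chrono::steady_clock::time_point frame_start,
                              std::chrono::steady_clock::time_point now,
                              double fps) {
    assert(fps > 0.0); // some sources may report 0 fps; handle that before calling
    using Milliseconds = std::chrono::duration<double, std::milli>;
    const Milliseconds period(1000.0 / fps);
    const Milliseconds elapsed = now - frame_start;
    const double remaining = (period - elapsed).count();
    // Never return less than 1 ms, even if processing took longer than a frame.
    return std::max(1, static_cast<int>(remaining));
}
```

In the loop, you would record std::chrono::steady_clock::now() before cap >> frame and then call cv::waitKey(remaining_delay_ms(frame_start, std::chrono::steady_clock::now(), fps)) instead of cv::waitKey(delay).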

Faster image processing

It is likely that you could get better performance using the appropriate Arm NEON instructions, for example by making use of the TBL instruction. You could use compiler intrinsics instead of having to write assembly, but it will still require a good understanding of the Arm instruction set.

The Ambarella CV25S SoC seems to have dedicated hardware to do image processing, including support for color correction, which very likely is done using lookup tables, similar to your code. If you can find out how to make use of those hardware blocks, you can off-load the CPU. If the input video is in H.264 or H.265 format, then that SoC can also do the decoding of that for you. Maybe OpenCV already makes use of that, but if not, then it will have to do it on the CPU, which might be a bit much for a Cortex-A53.

answered Jul 24, 2024 at 15:26
\$\endgroup\$
1
  • \$\begingroup\$ Thank you for the detailed answer, very informative. I've tried to implement your remarks in my answer. As for the intrinsics solution, I need to learn that; I have no experience with it yet. \$\endgroup\$ Commented Jul 29, 2024 at 19:22
3
\$\begingroup\$

This is not a code review, I just wanted to show a way to create threads only once at the start of the program.

I'm using OpenMP for parallelism here, because it's the system I know best. It is very easy to use, but also doesn't allow for very fancy stuff. OpenMP is implemented by your compiler. You need to enable OpenMP both in the compilation and the linking step. The compiler will ignore the OpenMP pragmas if you don't enable it, making the program single-threaded.
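To illustrate only that last point, here is a small sketch that gives the same result whether or not OpenMP is enabled. The helper name apply_table is hypothetical, and -fopenmp is the flag used by GCC and Clang; other compilers use a different switch.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical helper: applies one table to a flat byte buffer.
// With OpenMP enabled (e.g. -fopenmp for GCC and Clang) the iterations are
// shared among threads; without it the pragma is ignored and the loop runs
// on a single thread with the same result.
inline void apply_table(std::vector<std::uint8_t>& data,
                        const std::array<std::uint8_t, 256>& table) {
    const long count = static_cast<long>(data.size());
    #pragma omp parallel for
    for (long i = 0; i < count; ++i) {
        data[i] = table[data[i]];
    }
}
```

Each iteration touches only its own element, so no synchronization is needed inside the loop.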

This is the exact same code as in the OP, I didn't bother to change anything except adding the OpenMP pragmas. I also had to move the code from hist() into main(), I don't know if the original logic is possible using OpenCV. I have not even tried to compile the code, things might not work as advertised, but this is more or less what it would look like:

int main( int argc, char** argv ) {
 // Open the video file
 cv::VideoCapture cap( "../video.mp4" );
 if( !cap.isOpened() ) {
 std::cerr << "Error opening video file" << std::endl;
 return -1;
 }
 // Get the frame rate of the video
 double fps = cap.get( cv::CAP_PROP_FPS );
 int delay = static_cast< int >( 1000 / fps );
 // Create a window to display the video
 cv::namedWindow( "Processed Video", cv::WINDOW_NORMAL );
 cv::Mat frame;
 Cache cache;
 int frame_count = 0;
 int recompute_interval = 5; // Recompute every 5 frames
    bool done = false; // Shared flag: written only by the master thread, read by all threads
    #pragma omp parallel
    while( true ) { // The parallel section starts here, we've got all threads running now
        #pragma omp master // The next code block is run only by the master thread
        {
            if( !done ) { // 'q' may already have been pressed during the previous iteration
                cap >> frame;
                if( frame.empty() ) {
                    done = true; // We cannot break here: jumping out of an OpenMP block is not allowed
                }
                // Recompute the data every few frames
                else if( frame_count % recompute_interval == 0 ) {
                    compute_data( frame, cache ); // You'll have to figure out how to do this one in parallel too
                }
            }
        }
        #pragma omp barrier // The other threads wait until the master thread is done with the code above
        if( done ) {
            break; // Every thread sees the same value here, so they all leave the loop together
        }
        // Process the frame using cached or recomputed parameters
        #pragma omp for
        for( int i = 0; i < frame.rows; ++i ) { // This loop is run in parallel, OpenMP figures out how to split it among the threads
            cv::Vec3b* row = frame.ptr< cv::Vec3b >( i );
            for( int j = 0; j < frame.cols; ++j ) { // This loop is not parallelized
                cv::Vec3b& pxi = row[j];
                pxi[0] = cache.lut_b[pxi[0]];
                pxi[1] = cache.lut_g[pxi[1]];
                pxi[2] = cache.lut_r[pxi[2]];
            }
        }
        #pragma omp master // Again, only the master thread does this part
        {
            // Display the processed frame
            cv::imshow( "Processed Video", frame );
            // Request the end of the loop if 'q' is pressed
            if( cv::waitKey( delay ) == 'q' ) {
                done = true; // Picked up by all threads after the barrier at the top of the loop
            }
            frame_count++;
        }
        #pragma omp barrier // All threads complete this loop iteration at the same time
    } // This is the end of the parallel section
 cap.release();
 cv::destroyAllWindows();
 return 0;
}

You can use any other multithreading library for this. Other libraries might allow you to write more modular or pretty code. But the idea is always the same: don't create threads anew for every image you process, create threads once at the start of the program, and have them do the work of processing each image in parallel. Creating threads takes a bit of time.
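The same "create the threads once" idea can be sketched with only the standard library. This is a minimal illustration under stated assumptions, not a production thread pool: the class name PersistentWorkers and its interface are hypothetical, parallel_for() must be called from one thread at a time, and the body must not throw.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: worker threads are created once and reused for every frame.
class PersistentWorkers {
public:
    using Body = std::function<void(int, int)>;

    explicit PersistentWorkers(int count) {
        assert(count > 0);
        for (int id = 0; id < count; ++id) {
            threads_.emplace_back([this, id, count] { work(id, count); });
        }
    }

    ~PersistentWorkers() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
            ++generation_; // wake the workers one last time so they can exit
        }
        wake_.notify_all();
        for (std::thread& t : threads_) t.join();
    }

    PersistentWorkers(const PersistentWorkers&) = delete;
    PersistentWorkers& operator=(const PersistentWorkers&) = delete;

    // Splits [begin, end) into one chunk per worker, calls body(lo, hi) on
    // each non-empty chunk, and returns once every worker has finished.
    void parallel_for(int begin, int end, const Body& body) {
        assert(begin <= end);
        std::unique_lock<std::mutex> lock(mutex_);
        begin_ = begin;
        end_ = end;
        body_ = &body;
        pending_ = static_cast<int>(threads_.size());
        ++generation_;
        wake_.notify_all();
        done_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void work(int id, int count) {
        unsigned seen = 0;
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            // Sleep until a new job (or the shutdown request) is published.
            wake_.wait(lock, [&] { return generation_ != seen; });
            seen = generation_;
            if (stopping_) return;
            const int begin = begin_;
            const int end = end_;
            const Body* body = body_;
            lock.unlock();
            const int chunk = (end - begin + count - 1) / count; // ceiling division
            const int lo = std::min(begin + id * chunk, end);
            const int hi = std::min(lo + chunk, end);
            if (lo < hi) (*body)(lo, hi);
            lock.lock();
            if (--pending_ == 0) done_.notify_one();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_, done_;
    const Body* body_ = nullptr;
    int begin_ = 0, end_ = 0, pending_ = 0;
    unsigned generation_ = 0;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};
```

In the video loop, the object would be constructed once before the while loop, and each frame would call something like workers.parallel_for(0, frame.rows, ...) with a lambda that processes the rows in [lo, hi).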

answered Jul 30, 2024 at 15:49
\$\endgroup\$
2
  • \$\begingroup\$ Okay thank you very much, I didn't know that right after "#pragma omp parallel" I get all the threads ready. \$\endgroup\$ Commented Jul 30, 2024 at 20:15
  • 1
    \$\begingroup\$ @Ja_cpp The one statement or block (enclosed in {}) after #pragma omp parallel is executed in parallel on all threads. So the whole while() block, in this case, is run in parallel. Each thread runs the same code, except where another #pragma omp tells them to do something different. \$\endgroup\$ Commented Jul 30, 2024 at 20:43
1
\$\begingroup\$

Inspired by the advice and solutions from @CrisLuengo and @G.-Sliepen, I implemented a class that wraps the call to cv::parallel_for_. The results are 1810 fps vs 1974 fps, which is already a good start. The fps values were computed with std::chrono by processing each frame 20 times and averaging over 180 frames of the video. In the plot, the blue curve shows the result after optimization:

enter image description here

Thank you. Here is a working code:

#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>
#include <thread>
#include <numeric> // For std::iota
#include <array>
// Structure to hold cached parameters
struct LookupTable {
 std::array<uchar, 256> lut_b;
 std::array<uchar, 256> lut_g;
 std::array<uchar, 256> lut_r;
};
// ParallelExecutor class
class ParallelExecutor {
public:
 ParallelExecutor(int numThreads) : numThreads(numThreads) {}
 template<typename Func, typename... Args>
 void parallelFor(int start, int end, Func func, Args&&... args) {
 int rangeSize = end - start;
 int chunkSize = (rangeSize + numThreads - 1) / numThreads;
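        auto parallelLambda = [&](const cv::Range& range) {
            // OpenCV may pass several chunk indices in one call, so visit each of them
            for (int chunk = range.start; chunk < range.end; ++chunk) {
                int localStart = std::min(start + chunk * chunkSize, end);
                int localEnd = std::min(localStart + chunkSize, end);
                // args are passed as lvalues because func is called more than once
                func(cv::Range(localStart, localEnd), args...);
            }
        };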
 cv::parallel_for_(cv::Range(0, numThreads), parallelLambda);
 }
private:
 int numThreads;
};
// Function to compute simple example data and lookup tables
void compute_data(const cv::Mat& image, LookupTable& lut) {
 for (int i = 0; i < 256; i++) {
 lut.lut_b[i] = static_cast<uchar>(i);
 lut.lut_g[i] = static_cast<uchar>(i);
 lut.lut_r[i] = static_cast<uchar>(i);
 }
}
void hist_worker(const cv::Range& range, cv::Mat& image, LookupTable& lut) {
 for (int i = range.start; i < range.end; ++i) {
 cv::Vec3b* row = image.ptr<cv::Vec3b>(i);
 for (int j = 0; j < image.cols; ++j) {
 cv::Vec3b& pxi = row[j];
 pxi[0] = lut.lut_b[pxi[0]];
 pxi[1] = lut.lut_g[pxi[1]];
 pxi[2] = lut.lut_r[pxi[2]];
 }
 }
}
void hist(cv::Mat& image, LookupTable& lut, bool use_cache, ParallelExecutor& executor) {
 if (!use_cache) {
 compute_data(image, lut);
 }
 // Apply transformation using lookup tables in parallel
 executor.parallelFor(0, image.rows, hist_worker, image, lut);
}
int main(int argc, char** argv) {
 // Open the video file
 cv::VideoCapture cap("video.mp4");
 if (!cap.isOpened()) {
 std::cerr << "Error opening video file" << std::endl;
 return -1;
 }
 // Get the frame rate of the video
 double fps = cap.get(cv::CAP_PROP_FPS);
 int delay = static_cast<int>(1000 / fps);
 // Create a window to display the video
 cv::namedWindow("Processed Video", cv::WINDOW_NORMAL);
 cv::Mat frame;
 LookupTable lut;
 int frame_count = 0;
 int recompute_interval = 5; // Recompute every 5 frames
 ParallelExecutor executor(24); // Assuming 24 threads
 while (true) {
 cap >> frame;
 if (frame.empty()) {
 break;
 }
 // Determine whether to use the lut or recompute the data
 bool use_cache = (frame_count % recompute_interval != 0);
 // Process the frame using cached or recomputed parameters
 hist(frame, lut, use_cache, executor);
 // Display the processed frame
 cv::imshow("Processed Video", frame);
 // Break the loop if 'q' is pressed
 if (cv::waitKey(delay) == 'q') {
 break;
 }
 frame_count++;
 }
 cap.release();
 cv::destroyAllWindows();
 return 0;
}
answered Jul 29, 2024 at 19:04
\$\endgroup\$
