7
\$\begingroup\$

This is a follow-up question for An Updated Multi-dimensional Image Data Structure with Variadic Template Functions in C++. I implemented histogram template function to get histogram of an image in this post.

The experimental implementation

  • histogram template function implementation

    // histogram template function implementation
    template<class ElementT = std::uint8_t>
    constexpr static auto histogram(const Image<ElementT>& input)
    {
     std::array<std::size_t, std::numeric_limits<ElementT>::max() - std::numeric_limits<ElementT>::lowest() + 1> histogram_output;
     std::fill(std::begin(histogram_output), std::end(histogram_output), std::size_t{ 0 });
     auto image_data = input.getImageData();
     for (std::size_t i = 0; i < image_data.size(); ++i)
     {
     ++histogram_output[image_data[i]];
     }
     return histogram_output;
    }
    
  • rgb2hsv template function implementation

    // rgb2hsv template function implementation
    template<typename ElementT, typename OutputT = HSV>
    requires (std::same_as<ElementT, RGB> || std::same_as<ElementT, RGB_DOUBLE>)
    constexpr static auto rgb2hsv(const Image<ElementT>& input)
    {
     return pixelwiseOperation([](ElementT input) { return rgb2hsv(input); }, input);
    }
    // rgb2hsv template function implementation
    template<class ExPo, typename ElementT, typename OutputT = HSV>
    requires (std::same_as<ElementT, RGB> || std::same_as<ElementT, RGB_DOUBLE>) && 
    (std::is_execution_policy_v<std::remove_cvref_t<ExPo>>)
    constexpr static auto rgb2hsv(ExPo execution_policy, const Image<ElementT>& input)
    {
     return pixelwiseOperation(execution_policy, [](ElementT input) { return rgb2hsv(input); }, input);
    }
    

The usage of histogram template function:

/* Developed by Jimmy Hu */
#include <chrono>
#include "../base_types.h"
#include "../basic_functions.h"
#include "../image.h"
#include "../image_io.h"
#include "../image_operations.h"
template<class ExPo, class ElementT>
requires (std::is_execution_policy_v<std::remove_cvref_t<ExPo>>)
constexpr static auto HistogramTest(
 ExPo execution_policy,
 const TinyDIP::Image<ElementT>& input,
 std::ostream& os = std::cout
)
{
 auto hsv_image = TinyDIP::rgb2hsv(execution_policy, input);
 auto grayscale_image = TinyDIP::im2uint8(TinyDIP::getVplane(hsv_image));
 auto start1 = std::chrono::system_clock::now();
 auto histogram_result1 = TinyDIP::histogram(grayscale_image);
 auto end1 = std::chrono::system_clock::now();
 std::chrono::duration<double> elapsed_seconds1 = end1 - start1;
 os << "elapsed time: " << elapsed_seconds1.count() << '\n';
 return histogram_result1;
}
int main()
{
 auto start = std::chrono::system_clock::now();
 std::string image_filename = "1.bmp";
 auto image_input = TinyDIP::bmp_read(image_filename.c_str(), true);
 image_input = TinyDIP::copyResizeBicubic(image_input, 3 * image_input.getWidth(), 3 * image_input.getHeight());
 auto histogram_result = HistogramTest(std::execution::par, image_input);
 for (std::size_t i = 0; i < histogram_result.size(); i++)
 {
 std::cout << i << " count: " << histogram_result[i] << " / " << image_input.count() << '\n';
 }
 std::cout << "Sum = " << TinyDIP::recursive_reduce(histogram_result, std::size_t{ 0 }) << '\n';
 auto end = std::chrono::system_clock::now();
 std::chrono::duration<double> elapsed_seconds = end - start;
 std::time_t end_time = std::chrono::system_clock::to_time_t(end);
 std::cout << "Computation finished at " << std::ctime(&end_time) << "elapsed time: " << elapsed_seconds.count() << '\n';
 return EXIT_SUCCESS;
}

The output of the test code above:

Width of the input image: 454
Height of the input image: 341
Size of the input image(Byte): 464442
elapsed time: 0.0007759
0 count: 1966 / 1393326
1 count: 65 / 1393326
2 count: 72 / 1393326
3 count: 75 / 1393326
4 count: 71 / 1393326
5 count: 49 / 1393326
6 count: 67 / 1393326
7 count: 52 / 1393326
8 count: 78 / 1393326
9 count: 58 / 1393326
10 count: 60 / 1393326
11 count: 58 / 1393326
12 count: 66 / 1393326
13 count: 75 / 1393326
14 count: 86 / 1393326
15 count: 98 / 1393326
16 count: 105 / 1393326
17 count: 139 / 1393326
18 count: 301 / 1393326
19 count: 554 / 1393326
20 count: 1052 / 1393326
21 count: 1551 / 1393326
22 count: 2407 / 1393326
23 count: 3593 / 1393326
24 count: 5618 / 1393326
25 count: 8144 / 1393326
26 count: 10541 / 1393326
27 count: 12335 / 1393326
28 count: 14096 / 1393326
29 count: 15881 / 1393326
30 count: 16950 / 1393326
31 count: 17881 / 1393326
32 count: 18950 / 1393326
33 count: 19432 / 1393326
34 count: 20085 / 1393326
35 count: 20687 / 1393326
36 count: 21367 / 1393326
37 count: 21758 / 1393326
38 count: 22450 / 1393326
39 count: 23288 / 1393326
40 count: 23501 / 1393326
41 count: 23818 / 1393326
42 count: 24173 / 1393326
43 count: 23949 / 1393326
44 count: 23445 / 1393326
45 count: 22823 / 1393326
46 count: 22218 / 1393326
47 count: 21807 / 1393326
48 count: 21137 / 1393326
49 count: 20491 / 1393326
50 count: 19820 / 1393326
51 count: 19644 / 1393326
52 count: 19148 / 1393326
53 count: 18567 / 1393326
54 count: 18190 / 1393326
55 count: 17536 / 1393326
56 count: 16935 / 1393326
57 count: 16640 / 1393326
58 count: 15821 / 1393326
59 count: 15337 / 1393326
60 count: 14813 / 1393326
61 count: 14085 / 1393326
62 count: 13722 / 1393326
63 count: 13101 / 1393326
64 count: 12752 / 1393326
65 count: 12188 / 1393326
66 count: 11973 / 1393326
67 count: 11385 / 1393326
68 count: 11036 / 1393326
69 count: 10681 / 1393326
70 count: 10205 / 1393326
71 count: 9761 / 1393326
72 count: 9574 / 1393326
73 count: 9162 / 1393326
74 count: 9056 / 1393326
75 count: 8700 / 1393326
76 count: 8270 / 1393326
77 count: 8096 / 1393326
78 count: 7671 / 1393326
79 count: 7575 / 1393326
80 count: 7366 / 1393326
81 count: 7262 / 1393326
82 count: 7011 / 1393326
83 count: 6692 / 1393326
84 count: 6533 / 1393326
85 count: 6459 / 1393326
86 count: 5996 / 1393326
87 count: 5783 / 1393326
88 count: 5764 / 1393326
89 count: 5554 / 1393326
90 count: 5413 / 1393326
91 count: 5310 / 1393326
92 count: 5069 / 1393326
93 count: 5021 / 1393326
94 count: 5026 / 1393326
95 count: 4816 / 1393326
96 count: 4788 / 1393326
97 count: 4561 / 1393326
98 count: 4528 / 1393326
99 count: 4296 / 1393326
100 count: 4308 / 1393326
101 count: 4316 / 1393326
102 count: 4123 / 1393326
103 count: 4023 / 1393326
104 count: 3850 / 1393326
105 count: 3873 / 1393326
106 count: 3731 / 1393326
107 count: 3737 / 1393326
108 count: 3685 / 1393326
109 count: 3566 / 1393326
110 count: 3505 / 1393326
111 count: 3462 / 1393326
112 count: 3383 / 1393326
113 count: 3350 / 1393326
114 count: 3264 / 1393326
115 count: 3204 / 1393326
116 count: 3216 / 1393326
117 count: 3187 / 1393326
118 count: 3191 / 1393326
119 count: 3163 / 1393326
120 count: 2903 / 1393326
121 count: 3040 / 1393326
122 count: 2956 / 1393326
123 count: 2899 / 1393326
124 count: 2802 / 1393326
125 count: 2916 / 1393326
126 count: 2833 / 1393326
127 count: 2745 / 1393326
128 count: 2918 / 1393326
129 count: 2843 / 1393326
130 count: 2795 / 1393326
131 count: 2718 / 1393326
132 count: 2758 / 1393326
133 count: 2781 / 1393326
134 count: 2821 / 1393326
135 count: 2755 / 1393326
136 count: 2763 / 1393326
137 count: 2739 / 1393326
138 count: 2770 / 1393326
139 count: 2705 / 1393326
140 count: 2812 / 1393326
141 count: 2733 / 1393326
142 count: 2766 / 1393326
143 count: 2792 / 1393326
144 count: 2790 / 1393326
145 count: 2813 / 1393326
146 count: 2674 / 1393326
147 count: 2655 / 1393326
148 count: 2544 / 1393326
149 count: 2562 / 1393326
150 count: 2542 / 1393326
151 count: 2668 / 1393326
152 count: 2586 / 1393326
153 count: 2521 / 1393326
154 count: 2525 / 1393326
155 count: 2497 / 1393326
156 count: 2503 / 1393326
157 count: 2513 / 1393326
158 count: 2517 / 1393326
159 count: 2406 / 1393326
160 count: 2526 / 1393326
161 count: 2571 / 1393326
162 count: 2531 / 1393326
163 count: 2504 / 1393326
164 count: 2536 / 1393326
165 count: 2454 / 1393326
166 count: 2529 / 1393326
167 count: 2651 / 1393326
168 count: 2530 / 1393326
169 count: 2519 / 1393326
170 count: 2522 / 1393326
171 count: 2600 / 1393326
172 count: 2614 / 1393326
173 count: 2603 / 1393326
174 count: 2748 / 1393326
175 count: 2711 / 1393326
176 count: 2783 / 1393326
177 count: 2877 / 1393326
178 count: 2921 / 1393326
179 count: 2909 / 1393326
180 count: 2814 / 1393326
181 count: 2747 / 1393326
182 count: 2648 / 1393326
183 count: 2615 / 1393326
184 count: 2510 / 1393326
185 count: 2526 / 1393326
186 count: 2363 / 1393326
187 count: 2284 / 1393326
188 count: 2036 / 1393326
189 count: 2022 / 1393326
190 count: 1908 / 1393326
191 count: 1912 / 1393326
192 count: 1920 / 1393326
193 count: 1818 / 1393326
194 count: 1896 / 1393326
195 count: 1977 / 1393326
196 count: 2016 / 1393326
197 count: 2030 / 1393326
198 count: 2004 / 1393326
199 count: 1855 / 1393326
200 count: 1660 / 1393326
201 count: 1485 / 1393326
202 count: 1395 / 1393326
203 count: 1314 / 1393326
204 count: 1278 / 1393326
205 count: 1064 / 1393326
206 count: 899 / 1393326
207 count: 850 / 1393326
208 count: 847 / 1393326
209 count: 824 / 1393326
210 count: 834 / 1393326
211 count: 838 / 1393326
212 count: 839 / 1393326
213 count: 833 / 1393326
214 count: 913 / 1393326
215 count: 844 / 1393326
216 count: 807 / 1393326
217 count: 827 / 1393326
218 count: 837 / 1393326
219 count: 846 / 1393326
220 count: 889 / 1393326
221 count: 863 / 1393326
222 count: 871 / 1393326
223 count: 931 / 1393326
224 count: 909 / 1393326
225 count: 924 / 1393326
226 count: 950 / 1393326
227 count: 878 / 1393326
228 count: 895 / 1393326
229 count: 833 / 1393326
230 count: 909 / 1393326
231 count: 959 / 1393326
232 count: 870 / 1393326
233 count: 943 / 1393326
234 count: 915 / 1393326
235 count: 870 / 1393326
236 count: 897 / 1393326
237 count: 986 / 1393326
238 count: 925 / 1393326
239 count: 1003 / 1393326
240 count: 999 / 1393326
241 count: 1016 / 1393326
242 count: 1003 / 1393326
243 count: 1032 / 1393326
244 count: 1067 / 1393326
245 count: 1062 / 1393326
246 count: 1148 / 1393326
247 count: 1188 / 1393326
248 count: 1311 / 1393326
249 count: 1391 / 1393326
250 count: 1522 / 1393326
251 count: 1704 / 1393326
252 count: 2114 / 1393326
253 count: 2766 / 1393326
254 count: 4315 / 1393326
255 count: 39663 / 1393326
Sum = 1393326
Computation finished at Thu Feb 20 14:09:02 2025
elapsed time: 0.701304

All suggestions are welcome.

TinyDIP on GitHub

All suggestions are welcome.

The summary information:

asked Feb 20 at 6:25
\$\endgroup\$
1
  • \$\begingroup\$ Both templates rgb2hsv call another function of the same name that is not in the post. \$\endgroup\$ Commented Feb 20 at 7:58

2 Answers 2

7
\$\begingroup\$

You create an array with

std::numeric_limits<ElementT>::max() - std::numeric_limits<ElementT>::lowest() + 1

elements, but then use the array as if std::numeric_limits<ElementT>::lowest() is 0. This obviously will index out of bounds if ElementT is a signed type and the image has negative values.

If ElementT is a 32-bit type, your histogram will be huge (and probably mostly unused). If it’s a 64-bit type you won’t be able to allocate the array. If it’s a floating-point type, you’ll be requesting a size using a floating-point type, which should be a compiler error (also a much larger value than what fits in size_t.

At least limit your template to 8 and 16-bit unsigned ints. Better would be to create another function for cases where the type is larger, where you decide on bin size and limits and compute which bin each value falls into.

I’m not sure if there ever is a situation where the input is known at compile time, and the compiler can compute the histogram. But it’s cool that a function like this can be constexpr!

answered Feb 20 at 8:11
\$\endgroup\$
7
\$\begingroup\$
std::fill(std::begin(histogram_output), std::end(histogram_output), std::size_t{ 0 });

A simpler solution is to aggregate-initialize the array (braces) instead of default-initializing it:

std::array<std::size_t, /* skipped to save space */> histogram_output{};
for (std::size_t i = 0; i < image_data.size(); ++i)
{
 ++histogram_output[image_data[i]];
}

This is the simplest histogram algorithm that isn't bad, so it has a good "performance to complexity" ratio. There are some techniques to do it faster, they will introduce more complexity, so whether that's worth it is a trade-off that you'd have to decide. Turbo-Histogram has collected many of those techniques.

A significant downside of the simple algorithm is as follows. If the same entry of the histogram is incremented again when it has recently been incremented already (as happens not very often for random data, but happens significantly often for various kinds of structured data including non-random images) then the second increment has a dependency on the first increment through the associated store and load. How bad that is depends on microarchitectural details of the CPU you're running the code on, but it generally isn't great and tends to result in significantly lower performance than independent increments.

The algorithms available in Turbo-Histogram do some different things:

  • Build multiple independent histograms, and sum them in the end. This gives the CPU some independent work to do even if the same value occurs multiple times in a short span.
  • Detect short runs of equal values and count them all in one go (in clever ways that make detection cheap, but that keep runs somewhat short, eg up to 8 bytes). This avoids the worst case of the simple algorithm and turns it into the best case: runs of equal values can be counted much faster than one-at-a-time. But, if runs of equal values are not common enough, trying to find them is not necessarily a win.

On top of that there are various optimizations such as replacing 16 individual byte loads with one 128-bit load (_mm_loadu_si128) and 16 extract-byte operations (_mm_extract_epi8), which variant is best depends a lot on the specific CPU microarchitecture: the costs of various operations on that microarchitecture and specific quirks. It's different between Intel Haswell and Intel Rocket Lake and so on.

Of these, the only technique that can easily be agnostic of ElementT is building multiple independent histograms and summing them in the end. Other techniques could be used by having a specialization for the (presumably most common) case that ElementT = std::uint8_t.

ElementT had better be an unsigned integral type and not too large, otherwise the whole thing breaks down. Perhaps that ought to be enforced.

answered Feb 20 at 7:45
\$\endgroup\$
2
  • \$\begingroup\$ I'd guess that on x86, 32-bit or 64-bit scalar loads and unpacks would be the way to go if you want to relieve pressure on the load / AGU ports. e.g. mov eax, [rdi] / movzx ecx, al / movzx edx, ah / shr eax, 16 and repeat. How you get a compiler to emit that, IDK. The load with memcpy sure, but then it's up to the optimizer to use movzx to read from AH instead of doing more shifts and ANDs. Reading AH has extra latency but still good throughput at least on SKL. \$\endgroup\$ Commented Feb 21 at 22:45
  • \$\begingroup\$ By comparison, vpextrb is 2 uops per scalar element; a shuffle (port 5) and an XMM->GPR uop (port 0 like movd). Similar on Zen 4: 2 uops with 1/clock throughput. (uops.info) \$\endgroup\$ Commented Feb 21 at 22:47

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.