The problem is given image in 32 Bit Floating Point Format (float
) how to convert it to UINT8 (char
) or UNIT16 (unsigned short
) in the most efficient way.
Assumptions:
- Image is very long, namely don't be troubled with edge cases.
- One can not assume the image is in any given range.
- Output data must be clipped into the proper range ([0, 255], [0, 2 ^ 16 - 1]). I think the term is to be saturated.
- Data should be rounded to nearest integer.
- One could use SSE4 operations (Namely, everything but AVX).
I'd rather have a code in C style.
But C++ is also OK (I will translate it on my own).
I tried this simple code for the UINT8 case:
void ConvertToUint8(char* mO, float* mF, float* mB, int numRows, int numCols, int numColsPad, float scalingFctr)
{
// mB - Background, mF - Foreground
int ii, jj;
__m128 floatPx;
__m128 scalingFactor;
__m128i uint8Px;
scalingFactor = _mm_set1_ps(scalingFctr);
#pragma omp parallel for private(jj, floatPx, scalingFactor, uint8Px)
for (ii = 0; ii < numRows; ii++) {
for (jj = 0; jj < numCols; jj += SSE_STRIDE) {
floatPx = _mm_loadu_ps(&mB[(ii * numColsPad) + jj]);
uint8Px = _mm_cvtps_epi32(_mm_mul_ps(floatPx, scalingFactor));
uint8Px = _mm_packus_epi32(uint8Px, uint8Px);
uint8Px = _mm_packus_epi16(uint8Px, uint8Px);
*(int *) (mO[(ii * numColsPad) + jj]) = _mm_cvtsi128_si32(uint8Px);
}
}
}
The code above is based on StackOverflow Q29856006 - SSE intrinsics: Convert 32-bit floats to UNSIGNED 8-bit integers.
But I'm sure there are better ways (For instance, why not load 4 packed __m128
and then store 32 Pixels at once)?
3 Answers 3
why not load 4 packed __m128 and then store 32 Pixels at once
I think that's 16 pixels, but it's a good plan. The pack instructions were used inefficiently (in the linked question that was not really an issue) and that would be improved by processing more elements simultaneously. For example, something like this:
#define SSE_STRIDE 16
void ConvertToUint8(char* mO, float* mF, float* mB, unsigned numRows, unsigned numCols, unsigned numColsPad, float scalingFctr)
{
unsigned ii, jj;
__m128 a, b, c, d;
__m128i ai, bi, ci, di, p0, p1;
__m128 scalingFactor = _mm_set1_ps(scalingFctr);
float *inptr;
__m128i *outptr;
#pragma omp parallel for private(jj, a, b, c, d, ai, bi, ci, di, p0, p1, inptr, outptr)
for (ii = 0; ii < numRows; ii++) {
inptr = &mB[ii * numColsPad];
outptr = (__m128i*)&mO[ii * numColsPad];
for (jj = 0; jj < numCols; jj += SSE_STRIDE) {
a = _mm_loadu_ps(inptr);
b = _mm_loadu_ps(inptr + 4);
c = _mm_loadu_ps(inptr + 8);
d = _mm_loadu_ps(inptr + 12);
inptr += SSE_STRIDE;
ai = _mm_cvtps_epi32(_mm_mul_ps(a, scalingFactor));
bi = _mm_cvtps_epi32(_mm_mul_ps(b, scalingFactor));
ci = _mm_cvtps_epi32(_mm_mul_ps(c, scalingFactor));
di = _mm_cvtps_epi32(_mm_mul_ps(d, scalingFactor));
p0 = _mm_packs_epi32(ai, bi),
p1 = _mm_packs_epi32(ci, di);
p0 = _mm_packus_epi16(p0, p1);
_mm_storeu_si128(outptr++, p0);
}
}
}
I've also changed some thing to unsigned
, because some useless sign-extension was going on sometimes (depending on how it was compiled). inptr
and outptr
are now calculated outside the inner loop, normally we can expect the compiler to be pretty clever about that too but compiling with OpenMP support seemed to make Clang less clever, it put the actual multiplication between ii
and numColsPad
inside the inner loop. That's not good, because on Intel that imul
would go to p1, and p1 is already completely packed with the float-to-int conversions and fp-multiplications (eg on Skylake both of those go to p01). That loop should be able to run at 1 iteration every 4 cycles in the best case on most architectures, but with that extra imul
in there that wouldn't be possible.
To convert to uint16
something similar can be used, but with _mm_packus_epi32
and no packing to bytes.
-
\$\begingroup\$ In the code above
SSE_STRIDE
is 4. Isn'tinptr += 16
should be the way to move forward the pointer? Or maybe I miss something with the way you update the pointers. \$\endgroup\$Royi– Royi2017年10月22日 18:12:15 +00:00Commented Oct 22, 2017 at 18:12 -
\$\begingroup\$ @Royi oh I redefined it to be 16, should I not have done that? I thought you would use that elsewhere to determine the range of this loop \$\endgroup\$user555045– user5550452017年10月22日 18:14:07 +00:00Commented Oct 22, 2017 at 18:14
-
\$\begingroup\$ No problem. I just wanted to know the value you assume it to be. As I forgot to mention it. \$\endgroup\$Royi– Royi2017年10月22日 18:17:27 +00:00Commented Oct 22, 2017 at 18:17
-
\$\begingroup\$ Harold, Will your code round towards zero (
fix
) or to the nearest integer? How could I choose the rounding mode? Thank You. \$\endgroup\$Royi– Royi2017年10月22日 19:06:37 +00:00Commented Oct 22, 2017 at 19:06 -
\$\begingroup\$ @Royi with the usual MXCSR settings,
_mm_cvtps_epi32
rounds to nearest (this can be changed with_mm_setcsr
and should be restored afterwards)._mm_cvttps_epi32
truncates. \$\endgroup\$user555045– user5550452017年10月22日 19:12:59 +00:00Commented Oct 22, 2017 at 19:12
Get Rid of Unused Arguments
There are arguments passed to this function that are not used. It appears that the foreground array mF
doesn't get read or modified, so remove it from the list of arguments as it's just confusing.
Naming
The names of your function's arguments are extremely cryptic. mO
is the output buffer? If so, name it outputBuffer
or outputImage
or whatever's appropriate.
Whenever you see something like this in your code:
// mB - Background, mF - Foreground
you know you've failed to name your variables properly. They should just be background
and foreground
. (Also background and foreground of what? Was this originally a compositing operation or something?)
Also don't do this:
scalingFactor = _mm_set1_ps(scalingFctr);
How is anyone reading the code supposed to tell the difference between scalingFactor
and scalingFctr
? There's nothing about either name that distinguishes its use or difference to someone reading the code. (And remember, you may have to read this code after months of not looking at it!)
Performance
I would try doing this operation on the GPU. It was built to do this exact sort of thing. In OpenGL you could simply upload the float image as a texture, then draw to a texture-backed FBO (Frame Buffer Object) where the backing texture is an 8-bit per channel image. You'd simply draw a textured quad with the desired texture applied.
The reason I recommend this approach is that while using SIMD instructions will improve performance by 2-16 times, and using multiple threads will further improve performance, you're unlikely to be able to beat a GPU running hundreds to thousands of cores.
-
\$\begingroup\$ Hi, First I fixed the
mF
andmB
. It should bemI
. In code I write (Many small functions for basic image processing) I use the same convention. The prefixm
stands for Matrix andO
andI
are output / input. Regarding GPU, moving the data to the GPU and getting it back will take more time than the operation itself. \$\endgroup\$Royi– Royi2017年10月22日 05:51:47 +00:00Commented Oct 22, 2017 at 5:51 -
\$\begingroup\$ Broadcasting a scalar to a vector and needing a different variable name for the vector is a chronic problem for manual vectorization. At least if you use the wrong one, your code won't compile. If I don't need the scalar variable inside the function, I sometimes call it
foo_scalar
in the arg list. The OP certainly makes it worse by declaring the__m128
variable separately from initializing it, though. (Related:numColsPad
is a poor name for the row stride, like I said in an answer to a previous question from the OP \$\endgroup\$Peter Cordes– Peter Cordes2017年10月22日 15:55:09 +00:00Commented Oct 22, 2017 at 15:55 -
\$\begingroup\$ @Royi Regarding the GPU - first don't assume that. Test it and see if it's actually the case. Second, chances are you are going to do more work than just converting an image from
float
toUInt8
orUInt16
, anyway. (Given the other questions you've asked here, it certainly seems that way.) If you're doing more work, then the huge number of cores on the GPU will almost certainly (speaking from experience) be a win. But as always, the only way to know for sure is to actually implement it and profile it. \$\endgroup\$user1118321– user11183212017年10月22日 16:27:02 +00:00Commented Oct 22, 2017 at 16:27
The code I wrote at the end, based on harold answer is as following:
void ConvertToUint8(char* mO, float* mI, int numRows, int numCols, int numColsPad, float scalingFctr)
{
int ii, jj, numColsQuadPack;
float *ptrInputImage;
int *ptrOutputImage;
__m128 floatPx1, floatPx2, floatPx3, floatPx4;
__m128 scalingFactor;
__m128i uint16Px1, uint16Px2, uint16Px3, uint16Px4;
__m128i uint8Px1, uint8Px2;
__m128i *ptrOutputImageSse;
numColsQuadPack = numCols - (numCols % SSE_STRIDE_QUAD);
scalingFactor = _mm_set1_ps(scalingFctr);
#pragma omp parallel for private(jj, ptrInputImage, ptrOutputImage, floatPx1, floatPx2, floatPx3, floatPx4, uint16Px1, uint16Px2, uint16Px3, uint16Px4, uint8Px1, uint8Px2)
for (ii = 0; ii < numRows; ii++) {
ptrInputImage = &mI[ii * numColsPad];
ptrOutputImageSse = (__m128i*)(&mO[ii * numColsPad]);
for (jj = 0; jj < numColsQuadPack; jj += SSE_STRIDE_QUAD) {
// SSE Pack is 4 Floats (4 * 32 Byte) -> 16 UINT8 (16 * 1 Byte)
// Hence loading 16 Floats which will be converted into 16 UINT8
floatPx1 = _mm_loadu_ps(ptrInputImage);
floatPx2 = _mm_loadu_ps(ptrInputImage + SSE_STRIDE);
floatPx3 = _mm_loadu_ps(ptrInputImage + SSE_STRIDE_DOUBLE);
floatPx4 = _mm_loadu_ps(ptrInputImage + SSE_STRIDE_TRIPLE);
ptrInputImage += SSE_STRIDE_QUAD;
// _mm_cvtps_epi32 - Rounds to nearest integer
// _mm_cvttps_epi32 - Truncates (Rounding towards zero)
uint16Px1 = _mm_cvtps_epi32(_mm_mul_ps(floatPx1, scalingFactor)); // Converts the 4 SP FP values of a to 4 Signed Integers (32 Bit).
uint16Px2 = _mm_cvtps_epi32(_mm_mul_ps(floatPx2, scalingFactor));
uint16Px3 = _mm_cvtps_epi32(_mm_mul_ps(floatPx3, scalingFactor));
uint16Px4 = _mm_cvtps_epi32(_mm_mul_ps(floatPx4, scalingFactor));
// See Intel Miscellaneous Intrinsics (https://software.intel.com/en-us/node/695374)
uint8Px1 = _mm_packs_epi32(uint16Px1, uint16Px2); // Saturating and packing 2 of 4 Integers into 8 of INT16
uint8Px2 = _mm_packs_epi32(uint16Px3, uint16Px4); // Saturating and packing 2 of 4 Integers into 8 of INT16
uint8Px1 = _mm_packus_epi16(uint8Px1, uint8Px2); // Saturating and packing 2 of 8 INT16 into 16 of UINT8
_mm_storeu_si128(ptrOutputImageSse++, uint8Px1); // Storing 16 UINT8, Promoting the pointer
}
ptrOutputImage = (int*)(&mO[(ii * numColsPad) + numColsQuadPack]);
for (jj = numColsQuadPack; jj < numCols; jj += SSE_STRIDE) {
floatPx1 = _mm_loadu_ps(ptrInputImage);
ptrInputImage += SSE_STRIDE;
uint16Px1 = _mm_cvtps_epi32(_mm_mul_ps(floatPx1, scalingFactor));
uint8Px1 = _mm_packs_epi32(uint16Px1, uint16Px1);
uint8Px1 = _mm_packus_epi16(uint8Px1, uint8Px1);
*ptrOutputImage = _mm_cvtsi128_si32(uint8Px1);
ptrOutputImage++;
}
}
}
The variable scalingFactor
must be out of the private()
declaration of the OpenMP pragma.
I don't know why, but this is how it works best.
If you find a way to improve this or make it more efficient, feel free to edit.
-
\$\begingroup\$ I think I fixed it. \$\endgroup\$Royi– Royi2017年10月22日 20:26:57 +00:00Commented Oct 22, 2017 at 20:26
-
\$\begingroup\$ It's still incremented as a pointer to
__m128i
which is bad if there is more than one tail iteration, it's probably best to just make anint*
variable for that tail loop \$\endgroup\$user555045– user5550452017年10月22日 20:27:59 +00:00Commented Oct 22, 2017 at 20:27 -
\$\begingroup\$ Can I make something else to make its jump appropriate instead of a new pointer? \$\endgroup\$Royi– Royi2017年10月22日 20:32:43 +00:00Commented Oct 22, 2017 at 20:32
-
\$\begingroup\$ You could access the output through
mO[ii * numColsPad + jj]
again \$\endgroup\$user555045– user5550452017年10月22日 20:39:38 +00:00Commented Oct 22, 2017 at 20:39 -
\$\begingroup\$ OK, I updated it and it seems to work. Harold, thank you for the patience and assistance. \$\endgroup\$Royi– Royi2017年10月22日 21:19:38 +00:00Commented Oct 22, 2017 at 21:19
Explore related questions
See similar questions with these tags.
_mm_packus_epi32
first probably. \$\endgroup\$_mm_packus_epi16
still treats its input as signed, so the first step in a 2-step pack should bepackss
to create input forpackus
. And yes, of course you should be using eachpack
with 2 different inputs. \$\endgroup\$mF
andmB
. It doesn't change the validity of any answer. It was done because it made user1118321 confused. \$\endgroup\$