Converting Array of Floats to UINT8 (`char`) or UINT16 (`unsigned short`) Using SSE4

Question 1

The problem is given image in 32 Bit Floating Point Format (float) how to convert it to UINT8 (char) or UNIT16 (unsigned short) in the most efficient way.

Assumptions:

Image is very long, namely don't be troubled with edge cases.
One can not assume the image is in any given range.
Output data must be clipped into the proper range ([0, 255], [0, 2 ^ 16 - 1]). I think the term is to be saturated.
Data should be rounded to nearest integer.
One could use SSE4 operations (Namely, everything but AVX).

I'd rather have a code in C style.
But C++ is also OK (I will translate it on my own).

I tried this simple code for the UINT8 case:

void ConvertToUint8(char* mO, float* mF, float* mB, int numRows, int numCols, int numColsPad, float scalingFctr)
{
 // mB - Background, mF - Foreground
 int ii, jj;
 __m128 floatPx;
 __m128 scalingFactor;
 __m128i uint8Px;
 scalingFactor = _mm_set1_ps(scalingFctr);
#pragma omp parallel for private(jj, floatPx, scalingFactor, uint8Px)
 for (ii = 0; ii < numRows; ii++) {
 for (jj = 0; jj < numCols; jj += SSE_STRIDE) {
 floatPx = _mm_loadu_ps(&mB[(ii * numColsPad) + jj]);
 uint8Px = _mm_cvtps_epi32(_mm_mul_ps(floatPx, scalingFactor));
 uint8Px = _mm_packus_epi32(uint8Px, uint8Px);
 uint8Px = _mm_packus_epi16(uint8Px, uint8Px);
 *(int *) (mO[(ii * numColsPad) + jj]) = _mm_cvtsi128_si32(uint8Px);
 }
 }
}

The code above is based on StackOverflow Q29856006 - SSE intrinsics: Convert 32-bit floats to UNSIGNED 8-bit integers.

But I'm sure there are better ways (For instance, why not load 4 packed __m128 and then store 32 Pixels at once)?

Question 2

What should happen with the saturation? I shouldn't have put _mm_packus_epi32 first probably.

Question 3

@harold It should be limited in the range [0, 255] (Updated assumptions accordingly). Feel free to offer a different code. I'm pretty struggling with it. Thank You.

Question 4

_mm_packus_epi16 still treats its input as signed, so the first step in a 2-step pack should be packss to create input for packus. And yes, of course you should be using each pack with 2 different inputs.

Question 5

I have rolled back the last edit. Please do not add, remove, or edit code in a question after you've received an answer. The site policy is explained in What to do when someone answers.

Question 6

@MartinR, nothing changed in the code. I made mistake adding something not used mF and mB. It doesn't change the validity of any answer. It was done because it made user1118321 confused.

Question 7

why not load 4 packed __m128 and then store 32 Pixels at once

I think that's 16 pixels, but it's a good plan. The pack instructions were used inefficiently (in the linked question that was not really an issue) and that would be improved by processing more elements simultaneously. For example, something like this:

#define SSE_STRIDE 16
void ConvertToUint8(char* mO, float* mF, float* mB, unsigned numRows, unsigned numCols, unsigned numColsPad, float scalingFctr)
{
 unsigned ii, jj;
 __m128 a, b, c, d;
 __m128i ai, bi, ci, di, p0, p1;
 __m128 scalingFactor = _mm_set1_ps(scalingFctr);
 float *inptr;
 __m128i *outptr;
#pragma omp parallel for private(jj, a, b, c, d, ai, bi, ci, di, p0, p1, inptr, outptr)
 for (ii = 0; ii < numRows; ii++) {
 inptr = &mB[ii * numColsPad];
 outptr = (__m128i*)&mO[ii * numColsPad];
 for (jj = 0; jj < numCols; jj += SSE_STRIDE) {
 a = _mm_loadu_ps(inptr);
 b = _mm_loadu_ps(inptr + 4);
 c = _mm_loadu_ps(inptr + 8);
 d = _mm_loadu_ps(inptr + 12);
 inptr += SSE_STRIDE;
 ai = _mm_cvtps_epi32(_mm_mul_ps(a, scalingFactor));
 bi = _mm_cvtps_epi32(_mm_mul_ps(b, scalingFactor));
 ci = _mm_cvtps_epi32(_mm_mul_ps(c, scalingFactor));
 di = _mm_cvtps_epi32(_mm_mul_ps(d, scalingFactor));
 p0 = _mm_packs_epi32(ai, bi),
 p1 = _mm_packs_epi32(ci, di);
 p0 = _mm_packus_epi16(p0, p1);
 _mm_storeu_si128(outptr++, p0);
 }
 }
}

I've also changed some thing to unsigned, because some useless sign-extension was going on sometimes (depending on how it was compiled). inptr and outptr are now calculated outside the inner loop, normally we can expect the compiler to be pretty clever about that too but compiling with OpenMP support seemed to make Clang less clever, it put the actual multiplication between ii and numColsPad inside the inner loop. That's not good, because on Intel that imul would go to p1, and p1 is already completely packed with the float-to-int conversions and fp-multiplications (eg on Skylake both of those go to p01). That loop should be able to run at 1 iteration every 4 cycles in the best case on most architectures, but with that extra imul in there that wouldn't be possible.

To convert to uint16 something similar can be used, but with _mm_packus_epi32 and no packing to bytes.

Question 8

In the code above SSE_STRIDE is 4. Isn't inptr += 16 should be the way to move forward the pointer? Or maybe I miss something with the way you update the pointers.

Question 9

@Royi oh I redefined it to be 16, should I not have done that? I thought you would use that elsewhere to determine the range of this loop

Question 10

No problem. I just wanted to know the value you assume it to be. As I forgot to mention it.

Question 11

Harold, Will your code round towards zero (fix) or to the nearest integer? How could I choose the rounding mode? Thank You.

Question 12

@Royi with the usual MXCSR settings, _mm_cvtps_epi32 rounds to nearest (this can be changed with _mm_setcsr and should be restored afterwards). _mm_cvttps_epi32 truncates.

Question 13

Get Rid of Unused Arguments

There are arguments passed to this function that are not used. It appears that the foreground array mF doesn't get read or modified, so remove it from the list of arguments as it's just confusing.

Naming

The names of your function's arguments are extremely cryptic. mO is the output buffer? If so, name it outputBuffer or outputImage or whatever's appropriate.

Whenever you see something like this in your code:

// mB - Background, mF - Foreground

you know you've failed to name your variables properly. They should just be background and foreground. (Also background and foreground of what? Was this originally a compositing operation or something?)

Also don't do this:

scalingFactor = _mm_set1_ps(scalingFctr);

How is anyone reading the code supposed to tell the difference between scalingFactor and scalingFctr? There's nothing about either name that distinguishes its use or difference to someone reading the code. (And remember, you may have to read this code after months of not looking at it!)

Performance

I would try doing this operation on the GPU. It was built to do this exact sort of thing. In OpenGL you could simply upload the float image as a texture, then draw to a texture-backed FBO (Frame Buffer Object) where the backing texture is an 8-bit per channel image. You'd simply draw a textured quad with the desired texture applied.

The reason I recommend this approach is that while using SIMD instructions will improve performance by 2-16 times, and using multiple threads will further improve performance, you're unlikely to be able to beat a GPU running hundreds to thousands of cores.

Question 14

Hi, First I fixed the mF and mB. It should be mI. In code I write (Many small functions for basic image processing) I use the same convention. The prefix m stands for Matrix and O and I are output / input. Regarding GPU, moving the data to the GPU and getting it back will take more time than the operation itself.

Question 15

Broadcasting a scalar to a vector and needing a different variable name for the vector is a chronic problem for manual vectorization. At least if you use the wrong one, your code won't compile. If I don't need the scalar variable inside the function, I sometimes call it foo_scalar in the arg list. The OP certainly makes it worse by declaring the __m128 variable separately from initializing it, though. (Related: numColsPad is a poor name for the row stride, like I said in an answer to a previous question from the OP

Question 16

@Royi Regarding the GPU - first don't assume that. Test it and see if it's actually the case. Second, chances are you are going to do more work than just converting an image from float to UInt8 or UInt16, anyway. (Given the other questions you've asked here, it certainly seems that way.) If you're doing more work, then the huge number of cores on the GPU will almost certainly (speaking from experience) be a win. But as always, the only way to know for sure is to actually implement it and profile it.

Question 17

The code I wrote at the end, based on harold answer is as following:

void ConvertToUint8(char* mO, float* mI, int numRows, int numCols, int numColsPad, float scalingFctr)
{
 int ii, jj, numColsQuadPack;
 float *ptrInputImage;
 int *ptrOutputImage;
 __m128 floatPx1, floatPx2, floatPx3, floatPx4;
 __m128 scalingFactor;
 __m128i uint16Px1, uint16Px2, uint16Px3, uint16Px4;
 __m128i uint8Px1, uint8Px2;
 __m128i *ptrOutputImageSse;
 numColsQuadPack = numCols - (numCols % SSE_STRIDE_QUAD);
 scalingFactor = _mm_set1_ps(scalingFctr);
#pragma omp parallel for private(jj, ptrInputImage, ptrOutputImage, floatPx1, floatPx2, floatPx3, floatPx4, uint16Px1, uint16Px2, uint16Px3, uint16Px4, uint8Px1, uint8Px2)
 for (ii = 0; ii < numRows; ii++) {
 ptrInputImage = &mI[ii * numColsPad];
 ptrOutputImageSse = (__m128i*)(&mO[ii * numColsPad]);
 for (jj = 0; jj < numColsQuadPack; jj += SSE_STRIDE_QUAD) {
 // SSE Pack is 4 Floats (4 * 32 Byte) -> 16 UINT8 (16 * 1 Byte)
 // Hence loading 16 Floats which will be converted into 16 UINT8
 floatPx1 = _mm_loadu_ps(ptrInputImage);
 floatPx2 = _mm_loadu_ps(ptrInputImage + SSE_STRIDE);
 floatPx3 = _mm_loadu_ps(ptrInputImage + SSE_STRIDE_DOUBLE);
 floatPx4 = _mm_loadu_ps(ptrInputImage + SSE_STRIDE_TRIPLE);
 ptrInputImage += SSE_STRIDE_QUAD;
 // _mm_cvtps_epi32 - Rounds to nearest integer
 // _mm_cvttps_epi32 - Truncates (Rounding towards zero)
 uint16Px1 = _mm_cvtps_epi32(_mm_mul_ps(floatPx1, scalingFactor)); // Converts the 4 SP FP values of a to 4 Signed Integers (32 Bit).
 uint16Px2 = _mm_cvtps_epi32(_mm_mul_ps(floatPx2, scalingFactor));
 uint16Px3 = _mm_cvtps_epi32(_mm_mul_ps(floatPx3, scalingFactor));
 uint16Px4 = _mm_cvtps_epi32(_mm_mul_ps(floatPx4, scalingFactor));
 // See Intel Miscellaneous Intrinsics (https://software.intel.com/en-us/node/695374)
 uint8Px1 = _mm_packs_epi32(uint16Px1, uint16Px2); // Saturating and packing 2 of 4 Integers into 8 of INT16
 uint8Px2 = _mm_packs_epi32(uint16Px3, uint16Px4); // Saturating and packing 2 of 4 Integers into 8 of INT16
 uint8Px1 = _mm_packus_epi16(uint8Px1, uint8Px2); // Saturating and packing 2 of 8 INT16 into 16 of UINT8
 _mm_storeu_si128(ptrOutputImageSse++, uint8Px1); // Storing 16 UINT8, Promoting the pointer
 }
 ptrOutputImage = (int*)(&mO[(ii * numColsPad) + numColsQuadPack]);
 for (jj = numColsQuadPack; jj < numCols; jj += SSE_STRIDE) {
 floatPx1 = _mm_loadu_ps(ptrInputImage);
 ptrInputImage += SSE_STRIDE;
 uint16Px1 = _mm_cvtps_epi32(_mm_mul_ps(floatPx1, scalingFactor));
 uint8Px1 = _mm_packs_epi32(uint16Px1, uint16Px1);
 uint8Px1 = _mm_packus_epi16(uint8Px1, uint8Px1);
 *ptrOutputImage = _mm_cvtsi128_si32(uint8Px1);
 ptrOutputImage++;
 }
 }
}

The variable scalingFactor must be out of the private() declaration of the OpenMP pragma.
I don't know why, but this is how it works best.

If you find a way to improve this or make it more efficient, feel free to edit.

Question 18

I think I fixed it.

Question 19

It's still incremented as a pointer to __m128i which is bad if there is more than one tail iteration, it's probably best to just make an int* variable for that tail loop

Question 20

Can I make something else to make its jump appropriate instead of a new pointer?

Question 21

You could access the output through mO[ii * numColsPad + jj] again

Question 22

OK, I updated it and it seems to work. Harold, thank you for the patience and assistance.

user555045 12.7k1 gold badge19 silver badges39 bronze badges · Accepted Answer · 2017-10-21 21:57:09Z

why not load 4 packed __m128 and then store 32 Pixels at once

I think that's 16 pixels, but it's a good plan. The pack instructions were used inefficiently (in the linked question that was not really an issue) and that would be improved by processing more elements simultaneously. For example, something like this:

#define SSE_STRIDE 16
void ConvertToUint8(char* mO, float* mF, float* mB, unsigned numRows, unsigned numCols, unsigned numColsPad, float scalingFctr)
{
 unsigned ii, jj;
 __m128 a, b, c, d;
 __m128i ai, bi, ci, di, p0, p1;
 __m128 scalingFactor = _mm_set1_ps(scalingFctr);
 float *inptr;
 __m128i *outptr;
#pragma omp parallel for private(jj, a, b, c, d, ai, bi, ci, di, p0, p1, inptr, outptr)
 for (ii = 0; ii < numRows; ii++) {
 inptr = &mB[ii * numColsPad];
 outptr = (__m128i*)&mO[ii * numColsPad];
 for (jj = 0; jj < numCols; jj += SSE_STRIDE) {
 a = _mm_loadu_ps(inptr);
 b = _mm_loadu_ps(inptr + 4);
 c = _mm_loadu_ps(inptr + 8);
 d = _mm_loadu_ps(inptr + 12);
 inptr += SSE_STRIDE;
 ai = _mm_cvtps_epi32(_mm_mul_ps(a, scalingFactor));
 bi = _mm_cvtps_epi32(_mm_mul_ps(b, scalingFactor));
 ci = _mm_cvtps_epi32(_mm_mul_ps(c, scalingFactor));
 di = _mm_cvtps_epi32(_mm_mul_ps(d, scalingFactor));
 p0 = _mm_packs_epi32(ai, bi),
 p1 = _mm_packs_epi32(ci, di);
 p0 = _mm_packus_epi16(p0, p1);
 _mm_storeu_si128(outptr++, p0);
 }
 }
}

I've also changed some thing to unsigned, because some useless sign-extension was going on sometimes (depending on how it was compiled). inptr and outptr are now calculated outside the inner loop, normally we can expect the compiler to be pretty clever about that too but compiling with OpenMP support seemed to make Clang less clever, it put the actual multiplication between ii and numColsPad inside the inner loop. That's not good, because on Intel that imul would go to p1, and p1 is already completely packed with the float-to-int conversions and fp-multiplications (eg on Skylake both of those go to p01). That loop should be able to run at 1 iteration every 4 cycles in the best case on most architectures, but with that extra imul in there that wouldn't be possible.

To convert to uint16 something similar can be used, but with _mm_packus_epi32 and no packing to bytes.

In the code above SSE_STRIDE is 4. Isn't inptr += 16 should be the way to move forward the pointer? Or maybe I miss something with the way you update the pointers.
@Royi oh I redefined it to be 16, should I not have done that? I thought you would use that elsewhere to determine the range of this loop
No problem. I just wanted to know the value you assume it to be. As I forgot to mention it.
Harold, Will your code round towards zero (fix) or to the nearest integer? How could I choose the rounding mode? Thank You.
@Royi with the usual MXCSR settings, _mm_cvtps_epi32 rounds to nearest (this can be changed with _mm_setcsr and should be restored afterwards). _mm_cvttps_epi32 truncates.

Stack Exchange Network

Converting Array of Floats to UINT8 (`char`) or UINT16 (`unsigned short`) Using SSE4

3 Answers 3

Get Rid of Unused Arguments

Naming

Performance

You must log in to answer this question.

Linked

Hot Network Questions