I have to implement a simple program in SSE, and I don't know if I have done it in the right way (it's my first program in SSE).
This is the C++ program:
double distance (int x1, int x2, int y1, int y2)
{
double x = abs(x2 - x1);
double y = abs(y2 - y1);
double dist = sqrt(x*x + y*y);
return dist;
}
And this is the program in SSE instructions:
__m128d vx1 _mm_load_pd(x1); // load x1 and x2
__m128d vx2 = _mm_load_pd(x2);
__m128d vdiff = _mm_sub_pd(vx2, vx1); // x2 - x1
__m128d vnegdiff = mm_sub_pd(_mm_set1_pd(0.0), vdiff); // 0.0 - diff
__m128d vabsX = _mm_max_pd(vdiff, vnegdiff); // abs = max(diff, - diff)
__m128d vy1 = _mm_load_pd(y1);
__m128d vy2 = _mm_load_pd(y2);
__m128d vdiff2 = _mm_sub_pd(vy2, vy1);
__m128d vnegdiff2 = mm_sub_pd(_mm_set1_pd(0.0), vdiff2);
__m128d vabsY = _mm_max_pd(vdiff2, vnegdiff2);
__m128d absX = _mm_mul_pd(vabsX,vabsX); //x*x
__m128d absY = _mm_mul_pd(vabsY,vabsY); //y*y
__m128d suma = _mm_add_pd(absX, absY); // x^2 + y^2
__m128d square = _mm_sqrt_pd(suma); // square root x^2 + y^2
return square;
2 Answers 2
Your SSE code looks like a reasonably accurate translation of your original C, but that C code doesn't immediately jump out as the best way to do things. In particular, I don't see where you gain anything by taking the absolute value before you do your multiplication. Given that you're squaring the value immediately afterwards anyway, it would appear simpler to just leave it negative if that's what it happens to be. Lacking a specific reason to do otherwise, I'd start with something simpler, like this:
double x = x1 - x2;
double y = y1 - y2;
double dist = sqrt(x * x + y * y);
Then I'd convert that to SSE instructions.
Especially if you're concerned with speed, you might want to consider using an alternative to the Pythagorean theorem to compute the hypotenuse though. I posted some C for one possibility in a previous answer.
Without SSE
- There is no need to take the absolute value. If X is real, then X^2 is always positive.
- Consider using hypot(x,y) rather than sqrt(xx+yy). It is simpler and more accurate.
My suggestions:
double distanceSqrt(int x1, int x2, int y1, int y2)
{
double dx = x2 - x1;
double dy = y2 - y1;
return sqrt(dx*dx + dy*dy);
}
double distanceHypot(int x1, int x2, int y1, int y2)
{
return hypot(x2 - x1, y2 - y1);
}
SSE
- What version of SSE are you targeting?
Rather than trying to correct your SSE version, I will build a new version. First I want to point out a few errors to try and clear up some misconceptions. Let's look at your first line of code:
__m128d vx1 = _mm_load_pd(x1);
A double is 64-bits. A __m128d is 128-bits. A __m128d holds two doubles. _mm_load_pd() takes a pointer to an array of doubles and loads the first two into a __m128d. You defined x1 to be of type int in your distance() function. The pointer passed to _mm_load_pd() must be 16 byte aligned or your program will crash. Another snippet:
__m128d absX = _mm_mul_pd(vabsX,vabsX);
__m128d absY = _mm_mul_pd(vabsY,vabsY);
You only need to call _mm_mul_pd() once. A __m128 holds two doubles so _mm_mul_pd() actually performs two multiplications.
//a={a1,a2}
//b={b1,b2}
//{a1*b1,a2*b2}=_mm_mul_pd(a,b)
double distance(int x1, int x2, int y1, int y2)
{
__declspec(align(16)) double diff[2] = {x2 - x1, y2 - y1};
__m128d temp = _mm_load_pd(diff);
temp = _mm_mul_pd(temp, temp);
//...
}
In this example the variable temp holds diff[0]*diff[0] in the low 8 bytes and diff[1]*diff[1] in the high 8 bytes.
Let's build a working SSE version of the distance() function.
double distanceSSE(int x1, int x2, int y1, int y2)
{
The first thing we need to do load the integers into SSE variables. A _m128i can hold four 32-bit integers, but we only need two.
__m128i point1 = _mm_set_epi32(0,0,x1,y1);
__m128i point2 = _mm_set_epi32(0,0,x2,y2);
We eventually need to convert these integers to doubles. We will wait because integer math is easier than double math. Also, the next operation is subtraction and there is no loss of information by sticking with integers. Find the distance on each plane.
__m128i diff = _mm_sub_epi32(point2,point1);
Next is multiplication which we can do with _mm_mul_epi32() without loss of information, but that creates two 64-bit integers. Converting 64-bit integers to doubles is harder than 32-bit integers to doubles so we will convert them now.
__m128d temp = _mm_cvtepi32_pd(diff);
The variable temp now contains two doubles {y2-y1,x2-x1}. Let's square them.
temp = _mm_mul_pd(temp,temp);
Now we need to add the two doubles in temp together before we take the square root. This time we use a horizontal add. We need a second argument so we just add a dummy.
temp = _mm_hadd_pd(temp,_mm_setzero_pd());
The square root.
temp = _mm_sqrt_pd(temp);
Finally, we copy the low double from temp into a double.
double ret;
_mm_storel_pd(&ret,temp);
return ret;
}
Put it all together:
double distanceSSE(int x1, int x2, int y1, int y2)
{
__m128i point1 = _mm_set_epi32(0, 0, x1, y1);
__m128i point2 = _mm_set_epi32(0, 0, x2, y2);
__m128i diff = _mm_sub_epi32(point2, point1);
__m128d temp = _mm_cvtepi32_pd(diff);
temp = _mm_mul_pd(temp, temp);
temp = _mm_hadd_pd(temp, _mm_setzero_pd());
temp = _mm_sqrt_pd(temp);
double ret;
_mm_storel_pd(&ret, temp);
return ret;
}
This is a simple method to find the distance but not necessarily the best. For more information on the drawbacks of this method see Hypot.