The software that I develop uses large floating-point arrays up to the maximum size that can be allocated in C#. I have a large number of algorithms such as convolutions and filters that get executed over those large arrays. I am currently updating as many algorithms as possible to fully threaded and vectorized.
By utilizing the System.Numerics.Vector<T> methods, I am typically seeing a 300%+ performance improvement in many of the algorithms on computers equipped with AVX (where Vector<float>.Count returns 4) and a 600%+ performance improvement in many of the algorithms on computers equipped with AVX2 (where Vector<float>.Count returns 8).
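Because the vector width is decided at run time, it can help to log it on each test machine before comparing timings. A minimal sketch, using only Vector.IsHardwareAccelerated and Vector<float>.Count from System.Numerics; the VectorInfo class and Describe method are hypothetical names made up for illustration:

```csharp
using System;
using System.Numerics;

public static class VectorInfo
{
    // Hypothetical helper: reports the SIMD configuration chosen at run time.
    public static string Describe()
    {
        int lanes = Vector<float>.Count;        // floats per Vector<float>
        int bits = lanes * sizeof(float) * 8;   // vector width in bits
        return $"accelerated={Vector.IsHardwareAccelerated}, lanes={lanes}, bits={bits}";
    }

    public static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```

If accelerated is reported as False, the Vector<T> operations fall back to software and the speedups described above should not be expected.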
The .NET Standard 2.1 documentation for System.Numerics.Vector<T> is here:
https://docs.microsoft.com/en-us/dotnet/api/system.numerics.vector-1?view=netstandard-2.1
One of the functions that I require on a number of the algorithms is to Clamp the array element value to a bounds Minimum or Maximum value after performing some mathematical operation on it. That is of course really easy to do with the single-threaded and multi-threaded algorithms that use standard arithmetic operations.
The issue that I am having is that System.Numerics.Vector<T> doesn't include any kind of Clamp method (Vector2, Vector3, and Vector4 do). So, for example, if I loop over a large array, modifying the array in Vector<float>.Count chunks, I need to clamp each vector result to a min and/or max bound prior to writing that vector-sized chunk back to the array.
I tried doing the Clamp in a loop on the array chunk data after the Vector operations, but the performance is abysmal. It is as slow or slower than simply doing the algorithm without vectorization.
Is there any way that I can conceivably improve the performance of this Clamp method?
This is typical code showing how I tried clamping. I fill the vector with a chunk of the array, perform some vector math, and write the chunk back to the array. This is all nice and speedy, but clamping the array chunk in a loop afterwards kills the vectorization performance advantage.
    int length = array.Length;
    int floatcount = System.Numerics.Vector<float>.Count;
    for (int i = 0; i < length; i += floatcount)
    {
        System.Numerics.Vector<float> arrayvector = new System.Numerics.Vector<float>(array, i);
        arrayvector = System.Numerics.Vector.Multiply<float>(arrayvector, 2.0f);
        // There may be different or multiple vector operations in here.
        arrayvector.CopyTo(array, i);
        // This is how I tried clamping the array data after the vector operation:
        for (int j = 0; j < floatcount; j++)
        {
            if (array[i + j] > maximimum) { array[i + j] = maximimum; }
        }
    }
I'm probably being myopic and missing something really simple. That's what months of 16-hour programming days gets you. ;) Thanks for any insight.
1 Answer
Have you tried something like this:
Create the following vector before the loop:
    System.Numerics.Vector<float> maxima = new System.Numerics.Vector<float>(maximimum);
Then, after the multiplication, call:
    arrayvector = System.Numerics.Vector.Min(arrayvector, maxima);
Reassigning to arrayvector should be fine here, because Vector.Min returns a new vector rather than modifying its arguments; use a separate variable if you prefer.
So all in all it ends up like:
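    int length = array.Length;
    int floatcount = System.Numerics.Vector<float>.Count;
    // Build the vector of upper bounds once, outside the loop.
    System.Numerics.Vector<float> maxima = new System.Numerics.Vector<float>(maximimum);
    // Note: as in the question, this assumes length is a multiple of floatcount;
    // otherwise the Vector<float>(array, i) constructor throws on the last chunk.
    for (int i = 0; i < length; i += floatcount)
    {
        System.Numerics.Vector<float> arrayvector = new System.Numerics.Vector<float>(array, i);
        arrayvector = System.Numerics.Vector.Multiply(arrayvector, 2.0f);
        // There may be different or multiple vector operations in here.
        // Clamp last: each lane becomes min(value, maximimum).
        arrayvector = System.Numerics.Vector.Min(arrayvector, maxima);
        arrayvector.CopyTo(array, i);
    }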
Disclaimer: I haven't tested the above, so don't hang me if it's not an improvement :-)
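For arrays whose length is not a multiple of Vector<float>.Count, the same idea needs a scalar loop for the leftover elements. A minimal sketch of that, not a tuned implementation: the class and method names are hypothetical, the factor 2.0f is taken from the question's example, and the scalar is broadcast into a vector once, outside the loop.

```csharp
using System;
using System.Numerics;

public static class ClampMaxSketch
{
    // Hypothetical helper: doubles every element, then caps it at 'maximum'.
    public static void ScaleAndClampMax(float[] array, float maximum)
    {
        int width = Vector<float>.Count;
        var factor = new Vector<float>(2.0f);     // broadcast once, outside the loop
        var maxima = new Vector<float>(maximum);  // broadcast once, outside the loop

        int i = 0;
        // Vectorized part: full chunks only, so the constructor never reads past the end.
        for (; i <= array.Length - width; i += width)
        {
            var v = new Vector<float>(array, i);
            v = Vector.Min(v * factor, maxima);   // lane-wise min(value * 2, maximum)
            v.CopyTo(array, i);
        }

        // Scalar tail for the remaining (array.Length % width) elements.
        for (; i < array.Length; i++)
        {
            array[i] = Math.Min(array[i] * 2.0f, maximum);
        }
    }

    public static void Main()
    {
        float[] data = { 1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f, 9f, 10f, 11f };
        ScaleAndClampMax(data, 10f);
        Console.WriteLine(string.Join(", ", data));
        // prints: 2, 4, 6, 8, 10, 10, 10, 10, 10, 10, 10
    }
}
```

The eleven-element test array is deliberately not a multiple of the usual vector widths, so both loops get exercised.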
- In addition to this, if you want to implement a double-ended clamp, you can use the min-max definition of a clamp: clamp(a, b, x) = max(a, min(b, x)), where a is the lower bound and b is the upper bound. – EvilTak, May 10, 2019 at 13:06
- YES! You are awesome! :) It works perfectly. I can't believe that I didn't think of that. I have only done one set of benchmark tests on one system, and I will do a lot more profiling before I settle on the code, but initial tests look like this method will work fine. – deegee, May 10, 2019 at 21:49
- On my first quick profiling test, I am iterating over a floating-point array of 67,108,864 items (268 MB) and only performing a few math functions in the loop. On an AVX-equipped system, release build, the standard single-threaded method takes 231310 ticks, while the vectorized method takes 104573 ticks. That is better than a doubling in performance. I will test it against the multi-threaded code and also on my AVX2 systems. AVX2 will hopefully be even better. I am running into memory bandwidth and cache performance issues at these speeds. :) – deegee, May 10, 2019 at 21:54
- There is one major issue that I have found in my work with the System.Numerics.Vectors methods: never use Multiply<T>(Vector<T>, T), because it is horribly slow. I don't know what they are doing in the code, but it is the worst-performing method I have tried. It is multiple times slower than a non-vectorized multiply for each float. So I always use Multiply<T>(Vector<T>, Vector<T>) instead. – deegee, May 10, 2019 at 21:59
- @deegee it probably creates a new vector every time, but I can't seem to find the source code for these things... – VisualMelon, May 11, 2019 at 10:05
- [...] Main function? In questions about performance it's good when we can actually run and test it ourselves with a profiler or something - it'd be easier to compare the results and see whether the suggested improvement really makes it better ;-]
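The double-ended clamp from the comments, clamp(a, b, x) = max(a, min(b, x)), can be sketched with Vector.Min and Vector.Max. This is a sketch under stated assumptions rather than a tested drop-in: the class and method names are hypothetical, minimum is assumed to be less than or equal to maximum, and both bounds are broadcast into vectors once, before the loop.

```csharp
using System;
using System.Numerics;

public static class ClampSketch
{
    // Hypothetical helper: lane-wise clamp(a, b, x) = max(a, min(b, x)).
    public static Vector<float> Clamp(Vector<float> x, Vector<float> lower, Vector<float> upper)
    {
        return Vector.Max(lower, Vector.Min(upper, x));
    }

    // Clamps every element of 'array' into [minimum, maximum]; assumes minimum <= maximum.
    public static void ClampInPlace(float[] array, float minimum, float maximum)
    {
        int width = Vector<float>.Count;
        var lower = new Vector<float>(minimum);   // broadcast once, outside the loop
        var upper = new Vector<float>(maximum);   // broadcast once, outside the loop

        int i = 0;
        // Vectorized part: full chunks only.
        for (; i <= array.Length - width; i += width)
        {
            Clamp(new Vector<float>(array, i), lower, upper).CopyTo(array, i);
        }

        // Scalar tail for the leftover elements.
        for (; i < array.Length; i++)
        {
            array[i] = Math.Max(minimum, Math.Min(maximum, array[i]));
        }
    }

    public static void Main()
    {
        float[] data = { -5f, -1f, 0f, 0.5f, 1f, 2f, 7f, 0.25f, -0.25f };
        ClampInPlace(data, 0f, 1f);
        Console.WriteLine(string.Join(", ", data));
    }
}
```

Broadcasting the scalar bounds into vectors ahead of the loop follows the same reasoning as preferring the Vector-by-Vector Multiply overload: the per-iteration work stays purely vector-to-vector.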