For my job I was tasked with writing a faster version of Arduino's map. The requirements were only use addition and subtraction, since these are cheap operations on our board.
For reference, here is Arduino's implementation:
long map(long x, long in_min, long in_max, long out_min, long out_max) {
return (x - in_min) * (out_max - out_min) / (in_max - in_min) + out_min;
}
And here's what I wrote
long fasterMap(long value, long fromLow, long fromHigh, long toLow, long toHigh) {
long first = value - fromLow;
long second = toHigh - toLow;
long combined = 0;
while (second != 0) { // Avoid usage of * operator
combined = combined + first;
second = second - 1;
}
long third = fromHigh - fromLow;
long count = 0;
while (combined >= third) { // Avoid usage of / operator
combined = combined - third;
count = count + 1;
}
count = count + toLow;
return count;
}
Is there any more performance or micro-optimizations I can make to have this function calculate even faster? Thanks!
Note: Because this is running on an Arduino, answers in pure c++
are greatly encouraged.
-
3\$\begingroup\$ Does your implementation outperform the reference one? \$\endgroup\$vnp– vnp2021年01月08日 04:27:13 +00:00Commented Jan 8, 2021 at 4:27
4 Answers 4
Well, I don't speak c++ but checked the provided link to Arduino's map()
method and there it states
The function also handles negative numbers well, so that this example
y = map(x, 1, 50, 50, -100);
is also valid and works well.
which your method wouldn't handle well, because with having toHigh = -100
of the fasterMap()
method would result in some really long running while
loop because second
in long second = toHigh - toLow;
would result in long second = -100 - 50
and therfor while(second != 0)
would never evaluate to true.
The division is the real killer since the simple CPU doesn't have a divide instruction. Even in CPUs that do, it is very slow.
If the denominator is known at compile time, the compiler can transform it into a multiplication or otherwise use tricks for that specific value (e.g. powers of two are just a shift).
You want all the in/out parameters to be known at compile time, and only x
is the real run-time parameter.
This is done using templates.
template <long in_min, long in_max, long out_min, long out_max>
struct map {
long operator()(long x) const {
return (x - in_min) * (out_max - out_min) / (in_max - in_min) + out_min;
}
}
Now you can use it like this:
map<12,42, 0,100> pbar_mapper;
⋮
long output = pbar_mapper(input);
That is, let the compiler figure out the fast trick for multiplying and dividing by these specific constants. With optimizations turned on, it should be at least as fast as any hand-written tricks.
update
Apparently, the compiler for Arduino doesn't replace division with reciprocal multiplication. Only divisions where the denominator is a power of 2 are replaced with simple operations.
The template form does cause the compiler to pre-compute the subexpressions out_max-out_min
and in_max-in_min
. A big improvement in generated code was seen by using int
instead of long
. See: https://godbolt.org/z/PPYTWvqff
If int
is 16 bits, and you know the values you are interested in will fit in this range, that will be the best benefit. So define the template using int
and add a static_assert
that (in_max-in_min)*(out_max-out_min)
does not exceed the size of an int
.
Better yet, detect that case and automatically do the intermediate calculation using long
and let the compiler know when it's multiplying two int
to get a long
and when it's dividing a long
by an int
, and hopefully it generates simpler code for the multiply and calls a faster form of the divide?
If you can choose the output range to be a power of two, that would eliminate the division.
Hmm and I notice that the "max" values in the original code shown are apparently meant to be exclusive, or the math is a little off and it just seems to work OK if the input
value is equal to in_max
. Since it's not doing rounding but truncation, a little imprecision might not be noticed.
Anyway, a range of 128 would correspond to parameters of 0,128
or 1,129
. If you're mapping to a nice range to pass along to other code, a range that's a power of two would be perfectly natural.
-
\$\begingroup\$ Nice trick making use of the compiler optimisation which is replacing the division by a multiplication with the reciprocal. This optimisation is available for most platforms, unfortunately it is not available for Arduino. \$\endgroup\$stefan– stefan2021年11月05日 16:54:03 +00:00Commented Nov 5, 2021 at 16:54
-
\$\begingroup\$ @stefan Yipes, I wonder why not? Playing with Compiler Explorer, I see it only optimizes/inlines when dividing by powers of two. I see multiplication does have a built-in instruction, and the compiler replaces it for some values and I guess it knows when the built-in is faster (having only single-bit shifts slows it down so
x*15
is not replaced). So, I wonder if you could manually generate the equivalent reciprocal multiplication expression? I'm guessing that maybe the double-width arithmetic is slow and that's why it's not helpful. \$\endgroup\$JDługosz– JDługosz2021年11月05日 20:38:36 +00:00Commented Nov 5, 2021 at 20:38
Faster very much depends on the use case. In any case, measure! You may even implement different strategies depending on your parameters.
mapping completely generic over the whole long
range
do a recursive bisect instead of your plain loop
long bisect_map(long x, long in_min, long in_max, long out_min, long out_max) {
if (in_min > in_max)
return bisect_map(x, in_max, in_min, out_max, out_min);
if (out_min == out_max)
return out_min;
const long in_mid = (in_min + in_max) >> 1;
const long out_mid = (out_min + out_max) >> 1;
if (in_min == in_mid)
return out_mid;
if (x <= in_mid)
return bisect_map(x, in_min, in_mid, out_min, out_mid);
else
return bisect_map(x, in_mid + 1, in_max, out_mid, out_max);
}
On all algorithms - use fixed point arithmetic to avoid precision loss
mapping different values on the same ranges
Precompute the division. Multiplication remains during runtime
class Mapper
{
public:
Mapper(long in_min, long in_max, long out_min, long out_max, long fp_length = 8)
: in_min(in_min)
, out_min(out_min)
, fp_length(fp_length)
, factor(((out_max - out_min) << fp_length) / (in_max - in_min)) { }
long operator()(long x) {
return ((((x - in_min) * factor) >> fp_length) + out_min);
}
private:
long in_min;
long out_min;
long fp_length;
long factor;
};
We have no detection of integer overflow (in either the reference implementation or the replacement). Given that we're using a signed type, any overflow would be Undefined Behaviour.
I'm surprised that your implementation turns out faster than the reference. If your compiler can't implement multiplication at least as fast as repeated addition, or division as fast as repeated subtraction, you need a better compiler.
A faster (and less overflow-prone) technique may be to use one of the standard straight-line-drawing algorithms, since we can visualise the rescaling operation as line-graph lookup.
-
2\$\begingroup\$ On Arduinos with an AVR CPU, it generates a
call _mulsi3
for the multiplication andcall _divmodsi4
for the division (see godbolt.org/z/ebdPTW). Funnily enough, the compiler sees that the firstwhile
-loop infasterMap()
is a multiplication and replaces it bycall _mulsi3
. It doesn't do so for the division. Possibly because OP's code doesn't handle division by zero, and as @Heslacher mentioned, it doesn't handle negative numbers very well. \$\endgroup\$G. Sliepen– G. Sliepen2021年01月08日 20:02:44 +00:00Commented Jan 8, 2021 at 20:02