I can see some performance increase in some scenarios, although maybe not as much as I expect, and I see no increase whatsoever in some other scenarios (e.g. the factorial calculation).
- Do you see any way to improve performance?
(Implementing Toom 2.5 was suggested by user greybeard in the now deleted comments, and I will do that for sure). - Do you have any other comments (style, naming of variables, etc)?
I can see some performance increase in some scenarios, although maybe not as much as I expect, and I see no increase whatsoever in some other scenarios (e.g. the factorial calculation).
- Do you see any way to improve performance?
(Implementing Toom 2.5 was suggested by user greybeard in the comments, and I will do that for sure). - Do you have any other comments (style, naming of variables, etc)?
- Do you see any way to improve performance?
(Implementing Toom 2.5 was suggested by user greybeard in the now deleted comments, and I will do that for sure). - Do you have any other comments (style, naming of variables, etc)?
- 211
- 1
- 5
- Do you see what may cause the poorany way to improve performance?
(Implementing Toom 2.5 was suggested by user greybeard in the comments, and I will do that for sure). - Do you have any other comments (style, naming of variables, etc)?
void bigint::bigint_t::mult(
limbs::iterator product, limbs::iterator product_end,
limbs::const_iterator a_begin, limbs::const_iterator a_end,
limbs::const_iterator b_begin, limbs::const_iterator b_end
) {
constexpr size_t karatsubaThreshold = 128,
toom3Threshold = 10000;2000;
ptrdiff_t a_len = std::distance(a_begin, a_end),
b_len = std::distance(b_begin, b_end),
calc_len = std::min(a_len, b_len);
if (calc_len >= toom3ThresholdkaratsubaThreshold && b_len * 3 < a_len * 62) >={
b_len // Unbalanced operand case (a >> b) -> Split the longer one to better use algorithms
auto a_mid = a_begin + b_len;
auto product_mid = product + (2 * 5b_len);
mult(product, product_mid, b_begin, b_end, a_begin, a_mid);
limbs tailResult; tailResult.resize(a_len, 0);
mult(tailResult.begin(), tailResult.end(), b_begin, b_end, a_mid, a_end);
addInto(product + b_len, tailResult.cbegin(), tailResult.cend());
}
else if (calc_len >= karatsubaThreshold && a_len * 53 <=< b_len * 62) {
// Unbalanced operand case (a << b) -> Split the longer one to better use algorithms
auto b_mid = b_begin + a_len;
auto product_mid = product + (2 * a_len);
mult(product, product_mid, a_begin, a_end, b_begin, b_mid);
limbs tailResult; tailResult.resize(b_len, 0);
mult(tailResult.begin(), tailResult.end(), a_begin, a_end, b_mid, b_end);
addInto(product + a_len, tailResult.cbegin(), tailResult.cend());
}
else if (calc_len >= toom3Threshold)
mult_toom3(product, product_end, a_begin, a_end, b_begin, b_end);
else if (calc_len >= karatsubaThreshold)
mult_karatsuba(product, product_end, a_begin, a_end, b_begin, b_end);
else
mult_vanilla(product, product_end, a_begin, a_end, b_begin, b_end);
}
// See algorithm at https://github.com/AtmoFX/bigint/blob/master/documentation/multiplication.md#Toom-Cook
// Notation as per the documentation: lowercase variable = strings of limbs; uppercase variable = polynomial
void bigint::bigint_t::mult_toom3(
limbs::iterator product_begin, limbs::iterator product_end,
limbs::const_iterator a_begin , limbs::const_iterator a_end,
limbs::const_iterator b_begin , limbs::const_iterator b_end
)
{
//#define MeasureTime
#ifdef MeasureTime
using std::chrono::steady_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::milliseconds;
auto milestone1 = steady_clock::now();
#endif
size_t size_a = std::distance(a_begin, a_end),
size_b = std::distance(b_begin, b_end),
l = std::min(size_a, size_b),
l_third = (l + 2) / 3;
limbs::const_iterator
a1_begin = a_begin + l_third,
a2_begin = a1_begin + l_third,
b1_begin = b_begin + l_third,
b2_begin = b1_begin + l_third;
limbs::iterator
p0_begin = product_begin,
p1_begin = p0_begin + l_third,
p2_begin = p1_begin + l_third,
p3_begin = p2_begin + l_third,
p4_begin = p3_begin + l_third;
bigint_t a0, a1, a2, b0, b1, b2;
a0.value.resize(l_third); a1.value.resize(l_third); a2.value.resize(size_a - 2 * l_third);
b0.value.resize(l_third); b1.value.resize(l_third); b2.value.resize(size_b - 2 * l_third);
std::copy(a_begin, a1_begin, a0.value.begin()); std::copy(a1_begin, a2_begin, a1.value.begin()); std::copy(a2_begin, a_end, a2.value.begin());
std::copy(b_begin, b1_begin, b0.value.begin()); std::copy(b1_begin, b2_begin, b1.value.begin()); std::copy(b2_begin, b_end, b2.value.begin());
#ifdef MeasureTime
auto milestone2 = steady_clock::now();
#endif
bigint_t va_1 = a2 + a0 - a1 , vb_1 = b2 + b0 - b1,
va1 = a2 + a1 + a0 , vb1 = b2 + b1 + b0,
va2 = (a2 << 2) + (a1 << 1) + a0, vb2 = (b2 << 2) + (b1 << 1) + b0;
#ifdef MeasureTime
auto milestone3 = steady_clock::now();
#endif
//P(-1), P(1) and P(2)
bigint_t P_1 = va_1 * vb_1,
P1 = va1 * vb1,
P2 = va2 * vb2;
// p0 = P(0) and p4 = P(infinity)
bigint_t p0 = a0 * b0,
p4 = a2 * b2;
#ifdef MeasureTime
auto milestone4 = steady_clock::now();
#endif
bigint_t p0_plus_p4 = p0 + p4,
p2 = ((P_1 + P1) >> 1) - p0_plus_p4,
p3 = (p0 - 14 * p4 + P2 - ((p2 + P1) << 1)) / 6,
p1 = P1 - (p0_plus_p4 + p2 + p3);
addInto(p0_begin, p0.value.cbegin(), p0.value.cend());
addInto(p1_begin, p1.value.cbegin(), p1.value.cend());
addInto(p2_begin, p2.value.cbegin(), p2.value.cend());
addInto(p3_begin, p3.value.cbegin(), p3.value.cend());
addInto(p4_begin, p4.value.cbegin(), p4.value.cend());
#ifdef MeasureTime
auto milestone5 = steady_clock::now();
std::cerr
<< "Memory allocation + copy : " << duration_cast<milliseconds>(milestone2 - milestone1).count() << " ms\n"
<< "Variable calculation (+/-): " << duration_cast<milliseconds>(milestone3 - milestone2).count() << " ms\n"
<< "Recursion : " << duration_cast<milliseconds>(milestone4 - milestone3).count() << " ms\n"
<< "Equation resolution : " << duration_cast<milliseconds>(milestone5 - milestone4).count() << " ms\n";
#undef MeasureTime
#endif
}
- Squaring of a number.
Operands have the same size and there is definitely a performance gain. - Factorial calculation.
The factorial algorithm is designed to multiply big operands together (it does not do((((1 * 2) * 3) * 4) ...
) but the operands have mismatching sizes nonetheless (the biggerit spends a non-negligible amount of time doing multiplications with at least one operand ends up being about 15% longer thanunder the shorter operand)Toom-3 threshold.
Although it is close to impossible to tell how much, we expect - and since the performance gainlast edit, there is all but obvious (it even looks like a- some performance loss on average)gain.
Note that a lot of smaller operands are multiplied together during the factorial calculation (it happens behind the scenes) so if the squaring of a number was about 20% or so faster with Toom-3, this is not necessarily a ratio we should expect for the factorial. It is probably more relevant to measure an improvement in time difference than in time ratio here; all the more so when multiplying larger numbers together.
#include <bigint.h>
#include <chrono>
#include <iostream>
int main(int argc, char* argv[])
{
using std::chrono::steady_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::milliseconds;
{
//p is not big enough to use Toom-3 multiplication due to hardcoded threshold.
auto p = bigint::power(453256, 25000);
std::cerr << "Starting multiplication...\n";
auto start = steady_clock::now();
//s and q will use 1 round of Toom-3 multiplication each.
auto s = p * p;
auto q = s * s;
auto end = steady_clock::now();
std::cerr << "Total multiplication time: " << duration_cast<milliseconds>(end - start).count() << " ms\n";
std::cout << p.toString() << '\n';
//return 0;
}
std::cerr << "\n\n\n";
{
std::cerr << "Starting factorial calculation...\n";
auto start = steady_clock::now();
// f uses Toom-3 for its calculation.
auto f = bigint::factorial(200000);
auto end = steady_clock::now();
std::cerr << "Total multiplication time: " << duration_cast<milliseconds>(end - start).count() << " ms\n";
std::cout << f.toString() << '\n';
return 0;
}
}
Here is the output I get out ofEdit: With comments provided to my main()
for the integer squaringquestion, showing a gain in performance thanksI have managed to create some performance gains since the Toom-3 algorithm:original post. Below are the measurements made for the multiplication and the factorial function now.
This is most likely not fully optimized yet.
For the multiplication case (453,25625,000 squared, then squared again):
- When Toom-3 is commented out (i.e.
mult_karatsuba
is called in every scenario wheremult_toom3
should have been called):Starting multiplication... Total multiplication time: 141 ms
Starting multiplication... Total multiplication time: 141 ms
- When Toom-3 is not commented:
Starting multiplication... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 27 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 80 ms Equation resolution : 1 ms Total multiplication time: 115 ms
Total multiplication time: 88 ms
Now, here is the output I get forFor the factorial calculation. I have run it 3 times in both scenarios and the best performing occurrence of Toom-3 is comparable to the worst performing occurrence without it (this is consistent with my previous observations200,000!):
- When Toom-3 is commented (Worst of 3 runs):
Starting factorial calculation... Total multiplication time: 3811 ms
- When Toom-3 is not commented (Best and worst of 3 runs):
Starting factorial calculation... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 51 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 218 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 152 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 23 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 507 ms Equation resolution : 2 ms Total multiplication time: 3799 ms
Starting factorial calculation... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 64 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 200 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 154 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 23 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 507 ms Equation resolution : 2 ms Total multiplication time: 39903021 ms
To be clear, a few of the above observations are confusing me:
The first test case (multiplication) shows a performance gain even if
mult_toom3
is called once (the operands' value was kept relatively small for that purpose: no more than 1 call per multiplication, i.e. 2 calls in total).The threshold may be so high that it does not leverage as much performance gain as was possible (more than 99% of the time spent in Toom-3 is in the recursion; the additional less complex operations take almost nothing) and indeed, decreasing it to 2000 allows the product to be completed faster.
But again, the goal here was to test the algorithm in a number range where toom-3 is unquestionably faster than karatsuba.
200k! takes the same amount of time with or without
mult_toom3
but you could edit the code to confirm 500k! or even 1M! do as well... To me, it looks as if the complexity for Toom-3 was the same as for Karatsuba.Note that if 200K! was to take the same time in both scenarios by sheer coincidence, then I would expect 1M! (with over 5✕ more digits) to show a clear difference, which it does not.
Last but not least, the gain I measure for 453256100k is not reproduced for 1M!, although 1M! is considerably bigger (more than 5 million digits) than 453256100k (only 565.6 thousand digits).
I fail to understand why it appears that my Toom-3 algorithm has the same complexity as Karatsuba.
- Do you see what may cause the poor performance?
- Do you have any other comments (style, naming of variables, etc)?
void bigint::bigint_t::mult(
limbs::iterator product, limbs::iterator product_end,
limbs::const_iterator a_begin, limbs::const_iterator a_end,
limbs::const_iterator b_begin, limbs::const_iterator b_end
) {
constexpr size_t karatsubaThreshold = 128,
toom3Threshold = 10000;
ptrdiff_t a_len = std::distance(a_begin, a_end),
b_len = std::distance(b_begin, b_end),
calc_len = std::min(a_len, b_len);
if (calc_len >= toom3Threshold && a_len * 6 >= b_len * 5 && a_len * 5 <= b_len * 6)
mult_toom3(product, product_end, a_begin, a_end, b_begin, b_end);
else if (calc_len >= karatsubaThreshold)
mult_karatsuba(product, product_end, a_begin, a_end, b_begin, b_end);
else
mult_vanilla(product, product_end, a_begin, a_end, b_begin, b_end);
}
// See algorithm at https://github.com/AtmoFX/bigint/blob/master/documentation/multiplication.md#Toom-Cook
// Notation as per the documentation: lowercase variable = strings of limbs; uppercase variable = polynomial
void bigint::bigint_t::mult_toom3(
limbs::iterator product_begin, limbs::iterator product_end,
limbs::const_iterator a_begin , limbs::const_iterator a_end,
limbs::const_iterator b_begin , limbs::const_iterator b_end
)
{
//#define MeasureTime
#ifdef MeasureTime
using std::chrono::steady_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::milliseconds;
auto milestone1 = steady_clock::now();
#endif
size_t size_a = std::distance(a_begin, a_end),
size_b = std::distance(b_begin, b_end),
l = std::min(size_a, size_b),
l_third = (l + 2) / 3;
limbs::const_iterator
a1_begin = a_begin + l_third,
a2_begin = a1_begin + l_third,
b1_begin = b_begin + l_third,
b2_begin = b1_begin + l_third;
limbs::iterator
p0_begin = product_begin,
p1_begin = p0_begin + l_third,
p2_begin = p1_begin + l_third,
p3_begin = p2_begin + l_third,
p4_begin = p3_begin + l_third;
bigint_t a0, a1, a2, b0, b1, b2;
a0.value.resize(l_third); a1.value.resize(l_third); a2.value.resize(size_a - 2 * l_third);
b0.value.resize(l_third); b1.value.resize(l_third); b2.value.resize(size_b - 2 * l_third);
std::copy(a_begin, a1_begin, a0.value.begin()); std::copy(a1_begin, a2_begin, a1.value.begin()); std::copy(a2_begin, a_end, a2.value.begin());
std::copy(b_begin, b1_begin, b0.value.begin()); std::copy(b1_begin, b2_begin, b1.value.begin()); std::copy(b2_begin, b_end, b2.value.begin());
#ifdef MeasureTime
auto milestone2 = steady_clock::now();
#endif
bigint_t va_1 = a2 + a0 - a1 , vb_1 = b2 + b0 - b1,
va1 = a2 + a1 + a0 , vb1 = b2 + b1 + b0,
va2 = (a2 << 2) + (a1 << 1) + a0, vb2 = (b2 << 2) + (b1 << 1) + b0;
#ifdef MeasureTime
auto milestone3 = steady_clock::now();
#endif
//P(-1), P(1) and P(2)
bigint_t P_1 = va_1 * vb_1,
P1 = va1 * vb1,
P2 = va2 * vb2;
// p0 = P(0) and p4 = P(infinity)
bigint_t p0 = a0 * b0,
p4 = a2 * b2;
#ifdef MeasureTime
auto milestone4 = steady_clock::now();
#endif
bigint_t p0_plus_p4 = p0 + p4,
p2 = ((P_1 + P1) >> 1) - p0_plus_p4,
p3 = (p0 - 14 * p4 + P2 - ((p2 + P1) << 1)) / 6,
p1 = P1 - (p0_plus_p4 + p2 + p3);
addInto(p0_begin, p0.value.cbegin(), p0.value.cend());
addInto(p1_begin, p1.value.cbegin(), p1.value.cend());
addInto(p2_begin, p2.value.cbegin(), p2.value.cend());
addInto(p3_begin, p3.value.cbegin(), p3.value.cend());
addInto(p4_begin, p4.value.cbegin(), p4.value.cend());
#ifdef MeasureTime
auto milestone5 = steady_clock::now();
std::cerr
<< "Memory allocation + copy : " << duration_cast<milliseconds>(milestone2 - milestone1).count() << " ms\n"
<< "Variable calculation (+/-): " << duration_cast<milliseconds>(milestone3 - milestone2).count() << " ms\n"
<< "Recursion : " << duration_cast<milliseconds>(milestone4 - milestone3).count() << " ms\n"
<< "Equation resolution : " << duration_cast<milliseconds>(milestone5 - milestone4).count() << " ms\n";
#undef MeasureTime
#endif
}
- Squaring of a number.
Operands have the same size and there is definitely a performance gain. - Factorial calculation.
The factorial algorithm is designed to multiply big operands together (it does not do((((1 * 2) * 3) * 4) ...
) but the operands have mismatching sizes nonetheless (the bigger operand ends up being about 15% longer than the shorter operand) and the performance gain is all but obvious (it even looks like a performance loss on average).
Note that a lot of smaller operands are multiplied together during the factorial calculation (it happens behind the scenes) so if the squaring of a number was about 20% or so faster with Toom-3, this is not necessarily a ratio we should expect for the factorial. It is probably more relevant to measure an improvement in time difference than in time ratio here; all the more so when multiplying larger numbers together.
#include <bigint.h>
#include <chrono>
#include <iostream>
int main(int argc, char* argv[])
{
using std::chrono::steady_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::milliseconds;
{
//p is not big enough to use Toom-3 multiplication due to hardcoded threshold.
auto p = bigint::power(453256, 25000);
std::cerr << "Starting multiplication...\n";
auto start = steady_clock::now();
//s and q will use 1 round of Toom-3 multiplication each.
auto s = p * p;
auto q = s * s;
auto end = steady_clock::now();
std::cerr << "Total multiplication time: " << duration_cast<milliseconds>(end - start).count() << " ms\n";
std::cout << p.toString() << '\n';
//return 0;
}
std::cerr << "\n\n\n";
{
std::cerr << "Starting factorial calculation...\n";
auto start = steady_clock::now();
// f uses Toom-3 for its calculation.
auto f = bigint::factorial(200000);
auto end = steady_clock::now();
std::cerr << "Total multiplication time: " << duration_cast<milliseconds>(end - start).count() << " ms\n";
std::cout << f.toString() << '\n';
return 0;
}
}
Here is the output I get out of my main()
for the integer squaring, showing a gain in performance thanks to the Toom-3 algorithm:
- When Toom-3 is commented out (i.e.
mult_karatsuba
is called in every scenario wheremult_toom3
should have been called):Starting multiplication... Total multiplication time: 141 ms
- When Toom-3 is not commented:
Starting multiplication... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 27 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 80 ms Equation resolution : 1 ms Total multiplication time: 115 ms
Now, here is the output I get for the factorial calculation. I have run it 3 times in both scenarios and the best performing occurrence of Toom-3 is comparable to the worst performing occurrence without it (this is consistent with my previous observations):
- When Toom-3 is commented (Worst of 3 runs):
Starting factorial calculation... Total multiplication time: 3811 ms
- When Toom-3 is not commented (Best and worst of 3 runs):
Starting factorial calculation... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 51 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 218 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 152 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 23 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 507 ms Equation resolution : 2 ms Total multiplication time: 3799 ms
Starting factorial calculation... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 64 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 200 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 154 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 23 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 507 ms Equation resolution : 2 ms Total multiplication time: 3990 ms
To be clear, a few of the above observations are confusing me:
The first test case (multiplication) shows a performance gain even if
mult_toom3
is called once (the operands' value was kept relatively small for that purpose: no more than 1 call per multiplication, i.e. 2 calls in total).The threshold may be so high that it does not leverage as much performance gain as was possible (more than 99% of the time spent in Toom-3 is in the recursion; the additional less complex operations take almost nothing) and indeed, decreasing it to 2000 allows the product to be completed faster.
But again, the goal here was to test the algorithm in a number range where toom-3 is unquestionably faster than karatsuba.
200k! takes the same amount of time with or without
mult_toom3
but you could edit the code to confirm 500k! or even 1M! do as well... To me, it looks as if the complexity for Toom-3 was the same as for Karatsuba.Note that if 200K! was to take the same time in both scenarios by sheer coincidence, then I would expect 1M! (with over 5✕ more digits) to show a clear difference, which it does not.
Last but not least, the gain I measure for 453256100k is not reproduced for 1M!, although 1M! is considerably bigger (more than 5 million digits) than 453256100k (only 565.6 thousand digits).
I fail to understand why it appears that my Toom-3 algorithm has the same complexity as Karatsuba.
- Do you see any way to improve performance?
(Implementing Toom 2.5 was suggested by user greybeard in the comments, and I will do that for sure). - Do you have any other comments (style, naming of variables, etc)?
void bigint::bigint_t::mult(
limbs::iterator product, limbs::iterator product_end,
limbs::const_iterator a_begin, limbs::const_iterator a_end,
limbs::const_iterator b_begin, limbs::const_iterator b_end
) {
constexpr size_t karatsubaThreshold = 128,
toom3Threshold = 2000;
ptrdiff_t a_len = std::distance(a_begin, a_end),
b_len = std::distance(b_begin, b_end),
calc_len = std::min(a_len, b_len);
if (calc_len >= karatsubaThreshold && b_len * 3 < a_len * 2) {
// Unbalanced operand case (a >> b) -> Split the longer one to better use algorithms
auto a_mid = a_begin + b_len;
auto product_mid = product + (2 * b_len);
mult(product, product_mid, b_begin, b_end, a_begin, a_mid);
limbs tailResult; tailResult.resize(a_len, 0);
mult(tailResult.begin(), tailResult.end(), b_begin, b_end, a_mid, a_end);
addInto(product + b_len, tailResult.cbegin(), tailResult.cend());
}
else if (calc_len >= karatsubaThreshold && a_len * 3 < b_len * 2) {
// Unbalanced operand case (a << b) -> Split the longer one to better use algorithms
auto b_mid = b_begin + a_len;
auto product_mid = product + (2 * a_len);
mult(product, product_mid, a_begin, a_end, b_begin, b_mid);
limbs tailResult; tailResult.resize(b_len, 0);
mult(tailResult.begin(), tailResult.end(), a_begin, a_end, b_mid, b_end);
addInto(product + a_len, tailResult.cbegin(), tailResult.cend());
}
else if (calc_len >= toom3Threshold)
mult_toom3(product, product_end, a_begin, a_end, b_begin, b_end);
else if (calc_len >= karatsubaThreshold)
mult_karatsuba(product, product_end, a_begin, a_end, b_begin, b_end);
else
mult_vanilla(product, product_end, a_begin, a_end, b_begin, b_end);
}
// See algorithm at https://github.com/AtmoFX/bigint/blob/master/documentation/multiplication.md#Toom-Cook
// Notation as per the documentation: lowercase variable = strings of limbs; uppercase variable = polynomial
void bigint::bigint_t::mult_toom3(
limbs::iterator product_begin, limbs::iterator product_end,
limbs::const_iterator a_begin , limbs::const_iterator a_end,
limbs::const_iterator b_begin , limbs::const_iterator b_end
)
{
//#define MeasureTime
#ifdef MeasureTime
using std::chrono::steady_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::milliseconds;
auto milestone1 = steady_clock::now();
#endif
size_t size_a = std::distance(a_begin, a_end),
size_b = std::distance(b_begin, b_end),
l = std::min(size_a, size_b),
l_third = (l + 2) / 3;
limbs::const_iterator
a1_begin = a_begin + l_third,
a2_begin = a1_begin + l_third,
b1_begin = b_begin + l_third,
b2_begin = b1_begin + l_third;
limbs::iterator
p0_begin = product_begin,
p1_begin = p0_begin + l_third,
p2_begin = p1_begin + l_third,
p3_begin = p2_begin + l_third,
p4_begin = p3_begin + l_third;
bigint_t a0, a1, a2, b0, b1, b2;
a0.value.resize(l_third); a1.value.resize(l_third); a2.value.resize(size_a - 2 * l_third);
b0.value.resize(l_third); b1.value.resize(l_third); b2.value.resize(size_b - 2 * l_third);
std::copy(a_begin, a1_begin, a0.value.begin()); std::copy(a1_begin, a2_begin, a1.value.begin()); std::copy(a2_begin, a_end, a2.value.begin());
std::copy(b_begin, b1_begin, b0.value.begin()); std::copy(b1_begin, b2_begin, b1.value.begin()); std::copy(b2_begin, b_end, b2.value.begin());
#ifdef MeasureTime
auto milestone2 = steady_clock::now();
#endif
bigint_t va_1 = a2 + a0 - a1 , vb_1 = b2 + b0 - b1,
va1 = a2 + a1 + a0 , vb1 = b2 + b1 + b0,
va2 = (a2 << 2) + (a1 << 1) + a0, vb2 = (b2 << 2) + (b1 << 1) + b0;
#ifdef MeasureTime
auto milestone3 = steady_clock::now();
#endif
//P(-1), P(1) and P(2)
bigint_t P_1 = va_1 * vb_1,
P1 = va1 * vb1,
P2 = va2 * vb2;
// p0 = P(0) and p4 = P(infinity)
bigint_t p0 = a0 * b0,
p4 = a2 * b2;
#ifdef MeasureTime
auto milestone4 = steady_clock::now();
#endif
bigint_t p0_plus_p4 = p0 + p4,
p2 = ((P_1 + P1) >> 1) - p0_plus_p4,
p3 = (p0 - 14 * p4 + P2 - ((p2 + P1) << 1)) / 6,
p1 = P1 - (p0_plus_p4 + p2 + p3);
addInto(p0_begin, p0.value.cbegin(), p0.value.cend());
addInto(p1_begin, p1.value.cbegin(), p1.value.cend());
addInto(p2_begin, p2.value.cbegin(), p2.value.cend());
addInto(p3_begin, p3.value.cbegin(), p3.value.cend());
addInto(p4_begin, p4.value.cbegin(), p4.value.cend());
#ifdef MeasureTime
auto milestone5 = steady_clock::now();
std::cerr
<< "Memory allocation + copy : " << duration_cast<milliseconds>(milestone2 - milestone1).count() << " ms\n"
<< "Variable calculation (+/-): " << duration_cast<milliseconds>(milestone3 - milestone2).count() << " ms\n"
<< "Recursion : " << duration_cast<milliseconds>(milestone4 - milestone3).count() << " ms\n"
<< "Equation resolution : " << duration_cast<milliseconds>(milestone5 - milestone4).count() << " ms\n";
#undef MeasureTime
#endif
}
- Squaring of a number.
Operands have the same size and there is definitely a performance gain. - Factorial calculation.
The factorial algorithm is designed to multiply big operands together (it does not do((((1 * 2) * 3) * 4) ...
) but it spends a non-negligible amount of time doing multiplications with at least one operand under the Toom-3 threshold.
Although it is close to impossible to tell how much, we expect - and since the last edit, there is - some performance gain.
#include <bigint.h>
#include <chrono>
#include <iostream>
int main(int argc, char* argv[])
{
using std::chrono::steady_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::milliseconds;
{
auto p = bigint::power(453256, 25000);
std::cerr << "Starting multiplication...\n";
auto start = steady_clock::now();
//s and q will use 1 round of Toom-3 multiplication each.
auto s = p * p;
auto q = s * s;
auto end = steady_clock::now();
std::cerr << "Total multiplication time: " << duration_cast<milliseconds>(end - start).count() << " ms\n";
std::cout << p.toString() << '\n';
//return 0;
}
std::cerr << "\n\n\n";
{
std::cerr << "Starting factorial calculation...\n";
auto start = steady_clock::now();
// f uses Toom-3 for its calculation.
auto f = bigint::factorial(200000);
auto end = steady_clock::now();
std::cerr << "Total multiplication time: " << duration_cast<milliseconds>(end - start).count() << " ms\n";
std::cout << f.toString() << '\n';
return 0;
}
}
Edit: With comments provided to my question, I have managed to create some performance gains since the original post. Below are the measurements made for the multiplication and the factorial function now.
This is most likely not fully optimized yet.
For the multiplication case (453,25625,000 squared, then squared again):
- When Toom-3 is commented out (i.e.
mult_karatsuba
is called in every scenario wheremult_toom3
should have been called):Starting multiplication... Total multiplication time: 141 ms
- When Toom-3 is not commented:
Total multiplication time: 88 ms
For the factorial (200,000!)
- When Toom-3 is commented:
Starting factorial calculation... Total multiplication time: 3811 ms
- When Toom-3 is not commented:
Starting factorial calculation... Total multiplication time: 3021 ms
Toom-3 big-integer multiplication algorithm underperforming?
The Toom-3 algorithm is resisting me a little bit. AFAICT, it does produce the right results but not as fast as I expected.
I can see some performance increase in some scenarios, although maybe not as much as what I should expect, and I see no increase whatsoever in some other scenarios (e.g. the factorial calculation).
I can see some performance increase in some scenarios, although maybe not as much as I expect, and I see no increase whatsoever in some other scenarios (e.g. the factorial calculation).
- Do you see what may cause for the poor performance?
- Do you have any other commentcomments (style, naming of variables, etc)?
mult
is in charge of picking what it thinks is the most suitable algorithm given the operands' size; 3 algorithms are implemented so far (if you are unfamiliar with them, you can get some introduction in the documentation I dropped on the linklinked repository).
Note that I have set a high threshold. At such value, the benefit of callingmult_toom3
overmult_karatsuba
should be visible even if called only once, and then never again in the recursion).
Note that I have set a high threshold. At such value, the benefit of calling mult_toom3
over mult_karatsuba
should be visible even if called only once, and then never again in the recursion.
mult_toom3
performs Toom-3 calculation, callingmult
recursively.
In the below version offollowing mult_toom3()
implementation, I have added some time measures (when //#define MeasureTime
is uncommented) to show where processing time is being spent.
-> Almost Almost all of it is in the recursion (and none of the time in the calculations around).
If you want to test for yourselves, you'll need to download the full source available on my depositorymy repository (feel free to star the project while you are at it) and e.g. use the belowa main()
function such as the one below.
- Squaring of a number.
Operands have the same size and there is definitely a performance gain. - Factorial calculation.
The factorial algorithm is designed to multiply big operands together (it does not do((((1 * 2) * 3) * 4) ...
) but the operands have mismatching sizes nonetheless (the bigger operand ends up being about 15% longer than the shorter operand) and the performance gain is all but obvious (it even looks like a performance loss inon average).
Note that a lot of smaller operands are multiplied together during the factorial calculation (it happens behind the scenescenes) so if the squaring of a number was about 20% or so faster with Toom-3, this is not necessarily a ratio we should expect for the factorial. It is probably more relevant to measure an improvement in time difference than in time ratio here; all the more so when multiplying larger numbers together.
Here is the output I get out of my main()
for the integer squaring, showing a gain in performance thanks to the Toom-3 algorithm:
- When Toom-3 is commented out (i.e.
mult_karatsuba
is called in every scenario wheremult_toom3
should have been called):Starting multiplication... Total multiplication time: 141 ms
Starting multiplication... Total multiplication time: 141 ms
- When Toom-3 is not commented:
Starting multiplication... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 27 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 80 ms Equation resolution : 1 ms Total multiplication time: 115 ms
Starting multiplication... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 27 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 80 ms Equation resolution : 1 ms Total multiplication time: 115 ms
- When Toom-3 is commented (Worst of 3 runs):
Starting factorial calculation... Total multiplication time: 3811 ms
Starting factorial calculation... Total multiplication time: 3811 ms
- When Toom-3 is not commented (Best and worst of 3 runs):
Starting factorial calculation... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 51 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 218 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 152 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 23 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 507 ms Equation resolution : 2 ms Total multiplication time: 3799 ms
Starting factorial calculation... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 51 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 218 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 152 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 23 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 507 ms Equation resolution : 2 ms Total multiplication time: 3799 ms
Starting factorial calculation... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 64 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 200 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 154 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 23 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 507 ms Equation resolution : 2 ms Total multiplication time: 3990 ms
Starting factorial calculation... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 64 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 200 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 154 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 23 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 507 ms Equation resolution : 2 ms Total multiplication time: 3990 ms
Edit:
- The first test case (multiplication) shows a performance gain even if
mult_toom3
is called once (the operands' value was kept relatively small for that purpose: no more than 1 call per multiplication, i.e. 2 calls in total).The first test case (multiplication) shows a performance gain even if
The threshold may be so high it does not leverage as much performance gain as was possible (more than 99% of the time spent in Toom-3 is in the recursion; the additional less complex operations take almost nothing.) and indeed, decreasing it to 2000 allows the product to be completed faster.mult_toom3
is called once (the operands' value was kept relatively small for that purpose: no more than 1 call per multiplication, i.e. 2 calls in total).
But again, the goal here was to test the algorithm in a number range where toom-3 is unquestionably faster than karatsuba.The threshold may be so high that it does not leverage as much performance gain as was possible (more than 99% of the time spent in Toom-3 is in the recursion; the additional less complex operations take almost nothing) and indeed, decreasing it to 2000 allows the product to be completed faster.
But again, the goal here was to test the algorithm in a number range where toom-3 is unquestionably faster than karatsuba.
- 200k! takes the same amount of time be it with or without
mult_toom3
but you could edit the code to confirm 500k! or even 1M! do as well...200k! takes the same amount of time with or without
To me, it looks as if the complexity for Toom-3 was the same as for Karatsuba.mult_toom3
but you could edit the code to confirm 500k! or even 1M! do as well... To me, it looks as if the complexity for Toom-3 was the same as for Karatsuba.Note that if 200K! was to take the same time in both scenarios by sheer coincidence, then I would expect 1M! (with over 5✕ more digits) to show a clear difference, which it does not.
Note that if 200K! was to take the same time in both scenarios by sheer coincidence, then I would expect 1M! (more than 5x more digits) to show a clear difference, which it does not. - Last but not least, the gain I measure for 453256100k is not reproduced for 1M!, although 1M! is considerably bigger (more than 5 million digits) than 453256100k ("only 565.6 thousand digits)
Last but not least, the gain I measure for 453256100k is not reproduced for 1M!, although 1M! is considerably bigger (more than 5 million digits) than 453256100k (only 565.6 thousand digits).
-> I fail to understand why it looks likeappears that my Toom-3 algorithm looks like it has the same complexity as Karatsuba.
Toom-3 algorithm underperforming?
The Toom-3 algorithm is resisting me a little bit. AFAICT, it does produce the right results but not as fast as I expected.
I can see some performance increase in some scenarios, although maybe not as much as what I should expect, and I see no increase whatsoever in some other scenarios (e.g. the factorial calculation).
- Do you see what may cause for the poor performance?
- Do you have any other comment (style, naming of variables, etc)?
mult
is in charge of picking what it thinks is the most suitable algorithm given the operands' size; 3 algorithms are implemented so far (if you are unfamiliar with them, you can get some introduction in the documentation I dropped on the link repository).
Note that I have set a high threshold. At such value, the benefit of callingmult_toom3
overmult_karatsuba
should be visible even if called only once, and then never again in the recursion).mult_toom3
performs Toom-3 calculation, callingmult
recursively.
In the below version of mult_toom3
, I have added some time measures (when //#define MeasureTime
is uncommented) to show where processing time is being spent.
-> Almost all of it is in the recursion (and none of the time in the calculations around).
If you want to test for yourselves, you'll need to download the full source available on my depository (feel free to star the project while you are at it) and e.g. use the below main
function.
- Squaring of a number. Operands have the same size and there is definitely a performance gain.
- Factorial calculation.
The factorial algorithm is designed to multiply big operands together (it does not do
((((1 * 2) * 3) * 4) ...
) but the operands have mismatching sizes nonetheless (the bigger operand ends up being about 15% longer than the shorter operand) and the performance gain is all but obvious (it even looks like a performance loss in average).
Note that a lot of smaller operands are multiplied together during the factorial calculation (it happens behind the scene) so if the squaring of a number was about 20% or so faster with Toom-3, this is not necessarily a ratio we should expect for the factorial. It is probably more relevant to measure an improvement in time difference than in time ratio here; all the more so when multiplying larger numbers together.
Here is the output I get out of my main
for the integer squaring, showing a gain in performance thanks to the Toom-3 algorithm:
- When Toom-3 is commented (i.e.
mult_karatsuba
is called in every scenario wheremult_toom3
should have been called):Starting multiplication... Total multiplication time: 141 ms
- When Toom-3 is not commented:
Starting multiplication... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 27 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 80 ms Equation resolution : 1 ms Total multiplication time: 115 ms
- When Toom-3 is commented (Worst of 3 runs):
Starting factorial calculation... Total multiplication time: 3811 ms
- When Toom-3 is not commented (Best and worst of 3 runs):
Starting factorial calculation... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 51 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 218 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 152 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 23 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 507 ms Equation resolution : 2 ms Total multiplication time: 3799 ms
Starting factorial calculation... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 64 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 200 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 154 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 23 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 507 ms Equation resolution : 2 ms Total multiplication time: 3990 ms
Edit:
- The first test case (multiplication) shows a performance gain even if
mult_toom3
is called once (the operands' value was kept relatively small for that purpose: no more than 1 call per multiplication, i.e. 2 calls in total).
The threshold may be so high it does not leverage as much performance gain as was possible (more than 99% of the time spent in Toom-3 is in the recursion; the additional less complex operations take almost nothing.) and indeed, decreasing it to 2000 allows the product to be completed faster.
But again, the goal here was to test the algorithm in a number range where toom-3 is unquestionably faster than karatsuba. - 200k! takes the same amount of time be it with or without
mult_toom3
but you could edit the code to confirm 500k! or even 1M! do as well... To me, it looks as if the complexity for Toom-3 was the same as for Karatsuba.
Note that if 200K! was to take the same time in both scenarios by sheer coincidence, then I would expect 1M! (more than 5x more digits) to show a clear difference, which it does not. - Last but not least, the gain I measure for 453256100k is not reproduced for 1M!, although 1M! is considerably bigger (more than 5 million digits) than 453256100k ("only 565.6 thousand digits)
-> I fail to understand why it looks like my Toom-3 algorithm looks like it has the same complexity as Karatsuba.
Toom-3 big-integer multiplication algorithm
The Toom-3 algorithm is resisting me a little bit. AFAICT, it does produce the right results but not as fast as I expected.
I can see some performance increase in some scenarios, although maybe not as much as I expect, and I see no increase whatsoever in some other scenarios (e.g. the factorial calculation).
- Do you see what may cause the poor performance?
- Do you have any other comments (style, naming of variables, etc)?
mult
is in charge of picking what it thinks is the most suitable algorithm given the operands' size; 3 algorithms are implemented so far (if you are unfamiliar with them, you can get some introduction in the documentation I dropped on the linked repository).
Note that I have set a high threshold. At such value, the benefit of calling mult_toom3
over mult_karatsuba
should be visible even if called only once, and then never again in the recursion.
mult_toom3
performs Toom-3 calculation, callingmult
recursively.
In the following mult_toom3()
implementation, I have added some time measures (when //#define MeasureTime
is uncommented) to show where processing time is being spent. Almost all of it is in the recursion (and none in the calculations around).
If you want to test for yourselves, you'll need to download the full source available on my repository and e.g. use a main()
function such as the one below.
- Squaring of a number.
Operands have the same size and there is definitely a performance gain. - Factorial calculation.
The factorial algorithm is designed to multiply big operands together (it does not do((((1 * 2) * 3) * 4) ...
) but the operands have mismatching sizes nonetheless (the bigger operand ends up being about 15% longer than the shorter operand) and the performance gain is all but obvious (it even looks like a performance loss on average).
Note that a lot of smaller operands are multiplied together during the factorial calculation (it happens behind the scenes) so if the squaring of a number was about 20% or so faster with Toom-3, this is not necessarily a ratio we should expect for the factorial. It is probably more relevant to measure an improvement in time difference than in time ratio here; all the more so when multiplying larger numbers together.
Here is the output I get out of my main()
for the integer squaring, showing a gain in performance thanks to the Toom-3 algorithm:
- When Toom-3 is commented out (i.e.
mult_karatsuba
is called in every scenario wheremult_toom3
should have been called):Starting multiplication... Total multiplication time: 141 ms
- When Toom-3 is not commented:
Starting multiplication... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 27 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 80 ms Equation resolution : 1 ms Total multiplication time: 115 ms
- When Toom-3 is commented (Worst of 3 runs):
Starting factorial calculation... Total multiplication time: 3811 ms
- When Toom-3 is not commented (Best and worst of 3 runs):
Starting factorial calculation... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 51 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 218 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 152 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 23 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 507 ms Equation resolution : 2 ms Total multiplication time: 3799 ms
Starting factorial calculation... Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 64 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 200 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 154 ms Equation resolution : 1 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 23 ms Equation resolution : 0 ms Memory allocation + copy : 0 ms Variable calculation (+/-): 0 ms Recursion : 507 ms Equation resolution : 2 ms Total multiplication time: 3990 ms
The first test case (multiplication) shows a performance gain even if
mult_toom3
is called once (the operands' value was kept relatively small for that purpose: no more than 1 call per multiplication, i.e. 2 calls in total).The threshold may be so high that it does not leverage as much performance gain as was possible (more than 99% of the time spent in Toom-3 is in the recursion; the additional less complex operations take almost nothing) and indeed, decreasing it to 2000 allows the product to be completed faster.
But again, the goal here was to test the algorithm in a number range where toom-3 is unquestionably faster than karatsuba.
200k! takes the same amount of time with or without
mult_toom3
but you could edit the code to confirm 500k! or even 1M! do as well... To me, it looks as if the complexity for Toom-3 was the same as for Karatsuba.Note that if 200K! was to take the same time in both scenarios by sheer coincidence, then I would expect 1M! (with over 5✕ more digits) to show a clear difference, which it does not.
Last but not least, the gain I measure for 453256100k is not reproduced for 1M!, although 1M! is considerably bigger (more than 5 million digits) than 453256100k (only 565.6 thousand digits).
I fail to understand why it appears that my Toom-3 algorithm has the same complexity as Karatsuba.