I currently am creating a vector and fill it with random values( for this case the random values will be 1) once the vector is created i want to calculate the total sum which should be 100000000, I am using ones to test that the addition was done correctly.
My goal is to make this code more efficient and run faster, I would appreciate any suggestions
the code needs to utilize multithreading
#include <chrono>
#include <iostream>
#include <mutex>
#include <random>
#include <thread>
#include <vector>
#include <atomic>
// Constant vector size; 100,000,000
constexpr long long SIZE = 100000000;
// Dividing the vector into 4 equal parts
// first quarter: 0 --> size/4
constexpr long long FIRST_PART = (SIZE /4);
// second quarter: size/4 --> size/2
constexpr long long SECOND_PART = 2*(SIZE / 4);
// third quarter: size/2 --> 3*(size/4)
constexpr long long THIRD_PART = 3*(SIZE / 4);
// fourth quarter: 3*(size/4) --> end (size)
constexpr long long FOURTH_PART = SIZE;
// Global mutex to guard the shared variable
std::mutex myMutex;
Here is my function to total the vector
void findSum(unsigned long long& sum,
const std::vector<int>& myVector,
unsigned long long begin,
unsigned long long end) {
for (auto it = begin; it < end; ++it) {
// create a lock to guard the shared variable
// no need to unlock the lock "myLock" when using luck_guard
// it is unlocked implicitly after executing the statement.
std::lock_guard<std::mutex> myLock(myMutex);
//std::atomic<std::atomic<int64_t>> myattempt();
// what is a mutex
//mutex xmutex;
sum += myVector[it];
}
}
Below is my main function where i try to use multi-threading
int main() {
// start timers
auto startTime = std::chrono::system_clock::now();
auto startTotalTime = std::chrono::system_clock::now();
//std::cout << "Start variable data type is: " << typeid(start).name() << std::endl;
std::cout << "Creating the Vector" << std::endl;
// create a vector of random variables
std::vector<int> randValues;
randValues.reserve(SIZE);
std::cout << "Filling the Vector" << std::endl;
// use the Mersenne Twister algorithm to generate random numbers
std::mt19937 rndEngine;
std::uniform_int_distribution<> uniformDist(1, 1);
for (long long i = 0; i < SIZE; i++)
randValues.push_back(uniformDist(rndEngine));
// stop the timer and calculate the elapsed time
std::chrono::duration<double> elapsedTime = std::chrono::system_clock::now() - startTime;
std::cout << "Time for creating and filling the vector: " << elapsedTime.count() << " seconds" << std::endl;
std::cout << "Starting the work ... Calculating the sum ..." << std::endl;
unsigned long long totalSum = 0;
// reset the timer
startTime = std::chrono::system_clock::now();
// finding the sum without using threads
//findSum(std::ref(totalSum), std::ref(randValues), 0, SIZE);
std::thread t1(findSum, std::ref(totalSum), std::ref(randValues), 0, FIRST_PART);
std::thread t2(findSum, std::ref(totalSum), std::ref(randValues), FIRST_PART, SECOND_PART);
std::thread t3(findSum, std::ref(totalSum), std::ref(randValues), SECOND_PART, THIRD_PART);
std::thread t4(findSum, std::ref(totalSum), std::ref(randValues), THIRD_PART, FOURTH_PART);
t1.join();
t2.join();
t3.join();
t4.join();
// stop the timer and get the elapsed time
elapsedTime = std::chrono::system_clock::now() - startTime;
std::cout << "Time for addition " << elapsedTime.count() << " seconds" << std::endl;
std::cout << "Total Sum Result: " << totalSum << std::endl;
// stop the total timer and calculate the elapsed time for the entire program
std::chrono::duration<double> elapsedTotalTime = std::chrono::system_clock::now() - startTotalTime;
std::cout << "--------------------------------------------------------------" << std::endl;
std::cout << "--------------------------------------------------------------" << std::endl;
std::cout << "The time for completing the entire task is: " << elapsedTotalTime.count() << " seconds" << std::endl;
std::cout << "--------------------------------------------------------------" << std::endl;
std::cout << "--------------------------------------------------------------" << std::endl;
return 0;
}
-
\$\begingroup\$ Is the code working as expected, if so then you might want to remove the word try or trying. \$\endgroup\$pacmaninbw– pacmaninbw ♦2022年09月25日 14:09:29 +00:00Commented Sep 25, 2022 at 14:09
-
\$\begingroup\$ What's the point of adding up random numbers? If you need to get that sum with the same distribution, that can be done in O(1) analytically \$\endgroup\$Reinderien– Reinderien2022年09月25日 16:39:26 +00:00Commented Sep 25, 2022 at 16:39
2 Answers 2
Disclaimer:I didn't look deep into possible problems.
The way it is currently coded defeats the purpose of multithreading: the thread's access to the global sum
is essentially sequential. You'd be in a better shape if each thread computed - and returned - the sum over its own domain, and the main thread added the four returned sum to the grand total.
Would it be faster than the naive single-threaded summation? I don't know. Perhaps yes, perhaps no. The pattern of access to the array looks very cache-unfriendly, possibly incurring too many cache misses. You already have the profiling infrastucture in place, so go an extra mile and compare the implementations.
// Global mutex to guard the shared variable
std::mutex myMutex;
In general, prefer to put names inside namespace, local or class scope. Make a name global only if you really have to, because a bunch of poorly kept global names may cause subtle definition-related ambiguities, like odr violation.
constexpr long long FOURTH_PART = SIZE;
Concurrent execution efficiency is bounded by the number of hardware threads varying for different platforms. A program is less efficient when the number of program execution threads is greater than the number of available hardware threads due to the additional thread scheduling overhead. We should first obtain the optimal number of threads
unsigned int nthreads = std::thread::hardware_concurrency();
The result may be zero. In which case, I would either manually check the platform specific features, or simply assume the single threaded case.
Note that the program actually occupies five execution threads: one main and four summating threads.
Now the program is pretty useless because we have the hard-coded input data SIZE
.
Otherwise, we would have to additionally define per-thread sizes of data chunks.
Types of myVector
elements and sum
mismatch.
Signed-to-unsigned conversion produces incorrect summation result for negative elements.
It is important to note because conversion is performed inside the function that is not responsible for the vector contents.
unsigned long long
is pretty wordy.
std::size_t
is the proper unsigned integer type used for indexing over containers because it represents any index value.
We could also use other integral types as long as we could guarantee that all the used index values lay inside the types' domains.
A possible implementation may be like this
void findSum(std::vector<int>::const_iterator b,
std::vector<int>::const_iterator e,
long long& partialSum)
{
long long sum = 0;
for (; b != e; ++b)
sum += *b;
partialSum = sum;
}
long long sum(const std::vector<int>& v, const std::size_t minChunkSize = 1000)
{
const unsigned int nthreads = std::thread::hardware_concurrency();
std::vector<std::thread> threads;
std::vector<long long> partialSums;
partialSums.resize(nthreads); // a bit of redundancy for simplicity
std::size_t chunkSize = v.size() / (nthreads == 0 ? 1 : nthreads);
chunkSize = std::max(chunkSize, minChunkSize);
auto b = std::begin(v);
for (std::size_t size = 0, i = 0; size + chunkSize < v.size(); size += chunkSize) {
threads.push_back(std::thread{ findSum, b, b + chunkSize, std::ref(partialSums[i++]) });
b += chunkSize;
}
long long totalSum = 0;
findSum(b, std::end(v), totalSum); // use main thread too
for (std::thread& t : threads)
t.join();
for (long long ps : partialSums)
totalSum += ps;
return totalSum;
}
The chunk size may be chosen heuristically.