the multi-threading of adding all the values from a random generated vector

Question 1

I currently am creating a vector and fill it with random values( for this case the random values will be 1) once the vector is created i want to calculate the total sum which should be 100000000, I am using ones to test that the addition was done correctly.

My goal is to make this code more efficient and run faster, I would appreciate any suggestions

the code needs to utilize multithreading

#include <chrono>
#include <iostream>
#include <mutex>
#include <random>
#include <thread>
#include <vector>
#include <atomic>
// Constant vector size; 100,000,000
constexpr long long SIZE = 100000000;
// Dividing the vector into 4 equal parts
// first quarter: 0 --> size/4
constexpr long long FIRST_PART = (SIZE /4);
// second quarter: size/4 --> size/2
constexpr long long SECOND_PART = 2*(SIZE / 4);
// third quarter: size/2 --> 3*(size/4)
constexpr long long THIRD_PART = 3*(SIZE / 4);
// fourth quarter: 3*(size/4) --> end (size)
constexpr long long FOURTH_PART = SIZE;
// Global mutex to guard the shared variable
std::mutex myMutex;

Here is my function to total the vector

void findSum(unsigned long long& sum,
 const std::vector<int>& myVector,
 unsigned long long begin,
 unsigned long long end) {
 for (auto it = begin; it < end; ++it) {
 // create a lock to guard the shared variable
 // no need to unlock the lock "myLock" when using luck_guard
 // it is unlocked implicitly after executing the statement.
 std::lock_guard<std::mutex> myLock(myMutex);
 //std::atomic<std::atomic<int64_t>> myattempt();
 // what is a mutex
 //mutex xmutex;
 sum += myVector[it];
 }
}

Below is my main function where i try to use multi-threading

int main() {
 // start timers
 auto startTime = std::chrono::system_clock::now();
 auto startTotalTime = std::chrono::system_clock::now();
 //std::cout << "Start variable data type is: " << typeid(start).name() << std::endl;
 std::cout << "Creating the Vector" << std::endl;
 // create a vector of random variables
 std::vector<int> randValues;
 randValues.reserve(SIZE);
 
 
 std::cout << "Filling the Vector" << std::endl;
 // use the Mersenne Twister algorithm to generate random numbers
 std::mt19937 rndEngine;
 std::uniform_int_distribution<> uniformDist(1, 1);
 for (long long i = 0; i < SIZE; i++)
 randValues.push_back(uniformDist(rndEngine));
 
 // stop the timer and calculate the elapsed time
 std::chrono::duration<double> elapsedTime = std::chrono::system_clock::now() - startTime;
 std::cout << "Time for creating and filling the vector: " << elapsedTime.count() << " seconds" << std::endl;
 std::cout << "Starting the work ... Calculating the sum ..." << std::endl;
 unsigned long long totalSum = 0;
 // reset the timer
 startTime = std::chrono::system_clock::now();
 // finding the sum without using threads
 //findSum(std::ref(totalSum), std::ref(randValues), 0, SIZE);
 
 std::thread t1(findSum, std::ref(totalSum), std::ref(randValues), 0, FIRST_PART);
 std::thread t2(findSum, std::ref(totalSum), std::ref(randValues), FIRST_PART, SECOND_PART);
 std::thread t3(findSum, std::ref(totalSum), std::ref(randValues), SECOND_PART, THIRD_PART);
 std::thread t4(findSum, std::ref(totalSum), std::ref(randValues), THIRD_PART, FOURTH_PART);
 t1.join();
 t2.join();
 t3.join();
 t4.join();
 
 // stop the timer and get the elapsed time
 elapsedTime = std::chrono::system_clock::now() - startTime;
 std::cout << "Time for addition " << elapsedTime.count() << " seconds" << std::endl;
 std::cout << "Total Sum Result: " << totalSum << std::endl;
 // stop the total timer and calculate the elapsed time for the entire program
 std::chrono::duration<double> elapsedTotalTime = std::chrono::system_clock::now() - startTotalTime;
 std::cout << "--------------------------------------------------------------" << std::endl;
 std::cout << "--------------------------------------------------------------" << std::endl;
 std::cout << "The time for completing the entire task is: " << elapsedTotalTime.count() << " seconds" << std::endl;
 std::cout << "--------------------------------------------------------------" << std::endl;
 std::cout << "--------------------------------------------------------------" << std::endl;
 
 return 0;
}

Question 2

Is the code working as expected, if so then you might want to remove the word try or trying.

Question 3

What's the point of adding up random numbers? If you need to get that sum with the same distribution, that can be done in O(1) analytically

Question 4

Disclaimer:I didn't look deep into possible problems.

The way it is currently coded defeats the purpose of multithreading: the thread's access to the global sum is essentially sequential. You'd be in a better shape if each thread computed - and returned - the sum over its own domain, and the main thread added the four returned sum to the grand total.

Would it be faster than the naive single-threaded summation? I don't know. Perhaps yes, perhaps no. The pattern of access to the array looks very cache-unfriendly, possibly incurring too many cache misses. You already have the profiling infrastucture in place, so go an extra mile and compare the implementations.

Question 5

// Global mutex to guard the shared variable
std::mutex myMutex;

In general, prefer to put names inside namespace, local or class scope. Make a name global only if you really have to, because a bunch of poorly kept global names may cause subtle definition-related ambiguities, like odr violation.

constexpr long long FOURTH_PART = SIZE;

Concurrent execution efficiency is bounded by the number of hardware threads varying for different platforms. A program is less efficient when the number of program execution threads is greater than the number of available hardware threads due to the additional thread scheduling overhead. We should first obtain the optimal number of threads

unsigned int nthreads = std::thread::hardware_concurrency();

The result may be zero. In which case, I would either manually check the platform specific features, or simply assume the single threaded case.

Note that the program actually occupies five execution threads: one main and four summating threads.

Now the program is pretty useless because we have the hard-coded input data SIZE. Otherwise, we would have to additionally define per-thread sizes of data chunks.

Types of myVector elements and sum mismatch. Signed-to-unsigned conversion produces incorrect summation result for negative elements. It is important to note because conversion is performed inside the function that is not responsible for the vector contents.

unsigned long long is pretty wordy. std::size_t is the proper unsigned integer type used for indexing over containers because it represents any index value. We could also use other integral types as long as we could guarantee that all the used index values lay inside the types' domains.

A possible implementation may be like this

void findSum(std::vector<int>::const_iterator b,
 std::vector<int>::const_iterator e,
 long long& partialSum)
{
 long long sum = 0;
 for (; b != e; ++b)
 sum += *b;
 partialSum = sum;
}
long long sum(const std::vector<int>& v, const std::size_t minChunkSize = 1000)
{
 const unsigned int nthreads = std::thread::hardware_concurrency();
 std::vector<std::thread> threads;
 std::vector<long long> partialSums;
 partialSums.resize(nthreads); // a bit of redundancy for simplicity
 std::size_t chunkSize = v.size() / (nthreads == 0 ? 1 : nthreads);
 chunkSize = std::max(chunkSize, minChunkSize);
 auto b = std::begin(v);
 for (std::size_t size = 0, i = 0; size + chunkSize < v.size(); size += chunkSize) {
 threads.push_back(std::thread{ findSum, b, b + chunkSize, std::ref(partialSums[i++]) });
 b += chunkSize;
 }
 long long totalSum = 0;
 findSum(b, std::end(v), totalSum); // use main thread too
 for (std::thread& t : threads)
 t.join();
 for (long long ps : partialSums)
 totalSum += ps;
 return totalSum;
}

The chunk size may be chosen heuristically.

vnp vnp 58.6k4 gold badges55 silver badges144 bronze badges · Answer 1 · 2022-09-26 02:47:53Z

Disclaimer:I didn't look deep into possible problems.

The way it is currently coded defeats the purpose of multithreading: the thread's access to the global sum is essentially sequential. You'd be in a better shape if each thread computed - and returned - the sum over its own domain, and the main thread added the four returned sum to the grand total.

Would it be faster than the naive single-threaded summation? I don't know. Perhaps yes, perhaps no. The pattern of access to the array looks very cache-unfriendly, possibly incurring too many cache misses. You already have the profiling infrastucture in place, so go an extra mile and compare the implementations.

panik panik 6183 silver badges7 bronze badges · Answer 2 · 2022-10-04 23:11:22Z

// Global mutex to guard the shared variable
std::mutex myMutex;

In general, prefer to put names inside namespace, local or class scope. Make a name global only if you really have to, because a bunch of poorly kept global names may cause subtle definition-related ambiguities, like odr violation.

constexpr long long FOURTH_PART = SIZE;

Concurrent execution efficiency is bounded by the number of hardware threads varying for different platforms. A program is less efficient when the number of program execution threads is greater than the number of available hardware threads due to the additional thread scheduling overhead. We should first obtain the optimal number of threads

unsigned int nthreads = std::thread::hardware_concurrency();

The result may be zero. In which case, I would either manually check the platform specific features, or simply assume the single threaded case.

Note that the program actually occupies five execution threads: one main and four summating threads.

Now the program is pretty useless because we have the hard-coded input data SIZE. Otherwise, we would have to additionally define per-thread sizes of data chunks.

Types of myVector elements and sum mismatch. Signed-to-unsigned conversion produces incorrect summation result for negative elements. It is important to note because conversion is performed inside the function that is not responsible for the vector contents.

unsigned long long is pretty wordy. std::size_t is the proper unsigned integer type used for indexing over containers because it represents any index value. We could also use other integral types as long as we could guarantee that all the used index values lay inside the types' domains.

A possible implementation may be like this

void findSum(std::vector<int>::const_iterator b,
 std::vector<int>::const_iterator e,
 long long& partialSum)
{
 long long sum = 0;
 for (; b != e; ++b)
 sum += *b;
 partialSum = sum;
}
long long sum(const std::vector<int>& v, const std::size_t minChunkSize = 1000)
{
 const unsigned int nthreads = std::thread::hardware_concurrency();
 std::vector<std::thread> threads;
 std::vector<long long> partialSums;
 partialSums.resize(nthreads); // a bit of redundancy for simplicity
 std::size_t chunkSize = v.size() / (nthreads == 0 ? 1 : nthreads);
 chunkSize = std::max(chunkSize, minChunkSize);
 auto b = std::begin(v);
 for (std::size_t size = 0, i = 0; size + chunkSize < v.size(); size += chunkSize) {
 threads.push_back(std::thread{ findSum, b, b + chunkSize, std::ref(partialSums[i++]) });
 b += chunkSize;
 }
 long long totalSum = 0;
 findSum(b, std::end(v), totalSum); // use main thread too
 for (std::thread& t : threads)
 t.join();
 for (long long ps : partialSums)
 totalSum += ps;
 return totalSum;
}

The chunk size may be chosen heuristically.

Stack Exchange Network

the multi-threading of adding all the values from a random generated vector

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

the multi-threading of adding all the values from a random generated vector

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions