I have to find the algorithm with the best possible complexity for this task.
Input: indexes first and last, and an array.
Output: the sum of the integers that lie between positions first and last (inclusive) once the array is sorted.
The numbers in the array are distinct (and can be negative)!
For example: Input: first = 3, last = 7, array = {5,4,2,6,8,9,0,-1,3}
Output: 26 (3+4+5+6+8)
What I've tried:
We can simply sort the array and compute the sum; that is O(n log n).
Alternatively, we can compare the indexes first and last against the array length, and remove the corresponding number of maximum or minimum elements from the total sum of the array.
For example: compute the sum of the (n - last) maximum integers and the sum of the first minimum integers, and subtract both from the total sum. However, this is not always a good idea, because finding that many maximum or minimum integers can itself be expensive. Of course, I can make some improvements, such as working out when it is better to take the sum of the (n - last) maximum numbers rather than only the last maximum numbers.
What I'm asking is whether there is a better solution to this problem than solving some equations and writing a huge number of ifs to improve it.
2 Answers
Take a look at the std::nth_element algorithm, which separates the "first N" elements from the "elements past N" without doing the extra work of sorting inside the two partitions.
For your purposes, you'll need to call nth_element twice. The second call will be on one of the partitions created in the first step, not on the whole array. At the end, you'll have three partitions:
- Elements less than those you need
- Elements you need
- Elements greater than those you need
and it typically does this in linear time, although the worst case is still O(N lg N).
- This is an awesome approach. The benchmarks back it up as well. – Joseph Wood, Apr 1, 2018 at 23:55
Here is an approach that is faster than the proposed solution by the OP. Though not as elegant or general as the excellent solution provided by @BenVoigt, it is almost as fast.
#include <algorithm>
#include <cstdlib>
#include <vector>

double boundedSumJoe(std::vector<int> x, int lower, int upper) {
    int myMax = *std::max_element(x.begin(), x.end());
    int offSet = std::abs(*std::min_element(x.begin(), x.end())) + 1;
    unsigned long int myRange;
    if (myMax > 0)
        myRange = myMax + offSet;  // E.g. if myMax = 10 & myMin = -2, then myRange = 13
    else
        myRange = offSet;
    offSet--;
    std::vector<int> frequency(myRange, 0);
    std::vector<int> values(myRange, 0);
    std::vector<int>::iterator it, itEnd = x.end();
    int myIndex;
    double mySum = 0;
    for (it = x.begin(); it < itEnd; it++) {
        myIndex = *it + offSet;
        frequency[myIndex]++;
        values[myIndex] = *it;
    }
    int count = 0;
    bool firstHit = true;
    for (std::size_t j = 0; j < myRange; j++) {
        if (frequency[j]) {
            if (count >= lower) {
                if (count <= upper) {
                    firstHit = false;
                    mySum += values[j] * frequency[j];
                } else {
                    if ((count - upper) > 1) {
                        int k = j - 1;
                        while (!frequency[k]) {k--;}
                        mySum -= (values[k] * (count - upper - 1));
                    }
                    break;
                }
            }
            count += frequency[j];
            if ((count - lower) >= 1 && firstHit) {
                firstHit = false;
                mySum += (values[j] * (count - lower));
            }
        }
    }
    return mySum;
}
We first create two vectors large enough to span the entire range of input values. One of them keeps the values from the input vector and the other keeps a tally of each value (the frequency vector above). Elements are added in order, since each index is derived from the value itself.
We then loop over the frequency vector and sum up the values between our two bounds. Thanks to the suggestions by @BenVoigt, the above can now handle input vectors with duplicate values. As you can see, some care is needed at the edges (hence the additional if ((count - upper) > 1) as well as the lines following if ((count - lower) >= 1 && firstHit)).
Here are some very basic benchmarks that truly show the power of the solution provided by @BenVoigt. First off, here is an implementation of the OP's approach and an implementation using std::nth_element.
double boundedSumOP(std::vector<int> x, int lower, int upper) {
    double mySum = 0;
    std::sort(x.begin(), x.end());
    std::vector<int>::iterator it, itEnd = x.begin() + upper;
    for (it = x.begin() + lower; it <= itEnd; it++)
        mySum += *it;
    return mySum;
}
double boundedSumBen(std::vector<int> x, int lower, int upper) {
    double mySum = 0;
    // First partition the vector at the larger bound
    std::nth_element(x.begin(), x.begin() + upper, x.end());
    // Now partition the lower portion of the above at the lower bound
    std::nth_element(x.begin(), x.begin() + lower, x.begin() + upper);
    std::vector<int>::iterator it, itEnd = x.begin() + upper;
    for (it = x.begin() + lower; it <= itEnd; it++)
        mySum += *it;
    return mySum;
}
Here is the main function used for testing (somewhat crude, I might add):
#include <algorithm>
#include <ctime>
#include <iostream>
#include <numeric>
#include <random>
#include <set>
#include <vector>

int main() {
    std::vector<int> v(200001);
    std::random_device rd;
    std::mt19937 gen(rd());
    std::iota(v.begin(), v.end(), -100000);
    std::shuffle(v.begin(), v.end(), gen);
    // random sample without replacement
    std::vector<int> randVec(v.begin(), v.begin() + 100000);
    int val1, val2, val3;
    std::clock_t start_time, end_time;

    start_time = clock();
    for (std::size_t i = 0; i < 100; i++)
        val1 = boundedSumBen(randVec, 49900, 50100);
    end_time = clock();
    std::cout << "time taken on sample w/o rep std::nth_element : " <<
        end_time - start_time << std::endl;

    start_time = clock();
    for (std::size_t i = 0; i < 100; i++)
        val2 = boundedSumJoe(randVec, 49900, 50100);
    end_time = clock();
    std::cout << "time taken on sample w/o rep indexing method by Joe : " <<
        end_time - start_time << std::endl;

    start_time = clock();
    for (std::size_t i = 0; i < 100; i++)
        val3 = boundedSumOP(randVec, 49900, 50100);
    end_time = clock();
    std::cout << "time taken on sample w/o rep naive approach with std::sort : " <<
        end_time - start_time << std::endl;

    std::cout << "All functions on sample w/o rep return the same value of : " <<
        val1 << ", " << val2 << ", and " << val3 << std::endl;

    // Now we test a random sample with replacement
    std::uniform_int_distribution<int> distribution(-100000, 100000);
    for (std::size_t i = 0; i < 100000; i++)
        randVec[i] = distribution(gen);

    start_time = clock();
    for (std::size_t i = 0; i < 100; i++)
        val1 = boundedSumBen(randVec, 9900, 10100);
    end_time = clock();
    std::cout << "time taken on sample with rep std::nth_element : " <<
        end_time - start_time << std::endl;

    start_time = clock();
    for (std::size_t i = 0; i < 100; i++)
        val2 = boundedSumJoe(randVec, 9900, 10100);
    end_time = clock();
    std::cout << "time taken on sample with rep indexing method by Joe : " <<
        end_time - start_time << std::endl;

    start_time = clock();
    for (std::size_t i = 0; i < 100; i++)
        val3 = boundedSumOP(randVec, 9900, 10100);
    end_time = clock();
    std::cout << "time taken on sample with rep naive approach with std::sort : " <<
        end_time - start_time << std::endl;

    std::cout << "All functions on sample with rep return the same value of : " <<
        val1 << ", " << val2 << ", and " << val3 << std::endl;
    std::cout << "Number of unique elements in vector with replacement "
        << std::set<int>(randVec.begin(), randVec.end()).size()
        << std::endl;
    return 0;
}
And the results on my computer (I'm using clang++):
time taken on sample w/o rep std::nth_element : 109925
time taken on sample w/o rep indexing method by Joe : 110162
time taken on sample w/o rep naive approach with std::sort : 581368
All functions on sample w/o rep return the same value of : 38849, 38849, and 38849
time taken on sample with rep std::nth_element : 93542
time taken on sample with rep indexing method by Joe : 102780
time taken on sample with rep naive approach with std::sort : 517273
All functions on sample with rep return the same value of : -16069147, -16069147, and -16069147
Number of unique elements in vector with replacement 78605
As you can see, employing std::nth_element as provided by @BenVoigt is superior in terms of both speed and generality, while the indexing method is still quite a bit faster than the naive approach.
- Here are the results from ideone (running gcc).
- Note that the bitmask method can't handle repeated entries. Building a histogram that can represent repeats is not difficult (replace myBools[myIndex] = 1; with myBools[myIndex]++; and probably rename the variable to be more accurate), but the summation code becomes more complicated. – Ben Voigt, Apr 2, 2018 at 2:25
- Also your use of nth_element appears to be wrong. It certainly differs from what I answered, that "The second call will be on one of the partitions created in the first step, not the whole array." Since you got the correct result, I'm suspicious that your test case might not be a valid test. – Ben Voigt, Apr 2, 2018 at 2:33
- @BenVoigt, thanks for your suggestion of building a histogram. I will definitely keep that in mind. Also, sorry for mucking up the nth_element call. I have corrected it. Interestingly enough, before I corrected it and ran it on ideone, I was not getting the same values as the other two methods, but when I ran it on my own machine, it was working fine. I'm guessing the difference in compilers has something to do with it. Anywho, I really appreciate your comments and all of your great answers. I have learned so much reading your posts on C++. – Joseph Wood, Apr 2, 2018 at 7:21
You could also use an ordered set. I think this would be more efficient.