Searching through a huge set with threads

Question 1

I wrote a class that finds values in a huge search volume. I have a pretty strong feeling that my code is not good and would like to get your opinion on it. My experience with C++11 threading is very limited and I think I use the wrong features for my task.

#include <iostream>
#include <vector>
#include <utility>
#include <future>
class Searcher
{
public:
 std::future<bool> Find(int x) {
 m_searches.emplace_back();
 m_searches.rbegin()->first = x;
 return m_searches.rbegin()->second.get_future();
 }
 void Search()
 {
 // very large search data
 static const std::vector<int> haystack = { 4, 5, 1, 3, 9, 8, 7, 6, 11, 42, 13 };
 std::vector<int> foundIndices;
 // find values
 for (auto h : haystack) {
 for (size_t i = 0; i < m_searches.size(); i++) {
 if (m_searches[i].first == h) {
 foundIndices.push_back(i);
 m_searches[i].second.set_value(true);
 }
 }
 }
 // resolve promises, where no value was found
 for (size_t i = 0; i < m_searches.size(); i++) {
 bool found = false;
 for (auto j : foundIndices)
 if (j == i)
 found = true;
 if (!found)
 m_searches[i].second.set_value(false);
 }
 }
private:
 std::vector<std::pair<int, std::promise<bool>>> m_searches;
};
int main()
{
 Searcher s;
 auto s1 = s.Find(1);
 auto s2 = s.Find(2);
 auto s3 = s.Find(3);
 s.Search();
 std::cout << s1.get() << std::endl;
 std::cout << s2.get() << std::endl;
 std::cout << s3.get() << std::endl;
 return 0;
}

There should be one search running for many Finds, because the set that is searched is very large. It could however run concurrently to split the search up to multiple threads. For example with two threads: The first thread searches the first half, the second thread searches the second half.

The futures are not really guaranteed to be valid, because the user has to call the Search function. Would it be possible to start the search, once the first .get was called?

My "algorithm" to fulfil the promises looks very ugly and complicated, but I can't think of a better way.

Question 2

There are definitely some problems if you want to make this threaded. Firstly, access to m_searches is not protected in any way. Since this is a shared bit of memory, it would need to be protected by something (say a mutex) before you modify it.

Generally in situations like these, it's far better to let std::async do a lot of the hard work for you. Also, try and eliminate shared mutable state as much as possible, because it makes life a lot harder. Firstly, let's write a simple search function that'll return an index if we find a match:

const std::size_t not_found = -1;
template <typename Iterator, typename T>
std::size_t find_index(Iterator begin, Iterator end, 
 const T& value, std::size_t start)
{
 std::size_t s = start;
 Iterator it = begin;
 while(it != end) {
 if(*it == value) {
 return s;
 }
 ++s;
 ++it;
 }
 return not_found;
}

Ok, this is just our normal linear search, but it allows us to start from anywhere in our search range since we're only passing in a few iterators.

Next, we define a vector to search over:

std::vector<int> v { 4, 5, 1, 3, 9, 8, 7, 6, 11, 42, 13 };
typedef typename std::vector<int>::iterator iter_t;

Create our vector of future to hold results:

std::vector<std::future<std::size_t>> futures;

And then populate it:

futures.emplace_back(
 std::async(std::launch::async, find_index<iter_t, int>, 
 v.begin(), v.begin() + 6, 8, 0));
futures.emplace_back(
 std::async(std::launch::async, find_index<iter_t, int>, 
 v.begin() + 6, v.end(), 8, 6));

Note the std::launch::async flag specifies we want to create a new thread to run our function in. I've hardcoded the search start and end places here, but you can obviously add some logic to split things up evenly if you'd like. Also, note that we need to explicitly define the template types here, since they won't be deduced.

Now we'll set up a vector to store our results in, and get the results back:

std::vector<std::size_t> results;
for(auto& future : futures) {
 results.emplace_back(future.get());
}

Finally, print out anything that was found:

std::for_each(results.begin(), results.end(),
 [](std::size_t index)
 { if(index != not_found) std::cout << index << "\n"; });
}

The whole thing can be found here.

Hopefully this gives you an idea of a different way of approaching your problem.

Question 3

Thanks you for your answer. Your code breaks my constraint that there should be only one search running for multiple things that have to be found. The volume that I search through is so big, that it should only be traversed once like in my given example code.

Yuushi Yuushi 11.1k2 gold badges31 silver badges66 bronze badges · Answer 1 · 2013-08-20 01:29:52Z

There are definitely some problems if you want to make this threaded. Firstly, access to m_searches is not protected in any way. Since this is a shared bit of memory, it would need to be protected by something (say a mutex) before you modify it.

Generally in situations like these, it's far better to let std::async do a lot of the hard work for you. Also, try and eliminate shared mutable state as much as possible, because it makes life a lot harder. Firstly, let's write a simple search function that'll return an index if we find a match:

const std::size_t not_found = -1;
template <typename Iterator, typename T>
std::size_t find_index(Iterator begin, Iterator end, 
 const T& value, std::size_t start)
{
 std::size_t s = start;
 Iterator it = begin;
 while(it != end) {
 if(*it == value) {
 return s;
 }
 ++s;
 ++it;
 }
 return not_found;
}

Ok, this is just our normal linear search, but it allows us to start from anywhere in our search range since we're only passing in a few iterators.

Next, we define a vector to search over:

std::vector<int> v { 4, 5, 1, 3, 9, 8, 7, 6, 11, 42, 13 };
typedef typename std::vector<int>::iterator iter_t;

Create our vector of future to hold results:

std::vector<std::future<std::size_t>> futures;

And then populate it:

futures.emplace_back(
 std::async(std::launch::async, find_index<iter_t, int>, 
 v.begin(), v.begin() + 6, 8, 0));
futures.emplace_back(
 std::async(std::launch::async, find_index<iter_t, int>, 
 v.begin() + 6, v.end(), 8, 6));

Note the std::launch::async flag specifies we want to create a new thread to run our function in. I've hardcoded the search start and end places here, but you can obviously add some logic to split things up evenly if you'd like. Also, note that we need to explicitly define the template types here, since they won't be deduced.

Now we'll set up a vector to store our results in, and get the results back:

std::vector<std::size_t> results;
for(auto& future : futures) {
 results.emplace_back(future.get());
}

Finally, print out anything that was found:

std::for_each(results.begin(), results.end(),
 [](std::size_t index)
 { if(index != not_found) std::cout << index << "\n"; });
}

The whole thing can be found here.

Hopefully this gives you an idea of a different way of approaching your problem.

Thanks you for your answer. Your code breaks my constraint that there should be only one search running for multiple things that have to be found. The volume that I search through is so big, that it should only be traversed once like in my given example code.

Stack Exchange Network

Searching through a huge set with threads

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Searching through a huge set with threads

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions