I have a relatively small algorithm that takes up ~60% of the total run time of my scientific code (57 lines out of 3600), so I would like to find a way to optimize what I'm doing and make the code order-independent so that I can apply a `cilk_for` parallel structure.
Here's what it does, verbally: I have an `std::vector` of pointers to custom objects called `Segment` (`vector<Segment*> newSegment`). Each `Segment` contains a `std::vector` of integers (mesh indices). In this function, I would like to find any `Segment` that overlaps with any other, with overlap being defined as the member `indices` overlapping on the number line. If they do overlap, I would like to join them together (insert `A.indices` into `B.indices`) and delete one (`delete A`).
ex. 1: `A.indices = {1,2,3}`, `B.indices = {4,5,6}` → do not overlap; do nothing

ex. 2: `A.indices = {1,2,4}`, `B.indices = {3,5,6}` → do overlap; `A` is deleted, `B.indices = {1,2,3,4,5,6}`

The overlaps are sparse, but existent.
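Since overlap in this sense depends only on each segment's smallest and largest index, the check reduces to an interval-intersection test on the two `[min, max]` ranges. A minimal sketch (the helper name `rangesOverlap` is mine, not from the code below; it assumes non-empty index vectors):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Two index sets overlap on the number line iff their [min, max]
// ranges intersect. Assumes both vectors are non-empty.
bool rangesOverlap(const std::vector<int>& a, const std::vector<int>& b) {
    auto [aMin, aMax] = std::minmax_element(a.begin(), a.end());
    auto [bMin, bMax] = std::minmax_element(b.begin(), b.end());
    return *aMin <= *bMax && *bMin <= *aMax;
}
```

On the two examples above, `rangesOverlap({1,2,3}, {4,5,6})` is false and `rangesOverlap({1,2,4}, {3,5,6})` is true.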
Here's the current code:
main algorithm:
```cpp
//make sure segments don't overlap
for (unsigned i = 0; i < newSegment.size(); ++i) {
    if (newSegment[i]->size() == 0) continue;
    for (unsigned j = i + 1; j < newSegment.size(); ++j) {
        if (newSegment[i]->size() == 0) continue;
        if (newSegment[j]->size() == 0) continue;
        int i1 = newSegment[i]->begin();
        int i2 = static_cast<int>(newSegment[i]->end());
        int j1 = newSegment[j]->begin();
        int j2 = static_cast<int>(newSegment[j]->end());
        int L1 = abs(i1 - i2);
        int L2 = abs(j1 - j2);
        int dist = max({i1, i2, j1, j2}) - min({i1, i2, j1, j2});
        //if overlap, fold segments together
        //copy indices from shorter segment to taller segment
        if (dist <= L1 + L2) {
            unsigned more, less;
            if (newSegment[i]->slope == newSegment[j]->slope) {
                if (value_max[i] > value_max[j]) {
                    more = i;
                    less = j;
                } else {
                    more = j;
                    less = i;
                }
            } else if (newSegment[i]->size() == 1) {
                more = j; less = i;
            } else if (newSegment[j]->size() == 1) {
                more = i; less = j;
            } else assert(1 == 0);
            while (!newSegment[less]->indices.empty()) {
                unsigned index = newSegment[less]->indices.back();
                newSegment[less]->indices.pop_back();
                newSegment[more]->indices.push_back(index);
            }
        }
    }
}//end overlap check
```
```cpp
//delete empty segments
vector<unsigned> delList;
for (unsigned i = 0; i < newSegment.size(); ++i) {
    if (newSegment[i]->size() == 0) { //delete empty
        delList.push_back(i);
        continue;
    }
}
while (delList.size() > 0) {
    unsigned index = delList.back();
    delete newSegment.at(index);
    newSegment.erase(newSegment.begin() + index);
    delList.pop_back();
}
```
Relevant `Segment` object class definition and member functions:
```cpp
class Segment {
public:
    Segment();
    ~Segment();
    unsigned size();
    int begin();
    unsigned end();
    std::vector<int> indices;
    double slope;
};

int Segment::begin() {
    if (!is_sorted(indices.begin(), indices.end())) std::sort(indices.begin(), indices.end());
    if (indices.size() == 0) return -1;
    return indices[0];
}

unsigned Segment::end() {
    if (!is_sorted(indices.begin(), indices.end())) std::sort(indices.begin(), indices.end());
    return indices.back();
}

unsigned Segment::size() {
    unsigned indSize = indices.size();
    if (indSize == 1) {
        if (indices[0] == -1) return 0;
    }
    return indSize;
}
```
Ideas:

- Since I don't care about the order of the `Segment` objects, they could be in an orderless container?
- In my algorithm, I find overlap by looking at the first and last `indices` of each segment. I do an `std::is_sorted` (and then maybe an `std::sort`) when I fetch the `indices`, because the list can change when more indices are inserted. Maybe I could put the `indices` in a `std::set` rather than a `std::vector` to save the explicit sort-checking/sorting?
- I'm pretty sure that by editing the `indices` as I go, I make the algorithm order-dependent. Perhaps I could break the code into the following organization, using the concept of an undirected graph, to make it order-independent:
  - edge discovery (without modifying `indices`)
  - join clusters of connected nodes (`Segment` objects that overlap) using a graph traversal
  - delete empty `Segment` objects
Questions
- Are either of the ideas above worthwhile or negligible to performance?
- How else can I optimize it?
- How (if not the above) can I make the algorithm order-independent?
2 Answers
The `is_sorted()` function is probably expensive, so you should avoid it. Why not sort everything in one go, right at the beginning, before entering the loops?

The best way to optimize your code is by inventing a new algorithm which avoids the nested loops over N, because those have a complexity of O(N^2) (see "big-O notation"). See Bart van Ingen Schenau's comment below on how to achieve this.
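The sort-by-starting-element idea from that comment can be sketched as follows. This is illustrative only: `mergeOverlaps` is a name I made up, and `vector<vector<int>>` stands in for the asker's `vector<Segment*>`:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch of the O(N log N) approach: sort segments by their smallest
// index, then one linear pass folds any segment that overlaps its
// predecessor into it. Assumes non-empty segments.
std::vector<std::vector<int>> mergeOverlaps(std::vector<std::vector<int>> segs) {
    for (auto& s : segs) std::sort(s.begin(), s.end());
    std::sort(segs.begin(), segs.end(),
              [](const auto& a, const auto& b) { return a.front() < b.front(); });

    std::vector<std::vector<int>> out;
    for (auto& s : segs) {
        if (!out.empty() && out.back().back() >= s.front()) {
            // ranges intersect: fold s into the previous segment
            out.back().insert(out.back().end(), s.begin(), s.end());
            std::sort(out.back().begin(), out.back().end());
        } else {
            out.push_back(std::move(s));
        }
    }
    return out;
}
```

On the question's ex. 2, `mergeOverlaps({{1,2,4}, {3,5,6}})` yields the single segment `{1,2,3,4,5,6}`; on ex. 1 the two segments stay separate.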
- The second `if (newSegment[i]->size() == 0) continue;` statement may, in fact, be needed, since the `indices` can be removed if `i` overlaps with an earlier `j`. What's the performance consequence of sorting everything before entering the loops and after insertion vs. making `indices` a `std::set` instead? – Stershic, Nov 8, 2015 at 18:01
- I think the overlap problem is pretty clear regardless of the rest of my code. I think the pairwise search for overlap is unavoidably O(N^2) (right?), so I want to make it as quick and parallel as possible. – Stershic, Nov 8, 2015 at 18:02
- @Stershic: The search for overlap isn't necessarily O(N^2). If you can change the order of the segments, you could order them by starting element. Then you would only have to look at adjacent segments to detect an overlap (you have an overlap if `Segment[i].end() > Segment[i+1].begin()`). – Bart van Ingen Schenau, Nov 8, 2015 at 19:50
- As for using sets, they theoretically have O(1) access time just as arrays do, but in actuality they are burdened by a non-negligible constant factor, which is not accounted for by big-O notation. In any case, I do not really see how sets could be of any use to you here. – Mike Nakis, Nov 8, 2015 at 20:33
- I would recommend sorting the indices within each segment once in the beginning, and then either re-sorting after a merge or, perhaps even better, merging the segments in such a way that no re-sorting is necessary. (It is easy to do when you know that both arrays are already sorted; see programmers.stackexchange.com/q/267406/41811.) – Mike Nakis, Nov 8, 2015 at 20:34
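The merge-without-re-sorting idea in that last comment can be sketched with `std::inplace_merge`: append one sorted vector to the other, then merge the two sorted halves in linear time instead of re-sorting. `foldSorted` is a hypothetical helper, not part of the asker's code:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Folds the sorted vector `from` into the sorted vector `into`
// without a full re-sort: append, then merge the two sorted runs.
void foldSorted(std::vector<int>& into, const std::vector<int>& from) {
    auto mid = into.size();
    into.insert(into.end(), from.begin(), from.end());
    std::inplace_merge(into.begin(), into.begin() + mid, into.end());
}
```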
I came to the same algorithm as @BartVanIngenSchenau in this comment.

Basically, sort the set of segments based on the min element of each segment. Then two adjacent segments overlap if and only if `Segment[i].max >= Segment[i+1].min`.

But I would like to add that sorting the indices looks unnecessary at all; just keep the max and the min element of each segment, and update them when merging segments: `(segment1+segment2).min = min(segment1.min, segment2.min)` and `(segment1+segment2).max = max(segment1.max, segment2.max)`. Moreover, if the segments are sorted by min element, you have `(Segment[i]+Segment[i+1]).min = Segment[i].min` (but this last thing could be premature optimization). I denote by `+` the merge of two segments.
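That min/max bookkeeping can be sketched as below. `MiniSeg` and `foldInto` are hypothetical stand-ins for the real `Segment` class and its merge step, not the asker's actual types:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in segment that caches its min and max index, so no
// per-access sorting is needed.
struct MiniSeg {
    std::vector<int> indices;
    int min;
    int max;
};

// Folds b into a, updating the cached extremes per the answer's rules:
// (a+b).min = min(a.min, b.min) and (a+b).max = max(a.max, b.max).
void foldInto(MiniSeg& a, MiniSeg& b) {
    a.indices.insert(a.indices.end(), b.indices.begin(), b.indices.end());
    a.min = std::min(a.min, b.min);
    a.max = std::max(a.max, b.max);
    b.indices.clear();
}
```

After `foldInto`, the overlap test against the next segment is just a comparison of two cached integers, with no `is_sorted`/`sort` on access.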
For cache locality, the best layout for merging might be similar to the following:

```
ptr_to_2nd_segment,
n_elt_of_1st_segment,
min_elt_of_1st_segment,
[[other_elts_of_1st_segment,]
 max_elt_of_1st_segment,]
ptr_to_3rd_segment,
n_elt_of_2nd_segment,
min_elt_of_2nd_segment,
[[other_elts_of_2nd_segment,]
 max_elt_of_2nd_segment,]
...
```

Merging two elements in this configuration would be quite simple: just update the pointer to the next element, add the numbers of elements, shift the second segment's elements, and swap the max elements if necessary. That would leave some junk after each merge (8 bytes on a 32-bit architecture and 16 bytes on a 64-bit architecture). Whether you can tolerate such junk is application-dependent (moreover, you could do a kind of garbage collection between two iterations of the algorithm).
For parallelizing: once the set of segments is sorted by min element, you can divide it into n parts and do the merges independently, then only merge at the borders of the parts. But as @MikeNakis says in this comment, since merging is quite memory-bound, it might not parallelize well.
- `std::is_sorted` (link). I don't have any mutator methods per se; I just `push_back` directly to the public `indices` vector (e.g. `newSegment[j]->indices.push_back(i)`). I only `push_back` to `indices` in two other places in the code, so I could `std::sort` after each one, as well as sort when I combine them in this algorithm, and then delete the `is_sorted`/`sort` when accessing.