I'm looking for an idea to efficiently solve following problem:
I have a set of pairs of ranges (range = a pair of numbers), each range is unique (but has same size) e.g.
[
[(0,6),(34,40)],
[(1,7),(35,41)],
[(3,9),(12,18)],
[(2,8),(36,42)],
[(13,19),(22,28)],
[(23,29),(14,20)]
]
Now I'd like to combine pairs of ranges, if ranges are overlapping e.g.
[(0,6),(34,40)] overlaps with [(1,7),(35,41)] -result-> [(0,7),(34,41)]
So as a result for above set I'd like to get (now each pair may have ranges of different size)
[
[(0,8),(34,42)],
[(3,9),(12,18)],
[(13,20),(22,29)]
]
The set might be pretty big, I'd like to avoid quadratic complexity if possible.
EDIT: My best idea so far (in Python) is below. I'd like to know if you know a better(faster) way. Also I'm not sure if my idea of removing already combined pairs is valid:
def ranges_overlap(range1, range2):
return range1[0] < range2[1] and range2[0] < range1[1]
def combine_pairs(pair1, pair2):
return [(min(r1[0], r2[0]), max(r1[1], r2[1])) for r1, r2 in zip(pair1, pair2)]
def combine_overlapping_pairs(pairs):
combined = []
while pairs:
pair1 = pairs.pop()
already_combined = []
for pair2 in pairs:
for pair2_perm in itertools.permutations(pair2):
does_overlap = True
for range1, range2 in zip(pair1, pair2_perm):
if not ranges_overlap(range1, range2):
does_overlap = False
break
if does_overlap:
pair1 = combine_pairs(pair1, pair2_perm)
already_combined.append(pair2)
break
combined.append(pair1)
# Not sure if I can do that
for pair in already_combined:
pairs.remove(pair)
return combined
2 Answers 2
Third time's the charm, I hope. I combined my original idea to sort the list first with Snowman's suggestion of using set theory. The basic algorithm is:
- Sort the pairs in lexicographic order by the first range.
- Group adjacent pairs by if just the first ranges overlap. I prove below that lexicographically sorted ranges will never overlap without being adjacent, which allows this step to be done in
O(n)
. This creates sets where the first ranges overlap. - Within each of those sets, sort again by the second range, and group by if the second ranges overlap. This splits sets where the first ranges overlap into sets where the second ranges also overlap.
- Within each set of sets, repeatedly combine pairs until you have one merged pair.
Here's an O(n log n)
implementation in Haskell:
import Data.List
import Data.Tuple
type Range = (Int, Int)
type RangePair = (Range, Range)
rangesOverlap :: Range -> Range -> Bool
rangesOverlap (a,b) (c,d) =
c <= a && a <= d ||
c <= b && b <= d ||
a <= c && c <= b ||
a <= d && d <= b
pairsOverlap :: RangePair -> RangePair -> Bool
pairsOverlap (a, b) (c, d) = rangesOverlap a c
combineRanges :: Range -> Range -> Range
combineRanges (a,b) (c,d) = (min a c, max b d)
combinePairs :: RangePair -> RangePair -> RangePair
combinePairs (a, b) (c, d) = (combineRanges a c, combineRanges b d)
combineOverlapping :: [RangePair] -> [RangePair]
combineOverlapping = map swap . concat .
map (combineSets. makeSet . map swap) . makeSet
where makeSet = groupBy pairsOverlap . sort
combineSets = map (foldl1 combinePairs)
We wish to prove that if the ranges are lexicographically sorted, all the overlapping ranges will occur in adjacent rows. Assume a counterexample exists where overlapping ranges are not in adjacent rows. It will take the form of:
(a, b)
(c, d)
(e, f)
where (a,b)
and (e,f)
overlap with each other, but (c,d)
does not overlap with either.
In order for (c,d)
not to overlap with (a,b)
, d
must be less than a
or b
must be less than c
. c <= d
because it is a range, and a <= c
because of
lexicographical sorting, therefore d >= a
. e <= b
because of overlapping,
and c <= e
because of lexicographical sorting, therefore c <= b
. Therefore,
if (a,b)
and (e,f)
overlap, then (a,b)
and (c,d)
always overlap, and no
counterexample exists.
-
1Not quite. [(1,3),(8,10)],[(2,4),(6,7)],[(3,4),(9,11)] - the first and third overlap but the first and second doesn't even though they're in sorted order.slebetman– slebetman2014年09月08日 22:16:04 +00:00Commented Sep 8, 2014 at 22:16
-
That's an excellent catch. I think there's still a solution in there somewhere if you run it twice, once sorted on the first tuple and once on the second. The merge is tricky, though. I'll ponder on it.Karl Bielefeldt– Karl Bielefeldt2014年09月08日 22:31:02 +00:00Commented Sep 8, 2014 at 22:31
-
Hrmm... This sorting thing feels like a trap. A hunch tells me that the two pairs may need to be treated as independent dimensions, so a topological sort or something may be needed to fully make it go...J Trana– J Trana2014年09月09日 01:47:12 +00:00Commented Sep 9, 2014 at 1:47
-
See my new answer, I believe looking at this using set theory will produce more accurate results, but the performance might not be optimal.user22815– user228152014年09月09日 03:34:14 +00:00Commented Sep 9, 2014 at 3:34
-
He already knows how to solve it quadratically. You have to sort it somehow to do better.Karl Bielefeldt– Karl Bielefeldt2014年09月09日 03:57:46 +00:00Commented Sep 9, 2014 at 3:57
I will try my hand at an answer, and fully expect to hear what's wrong with it soon. :)
So first, I think the problem may be more easily tackled by thinking of the pair of ranges instead as defining a rectangle. Suppose you have [(0,2), (3,4)]. Another way to view this is the rectangle in Cartesian coordinates of (0,3) to (2,4). Overlap of both ranges can then be thought of as rectangle-to-rectangle intersection.
With that in mind, I think a spatial structure is possibly most appropriate. The path my tired brain can conjure up is to use an R-tree or one of its variants. Finding the union of rectangles then becomes similar to building the tree. At each step, use the rectangle being added first as a search. If it finds any rectangles, replace all found rectangles with the "union" rectangle - the min/max of x/y coordinates, else add the rectangle as is. Once the tree is built, simply iterate through the rectangles and voila, convert them back to your ranges.
The unfortunate reality is that the runtime is not going to be easily analyzable. Let's note some of the potential issues:
- R-trees themselves do not guarantee good worst-case performance (as noted by Wikipedia)
- When adding a new rectangle, the number of overlapping rectangles plays a role in the complexity of the step. This seems to imply density is a sneaky factor.
- R-trees are likely complex enough that any constant factor is probably pretty high, which may mean that other methods win out depending on your data set size.
Given a large enough set, I believe this approach would win out. I believe it to be quadratic only in "bad" cases. But R-trees are not simple data structures, so it needs to be thought through carefully. Good luck!
-
With me being not familiar with R-tress and you writing these are not simple data structures, makes me think that this is too complicated for me. Maybe I'll try it when I understand these structures. Thanks for an answer!miszczu– miszczu2014年09月09日 16:02:52 +00:00Commented Sep 9, 2014 at 16:02
-
Ah, well don't be scared by R-trees. I meant more that they're not simple on the inside but not that the API to use them isn't simple. I'm not a Python guy but this didn't look too hard if you're ever inclined to give it a shot.J Trana– J Trana2014年09月10日 05:08:30 +00:00Commented Sep 10, 2014 at 5:08
(0,8)
and(3,9)
combine? Do both pairs have to overlap?